Compare commits

5761 Commits

Author SHA1 Message Date
2ce56de80e Remove few xfails 2025-05-22 15:18:11 -07:00
34cd5614c5 Fix lint 2025-05-22 15:16:53 -07:00
15d7f6ac2b clean up 2025-05-22 17:32:21 -04:00
7b80b3fd13 Apply suggestions from code review 2025-05-22 14:19:29 -07:00
b0b1902739 Update aten/src/ATen/native/mps/operations/Pooling.mm 2025-05-22 14:19:06 -07:00
1d29dc5d9c fix test_max_pool3d 2025-05-22 17:04:15 -04:00
fe518636a6 update 2025-05-22 16:33:09 -04:00
765dd32545 One is expected to return Tensor by reference from function 2025-05-22 13:16:58 -07:00
b9ca9918ba [BE] Do not call explicit constructor
Compiler should do the work for you
2025-05-22 13:16:18 -07:00
003540fcb6 Fix build 2025-05-22 13:12:38 -07:00
a7f788143e [MPS] Implement max_pool3d_with_indices 2025-05-22 15:59:53 -04:00
befb5bd52a [dynamic shapes] simplify int(x / y) pattern (#153477)
Fixes #138853

Summary: Converts `TruncToInt(IntTrueDiv(x / y))` to `x // y` if divisible, which helps detect symint specializations that we previously missed.

Differential Revision: D74664734

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153477
Approved by: https://github.com/bobrenjc93
2025-05-16 17:32:15 +00:00
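As a rough illustration of why the rewrite in befb5bd52a is safe only under the divisibility condition, the plain-Python check below compares `int(x / y)` with `x // y`. It does not touch the symbolic `TruncToInt`/`IntTrueDiv` nodes; the helper name `trunc_true_div` is made up for this sketch.

```python
def trunc_true_div(x: int, y: int) -> int:
    """Plain-Python analogue of int(x / y): true division, then truncate toward zero."""
    return int(x / y)

# When y divides x exactly there is no fractional part to drop, so both forms agree.
divisible_pairs = [(12, 4), (-12, 4), (12, -4), (0, 7), (81, 9)]
agree_when_divisible = all(trunc_true_div(x, y) == x // y for x, y in divisible_pairs)

# Without divisibility the two can differ for negative operands:
# int(-7 / 2) truncates toward zero, while -7 // 2 floors.
truncated = trunc_true_div(-7, 2)
floored = -7 // 2
```

The negative, non-divisible case is why the simplification is gated on divisibility rather than applied unconditionally.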
3aa84775e7 [hipify] Replace cuda error cudaErrorContextIsDestroyed (#153576)
Summary: The cuda symbol cudaErrorContextIsDestroyed is not converted to hipErrorContextIsDestroyed. Add this conversion.

Test Plan: CI

Differential Revision: D74542735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153576
Approved by: https://github.com/xw285cornell, https://github.com/cyyever
2025-05-16 16:19:42 +00:00
a060f3d272 Rewrite autograd producer consumer stream sync logic (#151079)
Also see previous work https://github.com/pytorch/pytorch/pull/142097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151079
Approved by: https://github.com/albanD
2025-05-16 15:42:22 +00:00
2ce0b66db8 [dynamo] Make OptimizedModule more robust in attribute reads and writes (#153637)
Fixes #138157.

Differential Revision: [D74834872](https://our.internmc.facebook.com/intern/diff/D74834872)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153637
Approved by: https://github.com/williamwen42
2025-05-16 15:17:07 +00:00
f66a159db5 [Set] Raise TypeError if set is called with the wrong number of arguments (#152990)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152990
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901, #152902, #152903, #152905, #152906, #152989, #152907, #152908
2025-05-16 14:28:32 +00:00
5a0ca65555 [Set] Add correct set/frozenset __init__ behavior (#152908)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152908
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901, #152902, #152903, #152905, #152906, #152989, #152907
2025-05-16 14:28:32 +00:00
053025494f [Set] Raise KeyError on empty set.pop() (#152907)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152907
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901, #152902, #152903, #152905, #152906, #152989
2025-05-16 14:28:32 +00:00
5964cb5eb1 [Set] Update set.union and set.update to support *args (#152989)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152989
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901, #152902, #152903, #152905, #152906
2025-05-16 14:28:32 +00:00
4759922c5e [Set] Add set.intersection(_update) (#152906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152906
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901, #152902, #152903, #152905
2025-05-16 14:28:32 +00:00
ca96d55322 [Set] Add set.difference(_update) (#152905)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152905
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901, #152902, #152903
2025-05-16 14:28:32 +00:00
5c6830ced0 [Set] Raise KeyError if elem not contained in the set (#152903)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152903
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901, #152902
2025-05-16 14:28:32 +00:00
574f4c507a [Set] Add set.issubset and set.issuperset (#152902)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152902
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904, #152901
2025-05-16 14:28:32 +00:00
5926b7a38f [Set] Add set.symmetric_difference(_update) (#152901)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152901
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988, #152904
2025-05-16 14:28:32 +00:00
fe51ce62ca [Set] Raise TypeError if number of arguments mismatch (#152904)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152904
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987, #152988
2025-05-16 14:28:32 +00:00
481c345f49 [Set] Raise TypeError if argument is unhashable (#152988)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152988
Approved by: https://github.com/anijain2305
ghstack dependencies: #150792, #152987
2025-05-16 14:28:32 +00:00
cf7021a0ee [Set] Handle exception in ConstantVariable operation (#152987)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152987
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
ghstack dependencies: #150792
2025-05-16 14:28:32 +00:00
477f13c3fb [Set] Add CPython set tests (#150792)
Tests:
* test_set.py

This PR adds test_set.py from the CPython 3.13 branch and ~400 files to test/dynamo_expected_failures. Most of these are expected to be fixed in upcoming PRs. Only minimal changes were made to test_set.py to enable compilation with Dynamo using the PYTORCH_TEST_WITH_DYNAMO=1 environment variable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150792
Approved by: https://github.com/anijain2305
2025-05-16 14:28:32 +00:00
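The [Set] commits above bring Dynamo in line with CPython's `set` semantics. The plain-Python snippet below only records what CPython itself does in the cases named by the commit titles; it does not exercise Dynamo, and the helper `raised` is a name invented here.

```python
def raised(fn, *args):
    """Return the exception type raised by fn(*args), or None if it succeeds."""
    try:
        fn(*args)
    except Exception as exc:
        return type(exc)
    return None

empty_pop = raised(set().pop)                # pop() on an empty set
missing_remove = raised({1, 2}.remove, 3)    # remove() of an absent element
too_many_args = raised(set, [1], [2])        # set() accepts at most one iterable
unhashable_elem = raised(set, [[1, 2]])      # list elements are unhashable

# union() and update() accept any number of iterables (*args).
merged = {1}.union([2], (3,), {4})
acc = {1}
acc.update([2], [3])
```

A compiled function is expected to raise the same exception types as these eager calls.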
6592086ac3 Add metal kernel for log ops (#153398)
Move unary log ops to metal kernels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153398
Approved by: https://github.com/kulinseth, https://github.com/malfet
2025-05-16 14:25:28 +00:00
8ca985b365 [Break XPU] Skip newly added test case on XPU that failed because torch._C._scatter not implemented. (#153685)
Fixes #153608
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153685
Approved by: https://github.com/malfet
2025-05-16 14:15:50 +00:00
9ccd601a14 [easy] Fix endif comments in functional_base.h (#153696)
The first one of these confused me on #152388. Happened to notice the second.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153696
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-16 14:08:41 +00:00
3443627e07 Revert "[BE]: Enable RUFF TRY400 rule - log.exception (#153473)"
This reverts commit 4f4ecc583e0f48ad2d062a53bf91c61ab40b4948.

Reverted https://github.com/pytorch/pytorch/pull/153473 on behalf of https://github.com/jeanschmidt due to seems to have broken internal signals, @albanD may I count on you to help the author merge his PR? D74837988 ([comment](https://github.com/pytorch/pytorch/pull/153473#issuecomment-2886017075))
2025-05-16 08:29:26 +00:00
86c6f71ddb Revert "[Ez][BE]: Remove accidental classvar (#153540)"
This reverts commit e0dece510b703376d50a5d6536be6c601ca67d9e.

Reverted https://github.com/pytorch/pytorch/pull/153540 on behalf of https://github.com/jeanschmidt due to Broken internal tests, @albanD may you help the author get his PR merged? D74804063 ([comment](https://github.com/pytorch/pytorch/pull/153540#issuecomment-2886011101))
2025-05-16 08:26:37 +00:00
4d073af58c Revert "[inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)"
This reverts commit 725bbb6b5fffa2f2d219a0692ed27e376c9dd48a.

Reverted https://github.com/pytorch/pytorch/pull/152353 on behalf of https://github.com/jeanschmidt due to seems to have broken a few internal tests, @jansel may you help the author get his PR merged? ([comment](https://github.com/pytorch/pytorch/pull/152353#issuecomment-2885997862))
2025-05-16 08:20:39 +00:00
741539a790 Split out second pass of LayerNorm for profiler attribution reasons (#153578)
Summary:
Split out second pass of LayerNorm so it's more likely to show up in
profiler output. In my testing with perf, the samples from the lambda in the
current implementation are attributed somewhat haphazardly.

Differential Revision: D74181627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153578
Approved by: https://github.com/hl475
2025-05-16 08:07:13 +00:00
a9adc9a9b6 [Linter] Add linter to detect device-bias hard code in test cases. (#152948)
Since XPU does not gate community pull requests, we’ve observed that contributors often hardcode "cuda" in functions decorated with @requires_gpu() when adding new test cases. This causes the tests to fail on XPU and breaks XPU CI.
This PR adds a linter to detect such issues automatically. An example is shown below.

```
  Error (TEST_DEVICE_BIAS) [device-bias]
    `@requires_gpu` function should not hardcode device='cuda'

        11670  |                .contiguous()
        11671  |            )
        11672  |
    >>> 11673  |        inp = torch.rand((64, 64), device="cuda") * 2 - 1
        11674  |        boundaries = torch.tensor([-0.9, -0.8, 0.1, 0.2, 0.5, 0.9])
        11675  |
        11676  |        self.common(fn, (inp, boundaries), check_lowp=False)

  Error (TEST_DEVICE_BIAS) [device-bias]
    `@requires_gpu` function should not hardcode .cuda() call

        11700  |            self.assertEqual(ref, res)
        11701  |
        11702  |            for offset2 in (0, 1, 2, 3, 4):
    >>> 11703  |                base2 = torch.randn(64 * 64 + 64, dtype=torch.float32).cuda()
        11704  |                inp2 = torch.as_strided(base2, (64, 64), (64, 1), offset2)
        11705  |                ref2 = fn(inp2)
        11706  |                res2 = fn_c(inp2)

  Error (TEST_DEVICE_BIAS) [device-bias]
    `@requires_gpu` function should not hardcode torch.device('cuda:0')

        11723  |            return x.sin() + x.cos()
        11724  |
        11725  |        base = torch.randn(
    >>> 11726  |            64 * 64 + 64, dtype=torch.float32, device=torch.device("cuda:0")
        11727  |        )
        11728  |
        11729  |        inp1 = torch.as_strided(base, (32, 32), (32, 1), 4)

  Error (TEST_DEVICE_BIAS) [device-bias]
    `@requires_gpu` function should not hardcode .to('cuda') call

        11771  |            torch.manual_seed(42)
        11772  |            base = torch.randn(64 * 64 + 64, dtype=torch.float32, device=self.device)
        11773  |            torch.manual_seed(42)
    >>> 11774  |            base_ref = torch.randn(64 * 64 + 64, dtype=torch.float32).to("cuda")
        11775  |
        11776  |            inp = torch.as_strided(base, size, stride, offset)
        11777  |            inp_ref = torch.as_strided(base_ref, size, stride, offset)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152948
Approved by: https://github.com/EikanWang, https://github.com/cyyever, https://github.com/malfet, https://github.com/jansel
2025-05-16 08:03:54 +00:00
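To illustrate the kind of check a9adc9a9b6 describes, here is a minimal sketch built on Python's standard `ast` module. It is not the linter added by the PR: the function name `find_cuda_hardcodes` and the matching rules are assumptions, and they cover only the four patterns shown in the example output above.

```python
import ast

def _is_cuda_str(node):
    """True for a string literal such as "cuda" or "cuda:0"."""
    return (
        isinstance(node, ast.Constant)
        and isinstance(node.value, str)
        and node.value.split(":")[0] == "cuda"
    )

def find_cuda_hardcodes(source):
    """Return sorted (line, reason) pairs for hard-coded CUDA devices inside
    functions decorated with @requires_gpu (hypothetical matching rules)."""
    hits = set()
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, ast.FunctionDef):
            continue
        # Accept both `@requires_gpu` and `@requires_gpu()`.
        names = {
            getattr(d.func if isinstance(d, ast.Call) else d, "id", None)
            for d in func.decorator_list
        }
        if "requires_gpu" not in names:
            continue
        for node in ast.walk(func):
            if isinstance(node, ast.keyword) and node.arg == "device" and _is_cuda_str(node.value):
                hits.add((node.value.lineno, "device='cuda'"))
            elif isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
                attr = node.func.attr
                if attr == "cuda":
                    hits.add((node.lineno, ".cuda() call"))
                elif attr in ("to", "device") and node.args and _is_cuda_str(node.args[0]):
                    hits.add((node.lineno, f"{attr}('cuda')"))
    return sorted(hits)

sample = '''
@requires_gpu()
def test_a(self):
    x = torch.rand(4, device="cuda")
    y = torch.randn(4).cuda()
    z = torch.randn(4).to("cuda")

def test_b(self):
    w = torch.rand(4, device="cuda")
'''
findings = find_cuda_hardcodes(sample)
```

Only `test_a` is reported, because `test_b` does not carry the `@requires_gpu` decorator. Working on the syntax tree rather than on raw text avoids flagging the word "cuda" inside comments or unrelated strings.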
658d17dfb5 [ONNX] Add test for decomp_table update (#153671)
Added a test to strengthen the case for cherry-picking #153168. The original PR didn’t include this test since the fix for decomp_table and the registry was already covered by existing tests. However, it's reasonable to include a dedicated test for the specific issue (https://github.com/pytorch/pytorch/issues/150367) when considering the cherry-pick.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153671
Approved by: https://github.com/justinchuby
2025-05-16 08:00:16 +00:00
3fe42d4d5d [export] Dynamo symint support (#152677)
Basically adds native _IntWrapper support to dynamo. Here's my process of trying to make symint input support work on dynamo, and how I ended up with this approach [(doc)](https://docs.google.com/document/d/1GvNRQd8BnxlMay_hrEVgEta6VUeUW_hcFeRuB7q1nDY/edit?tab=t.0).

Before passing inputs to dynamo.export, I first wrap them with a class, `_IntWrapper`. When processing dynamic shapes, I then add the corresponding dynamic shape specification to the `dynamism` field stored on the `_IntWrapper`. If no dynamism is specified, the wrapper is unwrapped back to an integer. During dynamo tracing, when we encounter an `_IntWrapper`, we convert it to a symint if the dynamism was specified as `Dim.DYNAMIC/AUTO`. Dynamo then traces a graph that contains symint inputs, which are passed on to AOTAutograd and so on.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152677
Approved by: https://github.com/pianpwk
2025-05-16 07:51:50 +00:00
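The wrap, annotate, and unwrap flow described in 3fe42d4d5d can be sketched in plain Python as follows. This is not the PR's implementation: `IntWrapperSketch`, its `val` field, the helper names, and the use of strings for the dynamism markers are all assumptions made for illustration. Only the `dynamism` field name comes from the commit message.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IntWrapperSketch:
    """Stand-in for the `_IntWrapper` described above (not the real class)."""
    val: int
    dynamism: Optional[str] = None  # e.g. "DYNAMIC" or "AUTO"; None means static

def wrap_ints(args):
    """Wrap plain int inputs; everything else passes through untouched."""
    return [IntWrapperSketch(a) if type(a) is int else a for a in args]

def apply_dynamic_shapes(wrapped, specs):
    """Record each spec on its wrapper, then unwrap ints that stayed static."""
    for arg, spec in zip(wrapped, specs):
        if isinstance(arg, IntWrapperSketch):
            arg.dynamism = spec
    return [
        a.val if isinstance(a, IntWrapperSketch) and a.dynamism is None else a
        for a in wrapped
    ]

processed = apply_dynamic_shapes(wrap_ints([3, "x", 5]), ["DYNAMIC", None, None])
```

After this step only the first argument is still wrapped, so a tracer would have one unambiguous marker telling it which integer to turn into a symbolic value.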
d965fa2c4b [CUDA][cuBLAS] Remove IS_ARM64 skip in test_matmul_cuda.py (#153660)
Original skip seems stale and the test appears to run fine on Grace + Hopper and Grace + Blackwell

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153660
Approved by: https://github.com/Skylion007
2025-05-16 07:31:16 +00:00
1503b3f897 [DSD] Don't pop tensors if they are on Meta device (#153185)
DSD currently pops tensors that are on the Meta device. This blocks use cases where users want DCP to initialize the tensors directly when loading.

This PR also removes test/distributed/checkpoint/e2e/test_pipeline.py, which relied on the behavior above, is not realistic, and is not used anywhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153185
Approved by: https://github.com/mori360
2025-05-16 07:18:39 +00:00
1a722f62c2 [Quant][X86] add an op to compute uint8 batch norm 2d (#152811)
**Summary**
This PR adds a new op, `onednn.qbatch_norm2d`, which accepts uint8 inputs on CPU device (instead of QuantizedCPU).
The new op is implemented with AVX512 instructions and provides performance similar to its counterpart for the QuantizedCPU device, `quantized.batch_norm2d`.
The new op supports output dtypes other than uint8 (fp32, fp16 and bf16 are supported).

**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k test_int8_batch_norm_onednn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152811
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168, https://github.com/jgong5
ghstack dependencies: #152411
2025-05-16 06:13:40 +00:00
7e16cb99b6 [FlexAttention] Enforce Q,K,V memory layouts for fp8 flex attention to avoid perf degradation (#153357)
Fixes #147336

## Context

NCU analysis of the fp8 flex attention perf issue in #147336 showed an unexpected increase in shared memory access bank conflicts when loading the V tensor from HBM to SRAM.

After this was brought to the attention of triton developer @davidberard98, he identified the memory layout of the tensor in HBM as the cause of non-pipelined loads into SRAM, which produced the slowdown.

To summarize:

In flex attention when performing the FP8 GEMM `softmax_scores @ V` the right operand V must be in column-major memory layout. However, the `tl.load` of V blocks from HBM to SRAM cannot be pipelined if the V tensor isn't column-major in HBM already, leading to substantial performance degradation.

This is because triton does not perform async copies with the `cp.async` PTX instruction if the number of contiguous bytes is less than 4 (see [here](81f93f2c8e/lib/Dialect/TritonGPU/Transforms/Pipeliner/PipeliningUtility.cpp (L403))).

i.e., when loading 4 bytes of contiguous data from a tensor stored in row-major in HBM, we have to perform 4 separate non-contiguous writes to SRAM to place those bytes in their new location in the col-major layout in SRAM. Thus the load is not a candidate for pipelining w/ cp.async and just moves data to registers then performs a series of single byte stores.

## Fix summary
- To fix this, we should enforce memory layouts for Q, K, V in FlexAttention when fp8 is being used, to ensure they each exist in HBM in the necessary memory layout to facilitate pipelined loads into SRAM ahead of the FP8 GEMMs

## Benchmarks
Rerunning the repro we see fp8 runtime is reduced from 120% of bf16 to 76% of bf16 runtime.

Before fix:

```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 19:07:33,402 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 19:07:35,885 - flex_bench - INFO - bf16: 424.87228804347734 us
2025-05-11 19:07:35,893 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 19:07:37,319 - flex_bench - INFO - fp8e4m3: 515.714000000001 us
```

After fix:
```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 17:34:38,223 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 17:34:41,157 - flex_bench - INFO - bf16: 423.4662032967036 us
2025-05-11 17:34:41,167 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 17:34:42,917 - flex_bench - INFO - fp8e4m3: 326.3694803493453 us
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153357
Approved by: https://github.com/ngimel, https://github.com/davidberard98
2025-05-16 04:56:50 +00:00
459ce6c12a [export] Flatten frame local logs (#153627)
Summary:
Some new errors have been showing up on the PT2 dashboard with
```
Invalid type for lengths: Expected BlobReference or torch.Tensor, got: Tensor(shape: torch.Size([10]), stride: (1,), storage_offset: 0)
```
This is caused by [this piece of code](https://fburl.com/code/5nbi9on7) which maps over a set of nodes (in this case type `IDListFeatureListField`) and turns the results into strings to be displayed later. However during pytree.tree_map we call pytree.tree_unflatten which will call the class's init function, which calls `assert_blob` (https://fburl.com/code/h3ainrn9). Because we've mapped over the values and converted them to strings, the assert_blob fails.

I initially thought of disabling the assert_blob while tracing (D74684309), but I now think we should flatten the list first, because tlparse expects just a string of outputs rather than the actual structure.

Test Plan: `buck2 run mode/opt sigmoid/inference/ts_migration:pt2i_readiness_main -- --test_suite ads_all --mode test_full_model --model_id 542947220` fails with something else 😅

Differential Revision: D74744326

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153627
Approved by: https://github.com/yiming0416
2025-05-16 04:45:09 +00:00
7ed377f577 Reapply "Delete TorchScript based Android demo app and point to ExecuTorch (#153633)" (#153656)
This reverts commit ae0e8f0c7316addab3f415dc767a9d34f58b0dae.

Keep android/libs/fbjni because it's being used by other components of
PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153656
Approved by: https://github.com/malfet
2025-05-16 04:35:42 +00:00
56e1c236bf [Dynamo] Catch unserialisable NN modules (#153503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153503
Approved by: https://github.com/c00w, https://github.com/jansel
2025-05-16 02:55:28 +00:00
d1f1ff8610 [ddp] propagate use_python_reducer to C++ reducer (#152735)
C++ Reducer is silently incorrect under CA: its implementation no-ops the collective. I'm guessing that it was no-op'd because in DDP + python reducer, the C++ reducer is still being initialized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152735
Approved by: https://github.com/fegin
ghstack dependencies: #153300, #152689
2025-05-16 01:38:03 +00:00
1b4749f748 [ca][dtensor] run real PG dtensor tests under CA (#152689)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152689
Approved by: https://github.com/bdhirsh
ghstack dependencies: #153300
2025-05-16 01:38:03 +00:00
5aea57d653 [ca][dynamo] always run eager checkpoint region's recomputation in eager (#153300)
I slap disable on the recomputation hook, otherwise the partitioner may save less/more activations and mismatch with the expected eager count in checkpoint. See code comment `Note: [compiled autograd and checkpoint unpack hook]`.

This fixes all non-nested checkpointing tests. I also wrap nested checkpointing tests, and a few of them still fail.

This also seems to fix all PYTORCH_TEST_WITH_DYNAMO checkpointing tests except for `TestAutograd.test_checkpointing_without_reentrant_custom_function_works`. For those tests, it looks like we fail to HOPify the checkpointed region and when the backward executes the unpack hooks, dynamo tried to trace them. This messed up the internal state tracking of checkpointing, some raising the _StopRecomputationError and others raising the same count mismatch error as CA.

FIXES https://github.com/pytorch/pytorch/issues/127115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153300
Approved by: https://github.com/jansel
2025-05-16 01:37:48 +00:00
cyy
9d3b6ee4c1 [submodule] Update gtest to v1.17.0 (#153618)
And remove some outdated CMake code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153618
Approved by: https://github.com/malfet
2025-05-16 01:24:19 +00:00
d1dd2c1fc8 gloo: cuda (#153406)
This enables Gloo CUDA when used with a backend that supports GPUDirect which currently is only the IBVERBS backend.

This requires some changes to Gloo which are in https://github.com/pytorch/gloo/pull/441

Since we're now depending on gloo_cuda we need to split ProcessGroupGloo into two pieces, one with the CPU bits (libtorch_cpu) and one with CUDA kernels in libtorch_cuda. This unfortunately requires some major refactoring as some CPU code is shared across both.

The gloo submodule is updated to depend on the new Gloo changes

Test plan:

```py
import os
import time

transport = "TCP"
#transport = "IBVERBS"

os.environ["GLOO_DEVICE_TRANSPORT"] = transport
rank = int(os.environ["RANK"])
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)

ibv = "mlx5_0:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_9:1,mlx5_10:1,mlx5_11:1".split(",")[rank]
ibv_name, ibv_port = ibv.split(":")
os.environ["TORCH_GLOO_IBV_NAME"] = ibv_name
os.environ["TORCH_GLOO_IBV_PORT"] = ibv_port
os.environ["TORCH_GLOO_IBV_INDEX"] = "3"

import torch
import torch.distributed as dist

dist.init_process_group("gloo")

rank = dist.get_rank()

# initial sanity check
#device = "cpu"
#t = torch.zeros(10, device=device)
#dist.all_reduce(t)
#print("sanity complete")

device = "cpu"

iters = 10
warmup_iters = 2

for nelem in [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000]:
    t = torch.zeros(nelem, device=device)

    torch.cuda.current_stream().synchronize()
    for i in range(warmup_iters):
        dist.all_reduce(t)

    torch.cuda.current_stream().synchronize()

    start = time.perf_counter()

    for i in range(iters):
        dist.all_reduce(t)

    torch.cuda.current_stream().synchronize()

    dur = (time.perf_counter() - start)
    qps = iters/dur

    bandwidth_gb = t.nbytes * iters / dur / 1e9

    gb = t.nbytes / 1e9

    if rank == 0:
        print(f"{transport=} {device=} {iters=} {nelem=} {qps=} {gb=} {bandwidth_gb=}\n", end="")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153406
Approved by: https://github.com/fduwjj
2025-05-16 01:13:13 +00:00
ab757dcddc [MPS][Testing] Add GoogleFnet, YituTechConvBert and Super_SloMo to benchmarks (#153658)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153658
Approved by: https://github.com/atalman, https://github.com/ZainRizvi, https://github.com/cyyever
ghstack dependencies: #153657
2025-05-16 01:09:31 +00:00
754b758ea1 [BE] Extend empty_gpu_cache to mps (#153657)
And replace `if: elif:` with `getattr()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153657
Approved by: https://github.com/atalman, https://github.com/wdvr, https://github.com/ZainRizvi
2025-05-16 01:08:54 +00:00
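The "replace `if: elif:` with `getattr()`" idea from 754b758ea1 can be sketched as below. The sketch uses a stand-in namespace (`fake_torch`, an assumption) so that it runs without PyTorch installed, and its error handling is illustrative rather than copied from the PR.

```python
from types import SimpleNamespace

calls = []

# Stand-in for the `torch` module; each backend exposes empty_cache().
fake_torch = SimpleNamespace(
    cuda=SimpleNamespace(empty_cache=lambda: calls.append("cuda")),
    xpu=SimpleNamespace(empty_cache=lambda: calls.append("xpu")),
    mps=SimpleNamespace(empty_cache=lambda: calls.append("mps")),
)

def empty_gpu_cache(device, backend=fake_torch):
    """Look up `<backend>.<device>.empty_cache` by name instead of branching."""
    module = getattr(backend, device, None)
    if module is None or not hasattr(module, "empty_cache"):
        raise ValueError(f"no empty_cache() for device {device!r}")
    module.empty_cache()

for name in ("cuda", "mps"):
    empty_gpu_cache(name)
```

With name-based lookup, supporting another backend needs no new branch as long as the backend module provides an `empty_cache` function.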
2489b6470b [c10d] Allow split_group to work with non nccl backends (#152175)
Summary:
Currently things are hardcoded to only work with nccl backend. Extend it
to allow NCCL + custom plugin backend.

The split-specific methods/attributes have not been added to the base
Backend and Options as some of them are specific to backend implementations.
Instead, explicit checks have been added to the split_group method for the
expected methods and attributes.

I am open to making them part of the base Backend if folks prefer.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152175
Approved by: https://github.com/shuqiangzhang, https://github.com/kwen2501
2025-05-16 00:15:29 +00:00
cb5f31a4a1 Fix fake tensor caching when output has unbacked (#153034)
We handle fake tensor caching in two ways:
1. If the inputs have no symbols (SymInt, etc) then we cache on the FakeTensorMode.
2. If the inputs have symbols then we cache on the ShapeEnv.

This way the symbols in the inputs and outputs are associated with the guards in place at the time of the call.

However - it's possible to have an op where there are no symbols in the inputs but there is an unbacked symbol in the output.  In this case we shouldn't cache at all because what would that really mean?

So this PR changes the caching behavior so that if there's a symbol in the output which doesn't come in some way from the input then we refuse to cache that op.

Added a test which checks for this case.

While in there I also did a couple other related changes:
1. Added negative caching - if we see that an (op, args) failed to cache previously we don't even bother trying to cache it again.
2. Reworked the inner behavior of _cached_dispatch_impl a little to make it more clear which bits we expect to be able to throw _BypassDispatchCache and add some comments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153034
Approved by: https://github.com/masnesral, https://github.com/tugsbayasgalan
2025-05-15 23:18:52 +00:00
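The negative-caching change mentioned in cb5f31a4a1 can be pictured with the toy dispatcher below. It is a simplified sketch, not the FakeTensor code: `BypassCache` stands in for `_BypassDispatchCache`, the class and function names are invented, and the "uncacheable" decision is made from the op name alone, whereas the real check inspects the output symbols.

```python
class BypassCache(Exception):
    """Raised when a call must not be cached (stand-in for _BypassDispatchCache)."""

class DispatchCacheSketch:
    def __init__(self, compute, make_key):
        self.compute = compute
        self.make_key = make_key    # may raise BypassCache
        self.cache = {}
        self.uncacheable = set()    # negative cache of (op, args) pairs
        self.key_attempts = 0

    def __call__(self, op, args):
        ident = (op, args)
        if ident in self.uncacheable:
            # Known failure: skip the caching attempt entirely.
            return self.compute(op, args)
        self.key_attempts += 1
        try:
            key = self.make_key(op, args)
        except BypassCache:
            self.uncacheable.add(ident)
            return self.compute(op, args)
        if key not in self.cache:
            self.cache[key] = self.compute(op, args)
        return self.cache[key]

computed = []

def compute(op, args):
    computed.append(op)
    return (op, sum(args))

def make_key(op, args):
    # Stand-in for "the output contains a symbol that does not come from the inputs".
    if op == "nonzero":
        raise BypassCache(op)
    return (op, args)

dispatch = DispatchCacheSketch(compute, make_key)
dispatch("add", (1, 2))
dispatch("add", (1, 2))        # served from the cache
dispatch("nonzero", (1, 2))    # caching fails and is remembered
dispatch("nonzero", (1, 2))    # caching is not attempted again
```

The second `nonzero` call recomputes the result but does not try to build a cache key, which is the saving negative caching is meant to provide.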
e7a40fb301 [Async TP] Fix dim swapping before reduction in fused_scaled_matmul_reduce_scatter (#153595)
## Summary
- The unit test `pytest test/distributed/test_symmetric_memory.py -k test_fused_scaled_matmul_reduce_scatter_scatter` was not running for some reason when #149247 was merged, giving false green CI signals. When it was run manually recently, the test failed, highlighting a bug causing incorrect numerics when `scatter_dim=1`.
- This PR fixes the bug, which was related to how we swap dims 0<=>scatter_dim at the beginning of the custom op (for more efficient cross-device data movement I believe), then swap it back prior to reduction.

## Test plan
- I confirmed the unit test `pytest test/distributed/test_symmetric_memory.py -k test_fused_scaled_matmul_reduce_scatter_scatter` is now passing.
- I confirmed e2e training w/ torchtitan looks good ([logs](https://www.internalfb.com/phabricator/paste/view/P1812054188))
- I analyzed the tlparse to verify the fused_all_gather_matmul and fused_scaled_matmul_reduce_scatter both appear at least once in the post grad graphs ([tlparse](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpVbUsdG/dedicated_log_torch_trace_65oh3qj_.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000))

## Next steps
1. I think for async TP `fused_scaled_matmul_reduce_scatter` we may only need `scatter_dim_after_maybe_reshape` and not `orig_scatter_dim` after all. I can confirm this and refactor if it is the case.
2. This op is specifically designed for async TP, and many of the arguments don't make sense for a user trying to use this as a standalone op. IMO we should have separate standalone custom op without all the extra function args and internal logic that doesn't apply to non-async TP cases.
3. In a follow up PR I want to add shape annotations to each line (e.g. `# (B, T, H)` etc) to make this easier to debug in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153595
Approved by: https://github.com/fegin
2025-05-15 21:44:57 +00:00
ea17cd067d Add vec_reduce_all specialization for std::plus on AArch64 (#152388)
AArch64 has an instruction for this.

Differential Revision: [D73817183](https://our.internmc.facebook.com/intern/diff/D73817183/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152388
Approved by: https://github.com/Skylion007
ghstack dependencies: #152365, #152366
2025-05-15 21:26:18 +00:00
b972435158 vec::map: directly process reduced-precision floats when reasonable (#152366)
The immediate motivation is to make map support match
ExecuTorch so we can delete ExecuTorch-specific mapping functions, but
this should also straightforwardly improve performance.

Testing: there is existing coverage for this in
vec_test_all_types.cpp. Verified that it really does cover the newly
enabled "don't convert through float" paths by temporarily adding a
TORCH_INTERNAL_ASSERT(false).

Differential Revision: [D73802126](https://our.internmc.facebook.com/intern/diff/D73802126/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152366
Approved by: https://github.com/malfet
ghstack dependencies: #152365
2025-05-15 21:26:18 +00:00
e4adf5df39 [ROCm] cpp_extension allow user to override default flags (#152432)
We need -fgpu-rdc for projects such as DeepEP + rocSHMEM. The default of -no-gpu-rdc doesn't work for such cases.

As per https://github.com/pytorch/pytorch/pull/152432#issuecomment-2840899088:
"rocshmem shares the same global variable in different files, as deepEP uses CUDAExtention to build the project 65e2a700f0/setup.py (L51) and depends on rocshmem, this -fgpu-rdc is needed. The current logic in Pytorch prevents users from overriding this flag."

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152432
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-15 21:06:18 +00:00
b8fad785d5 Change trigger for autoformat, use --all-files (#153289)
Change trigger for auto format to be pull_request b/c the reusable action used gets the pr number from the pull_request event context, but only run it if ciflow/autoformat is attached to the PR.  Tested this on a different PR, and it seems to be working

Changed tag name because ciflow prefixed labels have special handling

Also change to run on all files so it mimics the normal CI lintrunner call, and because lintrunner, either by itself or using -m mergebase, can miss some things.  I don't know whether it would miss anything for format, but it does when checking lint.  Formatting seems to take less time than normal lint.  I don't know if the comment about making suggestions on non-edited file changes is a concern.  I didn't really test this part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153289
Approved by: https://github.com/atalman, https://github.com/malfet
2025-05-15 20:38:33 +00:00
90deff6d59 Refactor tests in test_max_autotune into a few separate test cases. (#153486)
Summary: To support running a subset of these tests with the remote autotuning utilities, I've split out some of the tests into separate classes so that I can derive from the "main" TestMaxAutotune class when creating new tests for remote. I'm not 100% sure what some of these tests do, so please suggest if another grouping / naming might make more sense. The remaining tests in TestMaxAutotune all smelled relevant to me.

Test Plan: existing unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153486
Approved by: https://github.com/eellison
2025-05-15 20:35:22 +00:00
a2e2f908fd add is_vec_specialized_for (#152365)
Let people detect at compile time whether Vectorized is specialized for a given type. See vec_base.h.

Differential Revision: [D73802129](https://our.internmc.facebook.com/intern/diff/D73802129/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152365
Approved by: https://github.com/jgong5, https://github.com/malfet
2025-05-15 20:21:48 +00:00
ae0e8f0c73 Revert "Delete TorchScript based Android demo app and point to ExecuTorch (#153633)"
This reverts commit b22f01fcb9d69bb7d77e08d69004c7265ef7fa4a.

Reverted https://github.com/pytorch/pytorch/pull/153633 on behalf of https://github.com/malfet due to But libtorch build regressions are real, fbjni is still used for C++ builds ([comment](https://github.com/pytorch/pytorch/pull/153633#issuecomment-2884951805))
2025-05-15 20:16:05 +00:00
b03e4f53d2 [Monitoring] enable windows monitoring test (#153453)
enable the utilization for win tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153453
Approved by: https://github.com/huydhn
2025-05-15 20:03:07 +00:00
f7ecc091a0 c10d/TCPStore: better logs on remote shutdown (#153586)
This makes it more obvious what's going on when TCPStore shuts down while waiting on a remote key and also shows the remote address.

Test plan:

```
[W514 18:33:36.536327028 TCPStore.cpp:138] [c10d] recvValueWithTimeout failed on SocketImpl(fd=3, addr=[localhost]:34658, remote=[localhost]:1234): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
```

```py
import os
rank = int(os.environ["RANK"])

import time
from torch import distributed as dist

store = dist.TCPStore(
    host_name="localhost",
    port=1234,
    is_master=(rank == 0),
    wait_for_workers=False,
)

time.sleep(1)

print("starting")

if rank != 0:
    store.get("foo")
else:
    time.sleep(1)

print("done")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153586
Approved by: https://github.com/XilunWu
2025-05-15 20:02:51 +00:00
064f4c18f9 [Monitoring] Enable perf tests (#153452)
Enable monitoring for more perf tests. Currently, for perf tests, we collect usage data every 4 seconds and aggregate it every 15 seconds.

These numbers can be reduced if the monitoring does not affect the perf tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153452
Approved by: https://github.com/Skylion007, https://github.com/huydhn
2025-05-15 19:19:19 +00:00
a4c828199e [BE] Add __all__ to torch/nn/functional.pyi and torch/return_types.pyi (#150729)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150729
Approved by: https://github.com/aorenste
2025-05-15 19:01:57 +00:00
b22f01fcb9 Delete TorchScript based Android demo app and point to ExecuTorch (#153633)
Delete TorchScript demo app and point people to ExecuTorch demo app.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153633
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/atalman, https://github.com/janeyx99, https://github.com/seemethere
2025-05-15 18:43:59 +00:00
00e5cb3db3 [ez][trymerge] Edit revert message for reverted ghstack PRs (#153573)
Change comment about successful revert so it also contains info about the original PR that got the comment (if it is a ghstacked PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153573
Approved by: https://github.com/atalman, https://github.com/malfet
2025-05-15 18:23:20 +00:00
480ae2dab8 Add needs_contiguous_strides to more collective ops (#153523)
Differential Revision: D74705770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153523
Approved by: https://github.com/fmassa
2025-05-15 17:27:37 +00:00
cfee9046b6 cpu: enable gemm-bf16f32 for SDPA BF16 (#140159)
This PR enables SDPA BF16: gemm:bf16f32 for aarch64. This will enable faster inference for models with attention layers for autocast mode (bf16).

Benchmark results from  [PyTorch CI HUD - branch](https://hud.pytorch.org/benchmark/huggingface/inductor_no_cudagraphs?dashboard=torchinductor&startTime=Fri%2C%2028%20Mar%202025%2021%3A26%3A20%20GMT&stopTime=Fri%2C%2004%20Apr%202025%2020%3A26%3A20%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(aarch64)&lBranch=adi/gemm_bf16f32&lCommit=d5aeab452e4b1f0580a4636b15a604c77a02c57b&rBranch=main&rCommit=bc72420bcb37390af3fced885e019903e6e425bd)
Overall geometric mean speedup in the HUD dashboard: for Huggingface `[0.48x → 0.58x]` and for Blueberries `[0.88x → 1.13x]`.

Benchmark numbers for `torch.nn.functional.scaled_dot_product_attention` on Neoverse™ V1.

`batch_size = 1, num_attention_heads = 64, sequence_length = 512, attention_head_size = 128`
 `threads=16`
<img width="319" alt="Screenshot 2024-12-20 at 16 23 22" src="https://github.com/user-attachments/assets/c863f97d-0761-4fb8-aa6c-fc67b22ac3f9" />

Script to benchmark & profile SDPA:

    import torch
    import torch.nn as nn
    import time
    import numpy as np
    from torch.profiler import profile, record_function, ProfilerActivity
    class SimpleAttentionModel(nn.Module):
        def __init__(self, query, key, value):
            super(SimpleAttentionModel, self).__init__()
            self.query = query
            self.key = key
            self.value = value

        def forward(self, attn_mask=None):
            torch.nn.functional.scaled_dot_product_attention(
                        self.query,
                        self.key,
                        self.value,
                        attn_mask=attn_mask)

    #batch_size = 1, num_attention_heads = 64, sequence_length = 512, hidden_size = 128
    def bench_sdpa(batch_size = 1, num_attention_heads = 64, sequence_length = 512, query_sequence_length = 128 , hidden_size=128, precision=torch.float32):
        with torch.no_grad():
            attention_head_size = int(hidden_size / num_attention_heads)
            query = torch.rand(size=(batch_size, num_attention_heads, query_sequence_length, attention_head_size), dtype=precision)
            key = torch.rand(size=(batch_size, num_attention_heads, sequence_length, attention_head_size), dtype=precision)
            value = torch.rand(size=(batch_size, num_attention_heads, sequence_length, attention_head_size), dtype=precision)

            model = SimpleAttentionModel(query, key, value)
            model.eval()
            for _ in range(10):
                model()
            times = []
            n_iters = 100
            for _ in range(n_iters):
                s = time.time_ns()
                model()
                times.append((time.time_ns() - s) / 1e3)
            min_times = np.min(times)
            mean_times = np.mean(times)
            print(f"Min Times = {min_times} us")
            print(f"Mean Times = {mean_times} us")
            print("Times = ", times)

    print("BF16 mode:")
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        with record_function("model_inference"):
            bench_sdpa(precision=torch.bfloat16)
    profile_data = prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_time_total")
    print(profile_data)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140159
Approved by: https://github.com/jgong5, https://github.com/malfet, https://github.com/nikhil-arm, https://github.com/leslie-fang-intel, https://github.com/CaoE, https://github.com/cfRod, https://github.com/fadara01
2025-05-15 17:21:18 +00:00
236b08cbf8 Revert "[ca][dynamo] always run eager checkpoint region's recomputation in eager (#153300)"
This reverts commit 4863e5c843722eb2a34fb0ca1d518a33431a38c0.

Reverted https://github.com/pytorch/pytorch/pull/153300 on behalf of https://github.com/malfet due to Looks like it breaks rocm, see fa8543454a/1 ([comment](https://github.com/pytorch/pytorch/pull/153300#issuecomment-2884489459))
2025-05-15 16:58:52 +00:00
2327c9eedc Revert "[ca][dtensor] run real PG dtensor tests under CA (#152689)"
This reverts commit b297e01f4b1f43ffd1769313f077a2a68928f012.

Reverted https://github.com/pytorch/pytorch/pull/152689 on behalf of https://github.com/malfet due to Looks like it breaks rocm, see fa8543454a/1 ([comment](https://github.com/pytorch/pytorch/pull/153300#issuecomment-2884489459))
2025-05-15 16:58:51 +00:00
db26aeaec2 [MPSInductor] Support numpy scalars handling (#153598)
By default, NumPy computes results in float64, but a NumPy scalar passed as an argument to an MPS function must be implicitly converted to float32. This occurs naturally in some networks, for example in speech_transformer.
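
To make the float64-to-float32 conversion concrete, here is a minimal standard-library sketch that rounds a double-precision value to single precision. It uses neither NumPy nor MPS: `to_float32` is a hypothetical helper, and `math.exp(0.3)` merely stands in for a scalar that NumPy would compute in float64.

```python
import math
import struct


def to_float32(value: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    # Packing as "f" stores 4 bytes (single precision); unpacking reads it back.
    return struct.unpack("f", struct.pack("f", value))[0]


scalar64 = math.exp(0.3)         # double precision, like a NumPy scalar
scalar32 = to_float32(scalar64)  # what a float32-only backend can represent

print(scalar64, scalar32, abs(scalar64 - scalar32))
```

The two values differ only in the low-order digits, which is why a backend without float64 support has to handle such scalars explicitly rather than pass them through unchanged.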

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153598
Approved by: https://github.com/cyyever, https://github.com/dcci
ghstack dependencies: #153582
2025-05-15 16:48:25 +00:00
0cb48633d9 [ez][CI] Add linux aarch64 to upload test stats, change format of trigger for upload test stats (#153505)
Change from inline list to yml list
Add linux aarch64 for list of triggering workflows
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153505
Approved by: https://github.com/Skylion007
2025-05-15 15:33:59 +00:00
fa8543454a [dynamo][torch-function] Prevent unnecessary __torch_function__ tracing (#153551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153551
Approved by: https://github.com/mlazos
2025-05-15 14:06:17 +00:00
4f4ecc583e [BE]: Enable RUFF TRY400 rule - log.exception (#153473)
Change logging.error to logging.exception to log additional information when relevant. A few places have slipped in logging.error calls in try/except blocks since I last did a clean up here, and the rule is stabilized, so I am enabling it codebase wide. I have NOQA'd much of our custom exception stack trace handling for RPC calls and distributed, and tried to fix a few errors based on whether we immediately reraised the exception or whether we didn't print any exception handling where it could be useful.
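
As a rough illustration of what the rule buys, the sketch below logs from inside an `except` block with `logger.exception` and captures the output in memory. The logger name and the `parse_port` function are made up for this example and are not taken from the PyTorch codebase.

```python
import io
import logging

# Capture log output in memory so the difference is easy to inspect.
stream = io.StringIO()
logger = logging.getLogger("try400_demo")  # hypothetical logger name
logger.setLevel(logging.ERROR)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))


def parse_port(text: str) -> int:
    try:
        return int(text)
    except ValueError:
        # logger.error would record only the message;
        # logger.exception also appends the active traceback.
        logger.exception("could not parse port %r", text)
        return -1


result = parse_port("not-a-number")
captured = stream.getvalue()
print(captured)
```

The captured text contains both the message and the `ValueError` traceback, which is the additional information the commit refers to.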

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153473
Approved by: https://github.com/albanD, https://github.com/cyyever
2025-05-15 13:36:59 +00:00
7482eb217c [Inductor-CPU] Faster int8 WoQ GEMM for small M with explicit prefetching and different outer loops (#149373)
### Summary

Fixes #148494

Explicitly prefetch the cache lines of the next `B` block to accelerate int8 WoQ (BF16 activation, int8 statically quantized weights) GEMM for small `M` dimension.

Some of this code (outer loops of the GEMM) is being ported over from Intel Extension for PyTorch. The macro-kernel* and the micro-kernel* are essentially the same, but optionally prefetch a block of B. Templatization is being used to prevent branching causing a slowdown due to unnecessary prefetching.

\* - in [BLIS](https://dl.acm.org/doi/10.1145/2764454) parlance

### Performance data with BS 1

Machine: 32 cores of one socket of an Intel Xeon SP Gen 5 machine

| Model | Input tokens | Output tokens | Next-token latency before this PR | Next-token latency after this change | Speedup |
|-----------|-------------|-----------------|--------------------------------------|------------------------------------------|-----------|
| GPT-J | 128 | 128 | 42 ms | 38 ms | 9.52 % |
| GPT-J | 1024 | 1024 | 48 ms | 45 ms | 6.25 % |
| LLaMA 3.1 8B Instruct | 128 | 128 | 52 ms | 47 ms | 9.61 % |
| LLaMA 3.1 8B Instruct | 1024 | 1024 | 57 ms | 53 ms | 7.01 % |

While the input shapes of the GEMMs corresponding to linear layers for next-token computation remain the same for different numbers of input and output tokens, the difference in next-token latency is due to attention in those cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149373
Approved by: https://github.com/leslie-fang-intel, https://github.com/Xia-Weiwen

Co-authored-by: Xia Weiwen <xia.weiwen@hotmail.com>
2025-05-15 11:55:58 +00:00
cyy
e5e06d9cab [submodule] Update kleidiai to v1.8.0 (#153592)
And cleans up some CMake instructions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153592
Approved by: https://github.com/malfet
2025-05-15 10:14:05 +00:00
22b124335e [BE] Update .pyi stub template to use Generic TypeAlias (PEP 585) and Union Type (PEP 604) (#150728)
https://github.com/pytorch/pytorch/pull/129001#discussion_r1645126801 is the motivation for the whole stack of PRs. In `torch/__init__.py`, `torch._C.Type` shadows `from typing import Type`, and there is no type stub for `torch._C.Type` in `torch/_C/__init__.pyi`. So we need to use `from typing import Type as _Type`. After enabling [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585) in the `.pyi` type stub files, we can use `type` instead of `typing.Type` or `from typing import Type as _Type`.

------

- [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`.
- [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X | Y`, `Optional[X] -> X | None`, `Optional[Union[X, Y]] -> X | Y | None`.

Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449:

- #117449

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150728
Approved by: https://github.com/cyyever, https://github.com/aorenste
ghstack dependencies: #150726, #150727
2025-05-15 09:36:42 +00:00
f7a5aa1d8d [torchgen] Refactor and simplify gen_pyi.py to use Generic TypeAlias (PEP 585) and Union Type (PEP 604) (#150727)
https://github.com/pytorch/pytorch/pull/129001#discussion_r1645126801 is the motivation for the whole stack of PRs. In `torch/__init__.py`, `torch._C.Type` shadows `from typing import Type`, and there is no type stub for `torch._C.Type` in `torch/_C/__init__.pyi`. So we need to use `from typing import Type as _Type`. After enabling [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585) in the `.pyi` type stub files, we can use `type` instead of `typing.Type` or `from typing import Type as _Type`.

------

- [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`.
- [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X | Y`, `Optional[X] -> X | None`, `Optional[Union[X, Y]] -> X | Y | None`.

Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449:

- #117449

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150727
Approved by: https://github.com/aorenste
ghstack dependencies: #150726
2025-05-15 09:36:42 +00:00
129a2976a8 [ROCm] Improvements to non-vectorized elementwise kernels (#153184)
* Unroll loops manually to hide memory access latency

Co-authors: @akadutta @amd-hhashemi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153184
Approved by: https://github.com/jeffdaily
2025-05-15 09:14:43 +00:00
6e107899da [Torch] Fix crash when comparing fp8 tensors that have more than 1 dimension (#153508)
Summary: `torch.nonzero` returns as many items as the number of dimensions, so we shouldn't expect a single element for the indices.

Test Plan: CI

Differential Revision: D74539233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153508
Approved by: https://github.com/exclamaforte
2025-05-15 08:41:46 +00:00
b297e01f4b [ca][dtensor] run real PG dtensor tests under CA (#152689)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152689
Approved by: https://github.com/bdhirsh
ghstack dependencies: #153300
2025-05-15 08:10:35 +00:00
4863e5c843 [ca][dynamo] always run eager checkpoint region's recomputation in eager (#153300)
I slap disable on the recomputation hook, otherwise the partitioner may save less/more activations and mismatch with the expected eager count in checkpoint. See code comment `Note: [compiled autograd and checkpoint unpack hook]`.

This fixes all non-nested checkpointing tests. I also wrap nested checkpointing tests, and a few of them still fail.

This also seems to fix all PYTORCH_TEST_WITH_DYNAMO checkpointing tests except for `TestAutograd.test_checkpointing_without_reentrant_custom_function_works`. For those tests, it looks like we fail to HOPify the checkpointed region and when the backward executes the unpack hooks, dynamo tried to trace them. This messed up the internal state tracking of checkpointing, some raising the _StopRecomputationError and others raising the same count mismatch error as CA.

FIXES https://github.com/pytorch/pytorch/issues/127115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153300
Approved by: https://github.com/jansel
2025-05-15 08:10:35 +00:00
71027b13b2 Revert "[FlexAttention] Enforce Q,K,V memory layouts for fp8 flex attention to avoid perf degradation (#153357)"
This reverts commit 881a598a1e38ef06d4f51d1e3fd8e359fed0c3a0.

Reverted https://github.com/pytorch/pytorch/pull/153357 on behalf of https://github.com/jeanschmidt due to Might have introduced regressions in rocm testing for main: https://github.com/pytorch/pytorch/actions/runs/15035410497/job/42257000513 feel free to re-merge if this was a mistake ([comment](https://github.com/pytorch/pytorch/pull/153357#issuecomment-2882915691))
2025-05-15 07:58:27 +00:00
004dad48f7 Allow to set custom PYTHONPATH for torch.inductor (#152832)
When using Bazel, it’s common to encounter issues like [this](https://github.com/bazelbuild/bazel/issues/14640) and [this](https://github.com/bazel-contrib/rules_python/issues/792) where the `PYTHONPATH` environment variable becomes too long and results in an error such as: `OSError: [Errno 7] Argument list too long` . To work around this, users often resort to custom logic to manipulate PYTHONPATH.

Currently, PyTorch Inductor constructs the PYTHONPATH for a subprocess using sys.path, which can lead to this issue in certain environments.

This PR introduces support for a new environment variable, `TORCH_CUSTOM_PYTHONPATH`, allowing users to override the default `PYTHONPATH` passed to the subprocess. This provides a clean way to avoid an exception when using PyTorch in Bazel.
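
The override can be pictured roughly as follows. This is a sketch of the behaviour described above, not the actual Inductor code: `build_subprocess_env` is a hypothetical helper, and only the `TORCH_CUSTOM_PYTHONPATH` variable name comes from this PR.

```python
import os
import sys


def build_subprocess_env(environ=None):
    """Hypothetical helper: pick the PYTHONPATH for a worker subprocess."""
    base = dict(os.environ if environ is None else environ)
    custom = base.get("TORCH_CUSTOM_PYTHONPATH")
    if custom is not None:
        # User-supplied override, e.g. a shortened path under Bazel.
        base["PYTHONPATH"] = custom
    else:
        # Default behaviour described above: derive it from sys.path.
        base["PYTHONPATH"] = os.pathsep.join(sys.path)
    return base


default_env = build_subprocess_env({})
override_env = build_subprocess_env({"TORCH_CUSTOM_PYTHONPATH": "/short/path"})
print(override_env["PYTHONPATH"])
```

With the variable set, the subprocess sees only the short path, so the environment stays well below the operating system's argument-size limit.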

Please let me know if I need to add some documentation to support this PR. I haven't found an open issue specific to this change but I'm confident that this change (or a similar one) would be appreciated by few.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152832
Approved by: https://github.com/masnesral
2025-05-15 06:35:41 +00:00
55784be01b [Quant][X86] add ops to compute uint8 pointwise add/add_relu (#152411)
**Summary**
This PR adds two new ops, `onednn.qadd.tensor` and `onednn.qadd_relu.tensor`, for int8 elementwise add, which accept inputs on the CPU device (instead of QuantizedCPU).
The new ops are implemented with AVX512 instructions and provide similar or better performance, depending on shape, than their counterparts for the QuantizedCPU device, `quantized.add` and `quantized.add_relu`.
The new ops support output dtypes other than uint8 (fp32, fp16 and bf16 are supported).

**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k test_int8_add_onednn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152411
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2025-05-15 06:23:01 +00:00
a762dd1f67 [Memento] On-demand mode using without torch api (#153171)
Summary:
CUDA Post: https://fb.workplace.com/groups/ai.efficiency.tools.users/permalink/2020094788475989/

# Context
In this diff, we want to enable the on-demand mode of memory snapshot to allow users to trace any remote process via the dyno command line.

# Design decision

**How do we send the on-demand signal to a remote process**
We leverage the dyno-Kineto approach.
Since dyno is running on all machines in Meta, it can send a request to the remote machine to start Kineto.
Kineto will start another thread for memoryProfiler (https://fburl.com/code/dxsmmrok)

**Why we use a different approach from CUDA**

On the CUDA side, we use pybind to load the torch module and invoke the Python API to start/stop the profiling. However, this requires us to compile the whole torch binary in the predictor, which is not recommended by runtime (andruwang).

Thus, we decided to use the C++ API directly to avoid unnecessary dependencies.

**Why the snapshot is saved as a JSON string directly instead of pickle**
Pickle is primarily designed for use with Python and is not well supported in C++. Also, it is hard for users to download the snapshot file and open it locally.
Due to the dependency issue, it is hard to import the gzip/pickle library to decode the data. Thus, let's use JSON for now. I will work on the visualizer to speed up rendering and support other formats later.

**Plan**:
* For now, we will encode the file into gz for MTIA on-demand only and update the visualizer to support both types.
* Update auto-trace and the CUDA side to encode in gzip as well.
* Fully remove the pickle dependency.

Test Plan:
# Remote cogwheel test
Servicelab: https://fburl.com/servicelab/pckux7a3
snapshot file manifold: https://fburl.com/manifold/fnotk18c
snapshot file in pastry: P1805522232

Visualization on D74399684
 {F1977786422}

# Local Predictor Test
url: https://fburl.com/pytorch_memory_visualizer/y06kskkm

 {F1977787329}

Differential Revision: D74179606

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153171
Approved by: https://github.com/sraikund16
2025-05-15 06:07:04 +00:00
181bfabb9e fix set_logs for a single child log file (#153580)
Tested via

```
+        import logging
+        torch._logging.set_logs(modules={"torch._functorch._aot_autograd.autograd_cache": logging.DEBUG})
```

```
python test/dynamo/test_aot_autograd_cache.py -k test_multi_graph_specialization
```
and verifying logs are printed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153580
Approved by: https://github.com/ColinPeppler
2025-05-15 05:58:45 +00:00
9839ec1383 [dynamo][compile-time] Cache method on load builtin (#153524)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153524
Approved by: https://github.com/StrongerXi, https://github.com/jansel
ghstack dependencies: #153522
2025-05-15 05:54:15 +00:00
b47be23461 [dynamo][compile-time] Faster inspect getattr_static for torch.Tensor (#153522)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153522
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-05-15 05:54:15 +00:00
910d2f96af [cutlass backend] forward fix cutlass backend A100 test (#153428)
Forward fix of https://github.com/pytorch/pytorch/pull/153006, which broke a test.

In the long run, we should get rid of CUDATemplateCaller.category.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153428
Approved by: https://github.com/ColinPeppler
2025-05-15 05:45:38 +00:00
0ca91af6b8 Define USE_C10D_XCCL and USE_XCCL in pytorch (#147593)
### Motivation:

Add `USE_XCCL` and `USE_C10D_XCCL` to enable support for building the XCCL backend in stock PyTorch, similar to `USE_NCCL` and `USE_C10D_NCCL`.
By default, `USE_XCCL` is OFF and can be set to ON explicitly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147593
Approved by: https://github.com/guangyey, https://github.com/malfet, https://github.com/albanD, https://github.com/cyyever
2025-05-15 05:39:00 +00:00
ebd3268538 Removed duplicate patterns from gitignore (#153515)
Removed duplicate patterns from gitignore. These patterns are duplicated verbatim on lines 148-169.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153515
Approved by: https://github.com/soulitzer
2025-05-15 05:38:42 +00:00
b992a665d1 Fix AsyncMM not compiled with SM90a issue (#153519)
The CMakeLists.txt is wrong and doesn't enable SM90a for AsyncMM.cu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153519
Approved by: https://github.com/drisspg, https://github.com/ngimel, https://github.com/cyyever
2025-05-15 05:23:29 +00:00
d5ddc5ab20 [MPS] Fix float64 scalar tensor handling (#153582)
The current implementation causes a silent correctness problem with torch.compile when someone tries to `torch.compile` a function where one of the arguments is, say, `np.exp(.3)`, which will be represented as a torch.float64 scalar tensor.

Add a regression test for this behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153582
Approved by: https://github.com/dcci
2025-05-15 05:15:14 +00:00
3e8bda4ad5 [pytorch][triton] flex attention fwd kernel with TMA loads (#151923) (#152460)
Summary:

Device side TMA for flex_attention fwd kernel, Q K V tensors

Test Plan:
Unit test:
```
buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:flex_attention -- test_tma_with_customer_kernel_options
```
https://www.internalfb.com/intern/testinfra/testrun/14355223891618726

Differential Revision: D71082691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152460
Approved by: https://github.com/drisspg
2025-05-15 04:49:32 +00:00
756fd80734 [BE] Improve the typing related to model input argument of torch.compile() (#153559)
Summary: Match the `overload` typing with the original typing in function definition and adjust the corresponding comments.

Test Plan: contbuild & OSS CI

Differential Revision: D74746243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153559
Approved by: https://github.com/Skylion007
2025-05-15 04:49:26 +00:00
d2f6c6df1d unbreak fb:operator_benchmark_test (#152049)
Summary: unbreak fb:operator_benchmark_test

Test Plan: works on my machine

Differential Revision: D73540912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152049
Approved by: https://github.com/hl475
2025-05-15 03:38:48 +00:00
014726d9d3 [torchgen] Refactor torchgen.utils.FileManager to accept pathlib.Path (#150726)
This PR allows `FileManager` to accept `pathlib.Path` as arguments while keeping the original `str` path support.

This allows us to simplify the code such as:

1. `os.path.join(..., ...)` with `Path.__truediv__(..., ...)` (the `/` operator).

95a5958db4/torchgen/utils.py (L155)

95a5958db4/torchgen/utils.py (L176)

2. `os.path.basename(...)` with `Path(...).name`.
 95a5958db4/torchgen/utils.py (L161)

3. Manual file extension split with `Path(...).with_stem(new_stem)`

95a5958db4/torchgen/utils.py (L241-L256)
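
The three simplifications can be tried out in isolation. The directory and file names below are invented for this example and do not come from `torchgen`. `Path.with_stem` requires Python 3.9 or newer.

```python
import os.path
from pathlib import Path

install_dir = Path("build") / "generated"  # hypothetical directory
filename = "native_functions.cpp"          # hypothetical file name

# 1. Joining: os.path.join(a, b) versus the / operator (Path.__truediv__).
joined_old = os.path.join(str(install_dir), filename)
joined_new = install_dir / filename

# 2. Base name: os.path.basename(p) versus Path(p).name.
base_old = os.path.basename(joined_old)
base_new = joined_new.name

# 3. Renaming the stem while keeping the extension.
sharded = joined_new.with_stem(joined_new.stem + "_0")

print(joined_new, base_new, sharded.name)
```

Each `pathlib` form yields the same path as its `os.path` counterpart, so the refactor changes how the code reads rather than what it does.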

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150726
Approved by: https://github.com/aorenste
2025-05-15 02:52:24 +00:00
881a598a1e [FlexAttention] Enforce Q,K,V memory layouts for fp8 flex attention to avoid perf degradation (#153357)
Fixes #147336

## Context

NCU analysis of the fp8 flex attention perf issue in #147336 showed an unexpected increase in shared memory access bank conflicts when loading the V tensor from HBM to SRAM.

After I brought this to the attention of Triton developer @davidberard98, he identified the memory layout of the tensor in HBM as the cause of non-pipelined loads into SRAM, which in turn caused the slowdown.

To summarize:

In flex attention when performing the FP8 GEMM `softmax_scores @ V` the right operand V must be in column-major memory layout. However, the `tl.load` of V blocks from HBM to SRAM cannot be pipelined if the V tensor isn't column-major in HBM already, leading to substantial performance degradation.

This is because triton does not perform async copies with the `cp.async` PTX instruction if the number of contiguous bytes is less than 4 (see [here](81f93f2c8e/lib/Dialect/TritonGPU/Transforms/Pipeliner/PipeliningUtility.cpp (L403))).

i.e., when loading 4 bytes of contiguous data from a tensor stored in row-major in HBM, we have to perform 4 separate non-contiguous writes to SRAM to place those bytes in their new location in the col-major layout in SRAM. Thus the load is not a candidate for pipelining w/ cp.async and just moves data to registers then performs a series of single byte stores.
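
A deliberately simplified model of this argument: given a tensor's strides, count how many bytes are adjacent in memory when walking along the dimension the GEMM operand needs to be contiguous in. This is not Triton's actual pipelining check; the function name, the block shape, and the use of dimension 0 as the walking direction are assumptions made for illustration.

```python
def contiguous_run_bytes(shape, strides, dim, itemsize):
    """Bytes readable in one contiguous run when walking along `dim`.

    `strides` are in elements, as reported by tensor libraries.
    """
    if strides[dim] == 1:
        return shape[dim] * itemsize
    # Neighbouring elements along `dim` are not adjacent in memory.
    return itemsize


seq_len, head_dim = 512, 128  # hypothetical V block shape
fp8_itemsize = 1              # one byte per fp8 element

row_major = (head_dim, 1)     # strides of a row-major (seq_len, head_dim) tensor
col_major = (1, seq_len)      # strides of a column-major tensor

MIN_ASYNC_COPY_BYTES = 4      # threshold quoted in the text above

for name, strides in (("row-major", row_major), ("column-major", col_major)):
    run = contiguous_run_bytes(
        (seq_len, head_dim), strides, dim=0, itemsize=fp8_itemsize
    )
    print(name, run, "bytes, pipelinable:", run >= MIN_ASYNC_COPY_BYTES)
```

In this model, the row-major fp8 tensor offers only single-byte runs along the required dimension, below the 4-byte threshold, while the column-major tensor offers long runs.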

## Fix summary
- To fix this, we should enforce memory layouts for Q, K, V in FlexAttention when fp8 is being used, to ensure they each exist in HBM in the necessary memory layout to facilitate pipelined loads into SRAM ahead of the FP8 GEMMs

## Benchmarks
Rerunning the repro, we see fp8 runtime is reduced from roughly 121% of bf16 runtime to roughly 77% of bf16 runtime.

Before fix:

```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 19:07:33,402 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 19:07:35,885 - flex_bench - INFO - bf16: 424.87228804347734 us
2025-05-11 19:07:35,893 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 19:07:37,319 - flex_bench - INFO - fp8e4m3: 515.714000000001 us
```

After fix:
```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 17:34:38,223 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 17:34:41,157 - flex_bench - INFO - bf16: 423.4662032967036 us
2025-05-11 17:34:41,167 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 17:34:42,917 - flex_bench - INFO - fp8e4m3: 326.3694803493453 us
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153357
Approved by: https://github.com/ngimel, https://github.com/davidberard98
2025-05-15 02:41:38 +00:00
eaf2dee10e don't run triton mm for k<32 (#153550)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153550
Approved by: https://github.com/suo

Co-authored-by: Natalia Gimelshein <ngimel@meta.com>
2025-05-15 02:36:44 +00:00
725bbb6b5f [inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)
Fixes #151930

This PR updates the `assert_size_stride` and `assert_alignment` functions in [guards.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp) to accept an optional `op_name` argument and includes it in the error messages.

The corresponding type stubs in [guards.pyi](https://github.com/pytorch/pytorch/blob/main/torch/_C/_dynamo/guards.pyi) are updated to match the new function arg.

The code in [inductor/ir.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py) extracts the operator name from the FX graph and passes it into the `codegen_size_asserts` and `codegen_alignment_asserts` functions, so that generated assertions in Triton code include the op name for better debugging.

Added unit tests inside [test_torchinductor.py](https://github.com/pytorch/pytorch/blob/main/test/inductor/test_torchinductor.py).
- Verified both successful and failing assertion cases include the operator name.
- Verified that generated Triton code contains the op name inside the asserts.
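
The idea of carrying an optional operator name into the assertion message can be sketched in plain Python. `check_size_stride` is a hypothetical stand-in, and the message format is invented; the real functions live in `guards.cpp` and may word their errors differently.

```python
def check_size_stride(actual_size, actual_stride, expected_size, expected_stride,
                      op_name=None):
    """Hypothetical stand-in for a size/stride assertion with an op name."""
    if (tuple(actual_size) == tuple(expected_size)
            and tuple(actual_stride) == tuple(expected_stride)):
        return
    message = (
        f"expected size {tuple(expected_size)} and stride {tuple(expected_stride)}, "
        f"got size {tuple(actual_size)} and stride {tuple(actual_stride)}"
    )
    if op_name is not None:
        # The optional operator name tells the reader which op produced the tensor.
        message += f" (op: {op_name})"
    raise AssertionError(message)


try:
    check_size_stride((2, 3), (3, 1), (2, 3), (1, 2), op_name="aten.mm.default")
except AssertionError as err:
    failure = str(err)
    print(failure)
```

Because the argument is optional, call sites that do not know the operator name keep working unchanged.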

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152353
Approved by: https://github.com/jansel
2025-05-15 02:33:57 +00:00
f5e0806f34 [cutlass backend] Add back descriptive names for epilogue fusion (#153405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153405
Approved by: https://github.com/mlazos
2025-05-15 01:47:52 +00:00
82dc3457e0 Add load_state_dict hint doc about invoke order work with lr_scheduler (#149942)
Fixes #119168

## Test Result

![image](https://github.com/user-attachments/assets/edb8124c-f103-475a-b903-20fbc71fdea6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149942
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-05-15 01:07:36 +00:00
cyy
781ba0ac9d Update CMake to 3.27 in Windows CI (#153380)
This is needed before a newer CMake can be enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153380
Approved by: https://github.com/albanD
2025-05-15 00:19:32 +00:00
c2bc7e2827 API change for new enum in cusparseltsplitkmode-t for cusparseLT 0.7.0+ (#150536)
Changing the bool to int to express split_k_mode. Before 0.7.0 we only have 2 cusparseLtSplitKMode_t enum values ONE_KERNEL and TWO_KERNELS so a boolean is enough but since 0.7.0 there are more.

For Blackwell, there has to be a minor change to the parameter split_k_one_kernel (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/sparse/cuda/cuSPARSELtOps.cpp#L103), since new values were introduced to the enum [cusparseLtSplitKMode_t](https://docs.nvidia.com/cuda/cusparselt/types.html#cusparseltsplitkmode-t) and a bool type is not enough for it (it would have to be replaced with an integer).

Error we see without the change
```
RuntimeError: CUDA error: invalid value when calling `cusparseLtMatmulAlgSetAttribute( &handle, &alg_sel, CUSPARSELT_MATMUL_SPLIT_K_MODE, &splitKMode, sizeof(splitKMode))`

To execute this test, run the following from the base repo dir:
    python test/test_sparse_semi_structured.py TestSparseSemiStructuredCUSPARSELTCUDA.test_csrc_cslt_sparse_mm_search_cuda_int8
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150536
Approved by: https://github.com/jcaip, https://github.com/atalman
2025-05-14 23:36:53 +00:00
72fee137dd [ROCm] Maxpool forward NHWC Perf Improvement targeting Resnet scenarios (#151727)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151727
Approved by: https://github.com/seemethere

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2025-05-14 22:34:55 +00:00
e0dece510b [Ez][BE]: Remove accidental classvar (#153540)
Untyped variables become ClassVar in dataclasses, this type alias should just be a type alias; no need for it to be a classvar.
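
The pitfall described above can be reproduced in a few lines. This is a minimal sketch with a hypothetical `Config` dataclass (not the class touched by this PR): an assignment without an annotation in a dataclass body is not turned into a field and stays a plain class-level attribute.

```python
from dataclasses import dataclass, fields


@dataclass
class Config:  # hypothetical example class, not taken from the PR
    size: int = 0   # annotated: becomes a dataclass field
    Alias = dict    # unannotated: stays a plain class attribute


# Only annotated names are reported as dataclass fields.
field_names = [f.name for f in fields(Config)]
```

Moving such an alias to module level keeps it out of the class namespace entirely, which is presumably the intent of the cleanup.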

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153540
Approved by: https://github.com/albanD, https://github.com/aorenste
2025-05-14 21:55:56 +00:00
7412b33e91 [inductor] Use get to avoid possible keyerror at the end of precompilation (#153417)
Shameful admission: I have encountered this error 1-2 times, but don't have a repro.

torch/_inductor/select_algorithm.py", line 2022, in wait_on_futures
    elapsed_times[future],
    ~~~~~~~~~~~~~^^^^^^^^
torch._inductor.exc.InductorError: KeyError: <Future at 0x7fc4e394fb90 state=finished returned tuple>
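
A minimal sketch of the idea in the title, assuming a dictionary keyed by futures like `elapsed_times` in the traceback above. The helper name `collect_results` and the timing value are hypothetical; this is not the code in `select_algorithm.py`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def collect_results(futures, elapsed_times):
    """Pair each finished future's result with its recorded time, if any."""
    collected = []
    for future in as_completed(futures):
        # .get() returns None for a future that was never recorded,
        # where elapsed_times[future] would raise KeyError.
        collected.append((future.result(), elapsed_times.get(future)))
    return collected


with ThreadPoolExecutor(max_workers=2) as pool:
    first = pool.submit(pow, 2, 3)
    second = pool.submit(pow, 3, 2)
    # Only the first future has a timing entry (hypothetical value).
    results = collect_results([first, second], {first: 0.25})
```

The trade-off is that a missing entry is silently reported as `None` instead of failing, which seems acceptable for a value that is only used for logging at the end of precompilation.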

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153417
Approved by: https://github.com/Skylion007, https://github.com/ColinPeppler
2025-05-14 21:49:43 +00:00
f2e8e41855 [Easy][Inductor] Adds safety checks in get_estimated_runtime (#152821)
This PR adds checks on `gpu_memory_bandwidth` and `gpu_flops` in `get_estimated_runtime`. This will prevent division by zero and other potential incorrect values:
9210a98b92/torch/_inductor/scheduler.py (L864-L865)

9210a98b92/torch/_inductor/scheduler.py (L874)
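
The kind of guard described above can be sketched as follows. This is a simplified illustration, not the code in `scheduler.py`: the function name, its parameters (other than `gpu_flops` and `gpu_memory_bandwidth`), the units, and the choice to return `0.0` for unusable device numbers are all assumptions.

```python
def estimated_runtime_ns(flops, bytes_moved, gpu_flops, gpu_memory_bandwidth):
    """Roofline-style estimate; returns 0.0 when device numbers are unusable.

    Assumed units: gpu_flops in FLOP/s, gpu_memory_bandwidth in bytes/s.
    """
    # Guard both divisors so a missing or zero device spec cannot raise
    # ZeroDivisionError or produce a negative estimate.
    if gpu_flops <= 0 or gpu_memory_bandwidth <= 0:
        return 0.0
    compute_s = flops / gpu_flops
    transfer_s = bytes_moved / gpu_memory_bandwidth
    return max(compute_s, transfer_s) * 1e9
```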

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152821
Approved by: https://github.com/eellison, https://github.com/jansel
2025-05-14 21:46:59 +00:00
f887bfffda Fix typo (#153561)
Fix typo from #153386

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153561
Approved by: https://github.com/albanD
2025-05-14 21:38:51 +00:00
03d01860fd [dynamo][compile-time] Compute logging related flags once (#153426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153426
Approved by: https://github.com/jansel
2025-05-14 21:19:06 +00:00
1bd6bc7190 [BE]: Enable ruff YTT linter for Python version checks (#153547)
Adds ruff YTT checks to help future proof version checks and follow best practices here. Also makes it easier for static linters like mypy to detect python version branching.
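
As an illustration of the pattern these rules guard against, here is a small sketch. The helper names are hypothetical and exist only to make the behaviour easy to check; in real code the comparison is normally written inline as `if sys.version_info >= (3, 9):`, which is the form type checkers can recognise.

```python
import sys


def fragile_minor(version_string):
    # Slicing the version string breaks once the minor version has two digits.
    return version_string[:3]


def is_at_least(version_info, required):
    # Tuple comparison stays correct for any number of digits.
    return tuple(version_info[:2]) >= required


running_on_3_9_plus = is_at_least(sys.version_info, (3, 9))
```

For example, `fragile_minor("3.10.4")` yields `"3.1"`, which is why string slicing is flagged.
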
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153547
Approved by: https://github.com/albanD
2025-05-14 21:09:16 +00:00
f363a3f51a Revert "[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)"
This reverts commit 9386701b51aadce951bf38daf497b0257a3f2211.

Reverted https://github.com/pytorch/pytorch/pull/149282 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, see [D74729259](https://www.internalfb.com/diff/D74729259). @drisspg may you help out the author have their PR merged? ([comment](https://github.com/pytorch/pytorch/pull/149282#issuecomment-2881546951))
2025-05-14 20:53:49 +00:00
c92ea3bc98 [BE] Upgrade XPU support package to 2025.1 in CICD (#151899)
Addresses #151097, including the changes below:

- Add XPU support package 2025.1 build and test in CI for both Linux and Windows
- Keep XPU support package 2025.0 build in CI to ensure no break issue until PyTorch 2.8 release
- Upgrade XPU support package from 2025.0 to 2025.1 in CD for both Linux and Windows
- Enable XCCL in Linux CD wheel and oneMKL integration in both Linux and Windows
- Update XPU runtime pypi packages of CD wheels
- Remove deprecated support package version docker image build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151899
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-05-14 20:21:09 +00:00
5e6e52e7c9 [JIT] add GRAPH_DEBUG for setGraphExecutorOptimize (#153549)
Summary: Optionally log when setGraphExecutorOptimize is called, so we can get insight into the GraphExecutor behavior.

Differential Revision: D74692508

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153549
Approved by: https://github.com/PaulZhang12, https://github.com/SamGinzburg
2025-05-14 20:07:25 +00:00
dda2c7c8fc Pass inductor config for static cuda launcher to workers (#153382)
Async compile workers don't respect inductor configs generally that get changed in the middle of execution because they warm up early. StaticCudaLauncher is especially susceptible to this because it affects triton compilation without being part of the inductor meta. So we'll pass it in via extra configs on each worker run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153382
Approved by: https://github.com/masnesral, https://github.com/jansel
2025-05-14 20:01:32 +00:00
6a28cc826f Add TEST_HPU flag to set device type (#153461)
MOTIVATION
This PR includes a minor change to check for TEST_HPU flag as well before falling back to CPU. Without this flag, some tests were falling back to CPU causing them to fail.
Please refer to this RFC as well: https://github.com/pytorch/rfcs/pull/66

CHANGES
add TEST_HPU flag to some of the conditions checking the environment
use DEVICE_COUNT variable instead of torch.accelerator.device_count() API since the latter is not supported on out-of-tree devices like Intel Gaudi.
@ankurneog , @EikanWang , @cyyever , @guangyey

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153461
Approved by: https://github.com/EikanWang, https://github.com/cyyever, https://github.com/albanD
2025-05-14 19:31:40 +00:00
a54bf43baa Fix support of MixtureSameFamily [bugfix]. (#151317)
Fixes https://github.com/pyro-ppl/pyro/issues/3419 which is actually a `torch` bug that can be replicated by the below code:

```
from torch import rand
from torch.distributions import MixtureSameFamily, Categorical, Binomial

max_count = 20
probs = rand(10, 5)
binom_probs = rand(10, 5)

d = MixtureSameFamily(Categorical(probs=probs), Binomial(max_count, binom_probs))
d.log_prob(d.sample())
```

which results in:

```
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    d.log_prob(d.sample())
  File "pytorch\torch\distributions\mixture_same_family.py", line 168, in log_prob
    self._validate_sample(x)
  File "pytorch\torch\distributions\distribution.py", line 315, in _validate_sample
    valid = support.check(value)
            ^^^^^^^^^^^^^^^^^^^^
  File "pytorch\torch\distributions\constraints.py", line 307, in check
    (value % 1 == 0) & (self.lower_bound <= value) & (value <= self.upper_bound)
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The size of tensor a (10) must match the size of tensor b (5) at non-singleton dimension 1
```

### Fix explanation (only for cases when the component distribution contains parameters with batch dimensions)

- The failure is due to sample validation taking place before padding in `MixtureSameFamily.log_prob`, and hence the fix is to pad before doing sample validation.
- The fix itself does not alter the calculations at all. It only affects the sample validation process.
- The failure does not occur with the component distribution set to the `Normal` distribution, as its validation is not defined elementwise (the validation itself is elementwise).
- I've split the `test_mixture_same_family_log_prob` test into two tests based on the `Normal` and `Binomial` distributions.
- Initially, the `Binomial` version of the test did not fail, but this was due to the component distribution having equal batch dimensions of (5, 5) so I changed it to (10, 5).

### Updated fix explanation (for all cases)

- The previous fix caused a bug in sample shape validation (which is done correctly) due to the padding taking place before the sample validation.
- The updated fix corrects the support to reflect the fact that the support of `MixtureSameFamily` is equal to the support of its component distribution with the first event dimension removed.
- This issue was already anticipated in the [code](331423e5c2/torch/distributions/mixture_same_family.py (L127)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151317
Approved by: https://github.com/albanD, https://github.com/fritzo
2025-05-14 19:24:36 +00:00
clr
534b66fe30 torch.compile: Remove reference to the unused dynamo_config.dynamic_shapes from (#153297)
tests

This config option is not set anywhere, and does nothing, so this should cause
no changes to tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153297
Approved by: https://github.com/Skylion007
2025-05-14 19:02:51 +00:00
bf0fe4f828 Revert "[CUDA][CUDNN] Dispatch to cuDNN for non-batch-splittable 64-bit NCHW convolutions (#153101)"
This reverts commit ced90d23d3dfff42379fa032fe6a83b764d12e9f.

Reverted https://github.com/pytorch/pytorch/pull/153101 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages on main, tentative revert: https://github.com/pytorch/pytorch/actions/runs/15024667248/job/42224521705 ([comment](https://github.com/pytorch/pytorch/pull/153101#issuecomment-2881208171))
2025-05-14 18:52:07 +00:00
8749fe8439 [CI][MPS] Speedup test_large_bmm (#153562)
By computing matmuls of only one random non-zero batch on CPU

This reduces test runtime from 11 minutes to 14 sec
```
 % python3 test/test_mps.py -v -k test_large_bmm_
test_large_bmm_bfloat16 (__main__.TestMPS.test_large_bmm_bfloat16) ... ok
test_large_bmm_float16 (__main__.TestMPS.test_large_bmm_float16) ... ok

----------------------------------------------------------------------
Ran 2 tests in 27.495s

```

TODO: Compute it over two slices when https://github.com/pytorch/pytorch/issues/153560 is fixed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153562
Approved by: https://github.com/Skylion007, https://github.com/clee2000
2025-05-14 18:49:42 +00:00
47d6feff7c [export] Support no inputs in unflattened module (#153474)
Encountered in this diff D74589491
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153474
Approved by: https://github.com/avikchaudhuri
2025-05-14 18:45:47 +00:00
6ef1cbc191 Revert "[ROCm] Maxpool forward NHWC Perf Improvement targeting Resnet scenarios (#151727)"
This reverts commit e6a90672601ad3d636145dd8a68952281a6d1199.

Reverted https://github.com/pytorch/pytorch/pull/151727 on behalf of https://github.com/jeanschmidt due to Seems to be breaking internal builds, @seemethere may you help the author? [D74729252](https://www.internalfb.com/diff/D74729252) ([comment](https://github.com/pytorch/pytorch/pull/151727#issuecomment-2881122917))
2025-05-14 18:18:17 +00:00
533fc58453 [BE]: Fix typing None override other optimizers (#153386)
Follow up to #153367 to fix other instances of it throughout the codebase

Also fully type NamedOptimizer since we were so close

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153386
Approved by: https://github.com/tsunghsienlee, https://github.com/janeyx99, https://github.com/jansel, https://github.com/cyyever
2025-05-14 17:48:47 +00:00
2362bd4a4c [Torch][NT] Fix NestedTensor contiguous check condition. (#153237) (#153529)
Fixes #153237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153529
Approved by: https://github.com/jbschlosser
2025-05-14 17:15:48 +00:00
8bb67700a3 [dynamo] Support delattr on result of torch.compile(module) (#152741)
This is essentially a follow-up on #122098, where we added support of
`getattr` and `setattr` on result of `torch.compile(module)`, but didn't
add support for `delattr`.

Fixes #150711.
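
The forwarding pattern can be sketched in plain Python. This is a generic illustration with a hypothetical `Wrapper` class, not the actual `OptimizedModule` implementation: a wrapper that forwards attribute reads and writes to the wrapped object also needs `__delattr__`, otherwise `del wrapper.attr` looks at the wrapper itself and fails.

```python
class Wrapper:  # hypothetical stand-in for a module wrapper
    def __init__(self, orig):
        # Bypass our own __setattr__ so the reference lives on the wrapper.
        object.__setattr__(self, "_orig", orig)

    def __getattr__(self, name):
        # Called only when normal lookup on the wrapper fails.
        return getattr(object.__getattribute__(self, "_orig"), name)

    def __setattr__(self, name, value):
        setattr(self._orig, name, value)

    def __delattr__(self, name):
        # Without this method, deletion would not reach the wrapped object.
        delattr(self._orig, name)


class Plain:
    pass


wrapped = Plain()
wrapper = Wrapper(wrapped)
wrapper.x = 1   # stored on the wrapped object
del wrapper.x   # removed from the wrapped object
```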

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152741
Approved by: https://github.com/anijain2305
ghstack dependencies: #152740
2025-05-14 17:03:59 +00:00
6765df052c [dynamo] Emit warning on global module hooks when calling using output of torch.compile(module) (#152740)
When we do `torch.compile(module)`, we eventually end up returning a new
`OptimizedModule` instance, whose `forward` method is the result of
`torch.compile(mod.__call__)`, meaning it already captures all the extra
logic (e.g., hook firing) for the compiled module.

`OptimizedModule` also inherits `nn.module.__call__`, and thus
has its own hook logic. This is useful for torchao, which injects module
forward hooks to run in eager for quantization purposes.

However, this might create unexpected behavior for global module hooks,
because `torch.compile(module)` causes the hook to fire one extra time
for `OptimizedModule`, when compared to eager.

To preserve BC, we simply emit a warning for this behavior, and let
users decide what to do. This is reasonable because the global module
hooks are documented to be used for debugging/profiling purposes only.

Fixes #149502

Differential Revision: [D74611716](https://our.internmc.facebook.com/intern/diff/D74611716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152740
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-05-14 17:03:59 +00:00
b3dea0c0dd Change aoti cpp tests to run serially within file (#152960)
Fixes #152674
https://github.com/pytorch/pytorch/issues/152889
https://github.com/pytorch/pytorch/issues/152888
https://github.com/pytorch/pytorch/issues/152891

`--dist=loadfile` ensures all tests in the same source file run in the same worker.

Tests like `FreeInactiveConstantBufferRuntimeConstantFoldingCuda` expect exclusive access to memory during test time to compute diffs (e.g., initMemory - updateMemory2 == DATASIZE).

With `-n 3`, tests run in separate processes, but CUDA device memory is shared — and cudaMemGetInfo() reads device-wide global state.

```
 python test/run_test.py --cpp --verbose -i cpp/test_aoti_inference -dist=loadfile
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152960
Approved by: https://github.com/desertfire, https://github.com/cyyever
2025-05-14 17:02:39 +00:00
ba70876407 Update lint_urls.sh (#153246)
Treat 403, 429 and 503 http errors as success.
Ignore non-verbal hostnames.
Kill child jobs immediately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153246
Approved by: https://github.com/malfet
2025-05-14 16:54:49 +00:00
b6b0080419 [DCP] Use multiprocess Pipes instead of Queues to improve communication contract with checkpointer process (#153488)
Summary:
### Diff Context
- PR introduces Pipes for multiprocess comms with checkpointer process.
- Pipes allow easier comms contract management due to close() API and catch-all feature when background process is dead (e.g. seg faults).
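
The `close()` contract mentioned above can be seen with the standard library alone. In this minimal single-process sketch (the helper name `drain` is hypothetical and this is not the DCP code), the reader learns that the writer is gone through `EOFError`, where a blocking `Queue.get()` would simply wait.

```python
import multiprocessing as mp


def drain(conn):
    """Receive messages until the other end of the pipe is closed."""
    items = []
    while True:
        try:
            items.append(conn.recv())
        except EOFError:
            # Raised once nothing is left to read and the peer has closed.
            break
    return items


reader, writer = mp.Pipe()
writer.send("step-1")
writer.send("step-2")
writer.close()  # signals end-of-stream to the reader
received = drain(reader)
```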

Test Plan: CI

Differential Revision: D74668559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153488
Approved by: https://github.com/saumishr
2025-05-14 16:47:43 +00:00
8799bffc34 [BE][Ez]: RUF200 - validate pyproject.toml metadata (#153543)
Since we have pyproject.toml metadata for [project] and [build-requires], let's turn on the linter rule that validates this optional metadata to make sure it's properly formatted and follows the correct schema for standard Python build tools.

Right now, incorrect metadata could silently error with how our CI is invoked or only provide warnings for invalid metadata. This check will help surface those errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153543
Approved by: https://github.com/albanD
2025-05-14 16:42:22 +00:00
7d39e73c57 Fix more URLs (#153277)
Or ignore them.
Found by running the lint_urls.sh script locally with https://github.com/pytorch/pytorch/pull/153246

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153277
Approved by: https://github.com/malfet
2025-05-14 16:23:50 +00:00
de92296bbb [Intel GPU] undo broadcast on zero stride tensor for SDPA (#151976)
Fix https://github.com/pytorch/pytorch/issues/152290.

The model **hubert** uses aten::expand to build the attention mask by broadcasting. PyTorch uses strides[d]=0 to represent broadcast, which is not supported by oneDNN. This PR handles this scenario.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151976
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/drisspg
2025-05-14 16:09:03 +00:00
1f48bab377 Update torch-xpu-ops commit pin (#153445)
Update the torch-xpu-ops commit to [207105038963e5f9f012f1a0cfd3b9f57b2ab5b0](2071050389), includes:

- Improve the accuracy of `upsample_bilinear2d_backward`
- Enhance the performance of `avg_pool2d`
- Update the implementation of scatter-gather and indexing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153445
Approved by: https://github.com/guangyey, https://github.com/EikanWang
2025-05-14 15:34:47 +00:00
2e440e39a6 [nativert] Move Placement to pytorch core (#152953)
Summary:
Move Placement to pytorch core.

Using `torch::nativert::isSameDevice` explicitly in code to avoid confusion with the `isSameDevice` in torch namespace.

Test Plan:
```
buck run fbcode//mode/dev-nosan  //caffe2/test/cpp/nativert:placement_test

./bin/test_nativert
```

OSS and internal CI

Differential Revision: D74190745

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152953
Approved by: https://github.com/Skylion007, https://github.com/swolchok, https://github.com/zhxchen17, https://github.com/cyyever
2025-05-14 15:26:54 +00:00
eqy
ced90d23d3 [CUDA][CUDNN] Dispatch to cuDNN for non-batch-splittable 64-bit NCHW convolutions (#153101)
For #152816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153101
Approved by: https://github.com/Skylion007
2025-05-14 15:22:47 +00:00
0ce941f994 [audio hash update] update the pinned audio hash (#153507)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153507
Approved by: https://github.com/pytorchbot
2025-05-14 15:16:35 +00:00
cd119ddd7c Add matching against hypothetical (new) ghstack pull-request trailer (#153528)
I would like to change ghstack to use a new trailer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153528
Approved by: https://github.com/malfet
2025-05-14 14:07:01 +00:00
8f3d7972ad [dynamo][compile-time] Cache the function signature to speedup inlining (#153396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153396
Approved by: https://github.com/jansel, https://github.com/StrongerXi
ghstack dependencies: #153333
2025-05-14 14:01:46 +00:00
2344eca5eb Revert "Fix skipIfXpu and skipIfHpu disables tests when used on class (#151315)"
This reverts commit ee096b89f63394b2c18826288783eef241f3959c.

Reverted https://github.com/pytorch/pytorch/pull/151315 on behalf of https://github.com/jeanschmidt due to Seems to have introduced internal regressions, see [D74668899](https://www.internalfb.com/diff/D74668899). @malfet may you help the author get this PR merged? ([comment](https://github.com/pytorch/pytorch/pull/151315#issuecomment-2880203323))
2025-05-14 13:15:03 +00:00
2c1912452d Revert "Rewrite autograd producer consumer stream sync logic (#151079)"
This reverts commit f78e4529a9d446deb77c6ac38184582f6ab9167a.

Reverted https://github.com/pytorch/pytorch/pull/151079 on behalf of https://github.com/jeanschmidt due to Seems to have introduced regressions in internal signals, see [D74648937](https://www.internalfb.com/diff/D74648937) ([comment](https://github.com/pytorch/pytorch/pull/151079#issuecomment-2880176879))
2025-05-14 13:07:12 +00:00
a628efd1e8 Revert "Enable accelerator to perform streaming backward (#153412)"
This reverts commit d5d26ce43641a19c3e36a751b59b7fa3825cea83.

Reverted https://github.com/pytorch/pytorch/pull/153412 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/151079 ([comment](https://github.com/pytorch/pytorch/pull/153412#issuecomment-2880169739))
2025-05-14 13:04:27 +00:00
e8f7a97e2e [Refactor] Explicilty spell out the namespace for device() function (#153248)
Summary: To prepare for the upcoming header-only file change. The same files have been using a mixed style of at::device() and device(). Given these .cpp files are not in the at namespace, it makes sense to spell them out explicitly.

Differential Revision: [D74577412](https://our.internmc.facebook.com/intern/diff/D74577412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153248
Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/janeyx99
2025-05-14 12:00:47 +00:00
0ef5ba43a6 Fix negative dim issue in for parallel loss context manager (#152785)
Facing a similar issue to #152016, and added the fix as per @tianyu-l's solution.
Fixes #152016

 Tagging @tianyu-l @atalman  for review.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152785
Approved by: https://github.com/tianyu-l
2025-05-14 10:43:27 +00:00
864a5f4434 [dynamo][compile-time] Cache the cleaned insturctions while inlining (#153333)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153333
Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/williamwen42
2025-05-14 09:26:26 +00:00
0139ce9303 Add skip_dtype_check_in_meta_registrations config to torch/fx/experimental/_config (#153513)
Helion relies on torch/fx/experimental 's fake_tensor tracing but does its own dtype checking, which conflicts with some meta kernel's existing dtype checking. This PR adds a config so that we skip those dtype checking in meta kernels and rely on the calling system to do the dtype checking.

Currently it only applies to `baddbmm`, but I expect that similar changes will need to be done to other meta kernels in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153513
Approved by: https://github.com/jansel
2025-05-14 09:14:11 +00:00
4015166e5d [ROCm] Maxpool backward NHWC Perf Improvement targeting Resnet scenarios (#152267)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152267
Approved by: https://github.com/jeffdaily
2025-05-14 06:59:29 +00:00
4c5cf18ee0 [device_mesh] improve device selection logic (#150897)
as titled, this PR improves the device selection logic when user did not
set the device before calling the DeviceMesh constructor, as a device
manager, DeviceMesh should try to set the device for users in a good
way.

The behavior of set_device before:

* If the user calls init_process_group to init a world process group, we assume the user already called set_device and we don't set the device for the user
* If the user does not init a world process group themselves, we init a world process group for the user and follow a heuristic to set the device.
This is OK, but sometimes the set_device heuristic doesn't work well (e.g. if the user uses TORCH_CUDA_VISBILE_DEVICES).

So this PR improves the device selection logic to:

* If the default CUDA context is already initialized by the time we init DeviceMesh, we assume the user has called some CUDA operation before and therefore must have selected the device themselves
* Otherwise, we check whether the envvars "LOCAL_RANK" and "WORLD_SIZE" are set by the launcher (e.g. torchrun); if so, we use "LOCAL_RANK" to set the device for the current process, which is a very standard practice. (This solves the TORCH_CUDA_VISBILE_DEVICES issue)
* Otherwise, we warn the user about the situation and fall back to the old heuristic.
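
The decision order in the list above can be sketched as a pure function. This is an illustration only, not the DeviceMesh code: the function name and all of its parameters are hypothetical, and the state of the CUDA context is passed in as a plain boolean so the sketch runs without a GPU.

```python
import os
import warnings


def select_device_index(context_initialized, current_index, fallback_index, env=None):
    """Return the device index to use, following the three-step order above."""
    env = os.environ if env is None else env
    if context_initialized:
        # The user already ran a CUDA operation, so respect their choice.
        return current_index
    if "LOCAL_RANK" in env and "WORLD_SIZE" in env:
        # A launcher such as torchrun exported these variables.
        return int(env["LOCAL_RANK"])
    warnings.warn("Device was not set explicitly; falling back to a heuristic.")
    return fallback_index
```

Passing `env` explicitly keeps the function easy to test; a caller would normally leave it as `None` to read the real environment.
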
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150897
Approved by: https://github.com/tianyu-l
ghstack dependencies: #150898
2025-05-14 06:29:16 +00:00
0f891cad5a Enable ruff check for torch/utils/data/*.ipynb (#148654)
Fixes part of #146411

Enable ruff check for `torch/utils/data/*.ipynb` files

## Test Result

```bash
lintrunner -a --take RUFF torch/utils/data/*.ipynb
```

![image](https://github.com/user-attachments/assets/88fddc91-3f19-4704-9aef-2cabd2cdc96e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148654
Approved by: https://github.com/Skylion007
2025-05-14 06:21:47 +00:00
f7798d8645 Checks kv pair indexing in OrderedPreservingDictTest.test_range_insert (#148136)
`OrderedPreservingDictTest.test_range_insert` has an [unused loop variable `j`](https://github.com/pytorch/pytorch/blob/main/c10/test/util/ordered_preserving_dict_test.cpp#L186), I think taken from the [inspired project](https://github.com/pytorch/pytorch/blob/main/c10/test/util/ordered_preserving_dict_test.cpp#L165) testcase for range inserts, where it [checks kv pair indexing/order](https://github.com/Tessil/ordered-map/blob/master/tests/ordered_map_tests.cpp#L136) for the ordered dict.

This just adds in that functionality to the test case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148136
Approved by: https://github.com/eellison
2025-05-14 06:05:23 +00:00
11c64b7cf8 [dynamo][compile-time] Cache whether a function is inlineable (#153192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153192
Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/williamwen42
ghstack dependencies: #153458
2025-05-14 05:40:25 +00:00
e2ce17c6ef [SymmMem][a2av] Use more CTAs for intra-node case (#153509)
Previously, we launch the a2av kernel with at most 8 blocks for intra-node cases, which turns out to saturate only 57 GB/s bandwidth.

This PR adds more blocks for intra-node, up to 8 per peer, pumping up data parallelism.  The kernel now achieves 350 GB/s SOL for Hopper. See figure.

It also uses a simple tuning based on input size to avoid jumping to 8 CTAs directly (i.e. 1, 2, 4, then 8)

For inter-node, we cap at 8 blocks, since 57 GB/s seems bigger than regular NIC bandwidths (400 Gb/s).

![all_to_all_vdev Performance on 8xH100](https://github.com/user-attachments/assets/d4b841e6-4c42-4a2e-aa9f-2bc116ba9d25)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153509
Approved by: https://github.com/ngimel
ghstack dependencies: #153483
2025-05-14 04:24:32 +00:00
20dbe644c7 [CD] Fix the libgomp twice load issue (#150084)
Fixes #149422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150084
Approved by: https://github.com/malfet, https://github.com/leslie-fang-intel, https://github.com/atalman

Co-authored-by: LifengWang <lifeng.a.wang@intel.com>
2025-05-14 04:06:18 +00:00
316c15297c [MemoryZ] Show the current and max entries rendered (#153446)
Summary: as title

Test Plan: {F1977904091}

Differential Revision: D74626081

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153446
Approved by: https://github.com/sraikund16
2025-05-14 03:16:12 +00:00
c797f1285c [dynamo][copmile-time] Handle builtins first in LOAD_GLOBAL (#153458)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153458
Approved by: https://github.com/jansel
2025-05-14 03:04:38 +00:00
33a5179269 [AOTI][reland2] Remove typedef for half and bfloat16 (#153467)
Summary:
Reland https://github.com/pytorch/pytorch/pull/151109 after fixing cutlass AOTI build issues.

typedef is prone to name collision. Explicitly spell out the actual aten types, needed for the standalone AOTI codegen.

Differential Revision: D74398762

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153467
Approved by: https://github.com/jingsh, https://github.com/henrylhtsang, https://github.com/cyyever
2025-05-14 02:37:18 +00:00
9ad9a04ca7 Add TensorLR variant for fused Adagrad on CPU (#153078)
This PR adds a tensor LR variant for the CPU Adagrad(fused=True).

I copied the behavior from the tensor LR variant of CPU Adam(fused=True), where the `lr.item()` is cast to a double and passed in the default function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153078
Approved by: https://github.com/janeyx99
2025-05-14 02:23:33 +00:00
d51bc27378 [export] Make draft_export public (#153219)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153219
Approved by: https://github.com/pianpwk
2025-05-14 02:18:36 +00:00
b15b870903 [BE] remove outdated torch/README.md (#153500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153500
Approved by: https://github.com/albanD, https://github.com/cyyever
2025-05-14 02:10:30 +00:00
d759a517af Update the heuristic for AArch64 bmm/baddbmm (#149122)
Updates heuristic for bmm/baddbmm and consolidates all heuristic logic in a single location

 - The goal of the consolidation is to improve maintainability and readability of the heuristic logic. Instead of different parts scattered across two files, this patch centralizes everything inside `Matmul.cpp`, where there already exists heuristic-based selection for mkldnn.
 - The logic of the check itself doesn't change (existing code is reused where possible) but a separate heuristic threshold for bmm/baddbmm is introduced based on newer benchmarking data. Use the script below to see the performance improvement for bmm from the new heuristic:
 ```
import torch
import time

# Set below to True to use cases selected by only one of the heuristics.
USE_ONLY_DIVERGENT_TEST_CASES = True
BATCH_SIZES  = [ 1, 8, 32, 64, 128, 256 ]
M_DIMS       = [ 4, 8, 16, 32, 64, 256, 512 ]
N_DIMS       = [ 4, 8, 16, 32, 64, 256, 512 ]
K_DIMS       = [ 4, 8, 16, 32, 64, 256, 512 ]
ITERS = 50

def old_heuristic(m, n, k):
     is_above_min_dims = m > 8 and n > 8 and k > 8
     is_above_min_size = m*n*k > 8_192
     return is_above_min_dims and is_above_min_size

def new_heuristic(b, m, n, k):
     return b*b*m*n*k >= 4_194_304

def generate_test_cases():
    test_cases = []
    for b in BATCH_SIZES:
        for m in M_DIMS:
            for n in N_DIMS:
                    for k in K_DIMS:
                        if USE_ONLY_DIVERGENT_TEST_CASES:
                            if old_heuristic(m, n, k) != new_heuristic(b, m, n, k):
                                test_cases.append([b, m, n, k])
                        else:
                            test_cases.append([b, m, n, k])
    return test_cases

def test(x, y):
    for _ in range(5):
        torch.bmm(x, y)
    perf = 0.0
    for _ in range(ITERS):
        start = time.time()
        torch.bmm(x, y)
        end = time.time()
        perf += (end - start) / ITERS
    return perf

def main():
    print(f"{'b':<10}{'m':<10}{'n':<10}{'k':<10}{'time (s)':10}")
    cumulative_mean_time = 0.0
    for b, m, n, k in generate_test_cases():
        mean_time = test(torch.rand(b, m, n), torch.rand(b, n, k))
        cumulative_mean_time += mean_time
        print(f"{b:<10}{m:<10}{n:<10}{k:<10}{mean_time:10.3e}")
    print(f"Cumulative mean time = {cumulative_mean_time:.4f} s")

if __name__ == "__main__":
    main()
```

From the script we see that cumulative mean time from all test cases (at 16 threads) is:
 - 1.6195 s for the old heuristic
 - 0.7012 s for the new heuristic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149122
Approved by: https://github.com/fadara01, https://github.com/aditew01, https://github.com/malfet
2025-05-14 02:03:50 +00:00
e8662e836a Remove std::is_arithmetic specialization from c10/util/strong_type.h (#153424)
Specializing std::is_arithmetic has undefined behavior (and breaks builds with -Winvalid-specialization). Should fix #150901

Differential Revision: [D74614724](https://our.internmc.facebook.com/intern/diff/D74614724/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153424
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-14 02:01:32 +00:00
clr
85f97b5a8c compile_fx: make a compile event that corresponds to the fx_compile waitcounter (#152983)
This is a pretty minor change, but by having exact correspondence, we can
easily confirm data differences between perfetto and wait counters

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152983
Approved by: https://github.com/jansel, https://github.com/masnesral
2025-05-14 01:54:42 +00:00
90001554bf [SymmMem][a2av] Fix TODO: change stride unit (#153483)
Previous kernel impl assumes float type. This PR makes it general by passing stride in unit of bytes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153483
Approved by: https://github.com/fegin, https://github.com/ngimel
2025-05-14 01:47:54 +00:00
eqy
9386701b51 [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)
cleanup tuple/tensor boilerplate in cuDNN SDPA, preparation for nested/ragged tensor backward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg
2025-05-14 01:39:24 +00:00
8521a690f7 [dynamo] fix potential circular import error in decorators.py (#153217)
Differential Revision: [D74442043](https://our.internmc.facebook.com/intern/diff/D74442043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153217
Approved by: https://github.com/jansel
2025-05-14 01:01:57 +00:00
e6a9067260 [ROCm] Maxpool forward NHWC Perf Improvement targeting Resnet scenarios (#151727)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151727
Approved by: https://github.com/jeffdaily

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2025-05-14 00:58:00 +00:00
7f79222992 Upgrade to NCCL 2.26.5 for CUDA 12 (#152810)
Upgrade NCCL to latest 2.26.5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152810
Approved by: https://github.com/eqy, https://github.com/albanD, https://github.com/nWEIdia, https://github.com/atalman, https://github.com/cyyever
2025-05-14 00:52:50 +00:00
8739a8c288 elastic: do not shutdown rendezvous on leaving workers (#152525)
In #117066, a shutdown of the rendezvous was added for the case where a worker shuts down. This is incorrect, because the rendezvous is actually shut down in [this file](fa6f9eb2be/torch/distributed/launcher/api.py (L290)) but should not be shut down if a signal is received. See also [this pull request](https://github.com/pytorch/pytorch/pull/67749).

#124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case. But this is only triggered if the agent restarts the training, but not if the shutdown of the rendezvous happened before.

Removing both of these changes restores the original behavior. The rendezvous should only be shut down if a run completes or fails, not when a single worker leaves.

Fixes #150916
Fixes #147064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152525
Approved by: https://github.com/kiukchung
2025-05-14 00:44:10 +00:00
8ac82c3e72 [export] support functools.partial forward (non-strict) (#153408)
Fixes #153086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153408
Approved by: https://github.com/tugsbayasgalan
2025-05-13 23:30:13 +00:00
40b719c97d [nativert] move executor config to torch (#153087)
Summary:
nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed.

This diff moves the executor config to torch. Since it is header-only, this requires some changes to the libtorch build configs.

Test Plan: CI

Differential Revision: D74278789

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153087
Approved by: https://github.com/zhxchen17
2025-05-13 23:26:00 +00:00
3498201e57 GPU lowering uses aoti_call_delegate (#153282)
Summary: Skip custom objects when serializing the weight nodes of `aoti_call_delegate` hop as they are not consumed by the runtime.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D73704385

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153282
Approved by: https://github.com/dolpm, https://github.com/SherlockNoMad
2025-05-13 23:23:27 +00:00
81719ebde3 [caffe2] Make c10::str works with scoped enum (#152705) (#152714)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/152705

Test Plan:
```
buck2 test fbcode//caffe2/c10/test:util_base_tests --fail-fast
```

Differential Revision: D74087796

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152714
Approved by: https://github.com/Skylion007
2025-05-13 21:05:36 +00:00
e8596c291b Fix misleadingly high AOT Inductor dashboard performance (#153060)
Fixes misleadingly high AOTInductor performance benchmark numbers in scenarios where a model updates internal parameters during `torch.export.export`. Since `FakeTensorMode` is enabled during export, all such parameters become `FakeTensor`s, slowing down future eager-mode runs using that model substantively. This, in turn, causes misleading performance stats, where the slowness of eager-mode makes `AOTInductor` look _very_ good.

An [example benchmark](https://hud.pytorch.org/benchmark/timm_models/inductor_aot_inductor?dashboard=torchinductor&startTime=Wed%2C%2030%20Apr%202025%2015%3A54%3A04%20GMT&stopTime=Wed%2C%2007%20May%202025%2015%3A54%3A04%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=main&lCommit=1dd36ad2d440a4f3faf724b3a8e13925e3180c24&rBranch=main&rCommit=cc7346bf19c019255dcb4484694a75850ed74d5a&model=convit_base) with this issue. The equivalent `cpp_wrapper` benchmark run shows a 2x performance gain, not 20x.

Only two benchmarks we regularly run are affected by this, both in the TIMM set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153060
Approved by: https://github.com/desertfire
2025-05-13 20:59:59 +00:00
a13c8f2ecb [EZ/Profiler] Replace manual GIL calls with pybind GIL calls (#153415)
Summary: Use pybind11::gil_scoped_acquire instead of old impl as it will automatically take care of error handling. In the original implementation we missed releasing the GIL on each possible error which could put the program in a deadlock

Test Plan: Induced error manually and saw that GIL was released

Differential Revision: D74593564

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153415
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-13 20:47:52 +00:00
5ff2cb8587 Add justknobs for static cuda launcher (#153400)
Summary:
This diff adds a justknobs check for static cuda launcher. In particular, it supports a fractional rollout where each mast job/version can be consistently enrolled in the config on or off.

It also adds a set_feature_use so we can track whether static cuda launcher is enabled on a given dynamo compile.

Test Plan: Existing unit tests. The justknobs in question are set to be disabled right now, so this diff does not launch the feature yet.

Differential Revision: D74599203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153400
Approved by: https://github.com/oulgen
2025-05-13 20:10:13 +00:00
clr
20ba8fe7e6 induct: Log a pt2 compile event + waitcounter for node fusing. (#153270)
This appears to be slow in production (potentially a quadratic explosion), and
logging this explicitly in pt2_compile_events and wait_counters makes it a lot easier to see how
bad of an issue this is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153270
Approved by: https://github.com/masnesral
2025-05-13 19:02:36 +00:00
8ac82a1d20 [dynamo] Add test to ensure we don't print fx graph upon data dependent graph break (#153416)
This adds a regression test for #149831, also as part of getting it
cherry-picked into 2.7.1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153416
Approved by: https://github.com/atalman
2025-05-13 18:28:02 +00:00
9df9d9ded0 [device_mesh] replace dim_group_info with group_name (#150898)
as titled, there's no need to maintain a dim_group_info anymore, we can
simply maintain a list of group_name instead. This will simplify the
logic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150898
Approved by: https://github.com/tianyu-l, https://github.com/fegin
2025-05-13 17:16:45 +00:00
9c3cef437c gloo: support ibverbs in cmake (#153425)
This updates the gloo submodule in PyTorch to a version that supports the new ibverbs backend that can be used with PyTorch.

Test plan:

```
sudo dnf install rdma-core-devel
USE_GLOO_IBVERBS=ON python setup.py develop
torchrun --nproc_per_node 2 ~/scripts/gloo_ibverbs_test.py
```

```py
"""
run with:

torchrun --nproc_per_node 2 ~/scripts/gloo_ibverbs_test.py
"""

import os

os.environ["GLOO_DEVICE_TRANSPORT"] = "IBVERBS"

import torch
import torch.distributed as dist

dist.init_process_group("gloo")

rank = dist.get_rank()

if rank == 0:
    device = "cpu"
else:
    device = "cuda"

print(device)

t = torch.full((10, 100), fill_value=(rank+1), device=device)
target = torch.full((10, 100), fill_value=3, device=device)

dist.all_reduce(t)

torch.testing.assert_close(t, target)

t = torch.full((10, 100), fill_value=(rank+1), device=device)

if rank == 0:
    dist.send(t, dst=1)
else:
    dist.recv(t, src=0)
    torch.testing.assert_close(t, torch.full_like(t, 1))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153425
Approved by: https://github.com/fduwjj
2025-05-13 17:09:00 +00:00
dde705864a Fix test broken by D73809989 (#153413)
Summary: I forgot to remove this unused field in D73809989.

Test Plan: `buck test 'fbcode//mode/opt' fbcode//caffe2/test:fbonly -- --exact 'caffe2/test:fbonly - test_compilation_metrics_logger_in_sync (caffe2.test.fb.test_fb.TestFBOnly)'`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153413
Approved by: https://github.com/c00w
2025-05-13 16:44:30 +00:00
216e28f7e9 [ca] run xfails up until their last passing backend (#153279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153279
Approved by: https://github.com/jansel
ghstack dependencies: #153193, #153222
2025-05-13 16:42:10 +00:00
a80eb84a5f [ca] support higher order gradients (create_graph=True) (#153222)
Adds create_graph support if you don't compile or compile only with torch.compile(backend="eager").

Using a backend that uses AOTDispatch produces a post-dispatch AOT backward, where its double backward will be silently incorrect if the forward trace involved any ops that are not composite implicit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153222
Approved by: https://github.com/jansel
ghstack dependencies: #153193
2025-05-13 16:42:09 +00:00
37efaf4af9 [ca][api] config api shouldn't error with optimize_assert (#153193)
Toggling on `torch._dynamo.config.compiled_autograd = True` was erroring export (optimize_assert didn't have `rebuild_ctx` defined). Separately add a way to `rebuild_ctx` for `optimize_assert` since it is a public API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153193
Approved by: https://github.com/jansel
2025-05-13 16:42:02 +00:00
a4459cd4e3 Remove property from python_type function (#152900)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152900
Approved by: https://github.com/amjames, https://github.com/anijain2305
ghstack dependencies: #153070
2025-05-13 16:26:25 +00:00
f67eb6f8c5 Fix path matching in CPythonTestCase/setUpClass (#153070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153070
Approved by: https://github.com/amjames, https://github.com/anijain2305, https://github.com/Skylion007
2025-05-13 16:26:25 +00:00
c5ebc12f7f [ROCm] unkip test_non_standard_bool except for failings ops (#152956)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152956
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-05-13 15:55:42 +00:00
445d8fd77d [MemoryZ] Sync changes to internal page (#153166)
Summary:
For MTIA on-demand mode, since we are not using a torch Module, the data upload happens in C++ and doesn't support pickle.
Thus, we store the data as JSON at the end and need to update the visualizer to support it.

Test Plan: Check Test plan in D74179606

Differential Revision: D74406209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153166
Approved by: https://github.com/sraikund16
2025-05-13 15:35:10 +00:00
ea3eaf68bf Fix AOTI cpp tests (#153423)
The `Error in dlopen: /lib/x86_64-linux-gnu/libstdc++.so.6: version GLIBCXX_3.4.30 not found` error was caused by the cmake migration (the conda one probably has some extra link rules), while the `C++ exception with description "CUDA error: no kernel image is available for execution on the device` errors were caused by the tests being built for Maxwell but run on SM_86.

The remaining test was failing before, but was probably disabled.
TODOs:
 - Move build to the build step

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153423
Approved by: https://github.com/huydhn, https://github.com/cyyever
2025-05-13 15:25:03 +00:00
6b02e60838 [Intel GPU] Use user-friendly err msg in mm (#151655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151655
Approved by: https://github.com/EikanWang
2025-05-13 15:13:21 +00:00
7fdd754136 [compile-time traces] Profile large missing gaps in compile time (#151256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151256
Approved by: https://github.com/bdhirsh, https://github.com/masnesral, https://github.com/zou3519, https://github.com/jansel
2025-05-13 14:44:51 +00:00
ee096b89f6 Fix skipIfXpu and skipIfHpu disables tests when used on class (#151315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151315
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-13 14:44:17 +00:00
d9ef1012db [PP] Optimize memory usage by releasing output memory earlier (#153383)
Considering `output_chunks` is only used for last stage, we should not keep the outputs of each stage in memory; this will allow memory to be freed earlier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153383
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2025-05-13 14:42:38 +00:00
f1de3f9f07 Rename "output_tensor" -> "out" in autotune_process.py (#153169)
Summary: This change is to support remote autotuning. I want to use all the same benchmarking utilities in select_algorithm.py. For remote autotuning, I'll reuse the TritonBenchmarkRequest class used for subprocess autotuning because it's already serializable. That class is also used in standard, in-process autotuning, but via TritonTemplateCaller.benchmark() which sets the output_tensor param when calling the underlying TritonBenchmarkRequest. For remote, I'll be using the TritonBenchmarkRequest request directly so I want the parameter to be named 'out' to avoid "got an unexpected keyword argument 'out'".

Test Plan: Existing unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153169
Approved by: https://github.com/aorenste, https://github.com/eellison
2025-05-13 14:18:29 +00:00
9f98e37eb4 [Intel GPU] add tf32 support for matmul on XPU (#144240)
Support xpu tf32 matmul using torch.backends.mkldnn.allow_tf32. We will discuss in the future whether we need a new API to control matmul only.
~~Support xpu tf32 matmul using torch.set_float32_matmul_precision. For conv, check https://github.com/pytorch/pytorch/pull/137570
We decide not following torch.backends.cuda.matmul.allow_tf32 because this API actually calls setAllowTF32CuBLAS to set matmul_precison to high. We also avoid other related tf32 changes (i.e. in inductor) by not introducing new API.~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144240
Approved by: https://github.com/EikanWang
2025-05-13 14:03:01 +00:00
ff039d39ec [Dynamo] Optimize dedupe region ancestor tracking (#152589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152589
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506, #152570, #152572
2025-05-13 12:17:59 +00:00
d0faa9985d [Dynamo] Fix typing in graph_deduplication.py (#152572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152572
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506, #152570
2025-05-13 12:17:59 +00:00
a415c9831f [Hierarchical Compile] Replace tracing alias and mutation check with dynamo impl (#152570)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152570
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506
2025-05-13 12:17:59 +00:00
57dafb90ef [Hierarchical Compile] Take into account mutation deps in cycle detection (#152506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152506
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410
2025-05-13 12:17:59 +00:00
118192011e [Hierarchical Compile] Add mutation dependencies to topological sorting (#152410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152410
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505
2025-05-13 12:17:59 +00:00
3592cb52d9 [Hierarchical Compilation] Use universal flatten APIs (#152505)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152505
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389
2025-05-13 12:17:59 +00:00
023a3dc69f [Hierarchical Compilation] Track node mutations (#152389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152389
Approved by: https://github.com/anijain2305
2025-05-13 12:17:59 +00:00
edc2d539d1 torch.tensordot: performance improvements when contracting to a scalar. (#145936)
As per title.
Fixes https://github.com/pytorch/pytorch/issues/145731

Touches only compute. The CPU overhead can potentially be further reduced.

Before:
```python
In [3]: n = 512

In [4]: A = torch.rand(n, n)

In [5]: B = torch.rand(n, n)

In [6]: %timeit torch.tensordot(A, B, [[0, 1], [0, 1]])
2.04 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit torch.tensordot(A, B, [[0, 1], [1, 0]])
2.85 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit torch.tensordot(A, B, [[1, 0], [0, 1]])
2.9 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit torch.tensordot(A, B, [[1, 0], [1, 0]])
4.07 ms ± 262 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

After
```python
In [2]: n = 512

In [3]: A = torch.rand(n, n)

In [4]: B = torch.rand(n, n)

In [5]: %timeit torch.tensordot(A, B, [[0, 1], [0, 1]])
30.7 µs ± 2.51 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [6]: %timeit torch.tensordot(A, B, [[0, 1], [1, 0]])
141 µs ± 6.52 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [7]: %timeit torch.tensordot(A, B, [[1, 0], [0, 1]])
142 µs ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [8]: %timeit torch.tensordot(A, B, [[1, 0], [1, 0]])
62.8 µs ± 4.31 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

```
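
For reference, contracting both dimensions of two matrices yields a scalar. The pure-Python sketch below spells out what the four `dims` arguments in the timings above compute for square 2-D inputs. It does not use torch, and `contract_to_scalar` is a hypothetical helper written for illustration, not code from this PR.

```python
def contract_to_scalar(A, B, dims):
    """Fully contract square 2-D nested lists A and B using tensordot-style dims."""
    dims_a, dims_b = dims
    rows, cols = len(A), len(A[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            idx_a = (i, j)
            # Axis dims_a[p] of A is paired with axis dims_b[p] of B.
            idx_b = [0, 0]
            for p in range(2):
                idx_b[dims_b[p]] = idx_a[dims_a[p]]
            total += A[i][j] * B[idx_b[0]][idx_b[1]]
    return total


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
same_layout = contract_to_scalar(A, B, [[0, 1], [0, 1]])  # sum of A[i][j] * B[i][j]
transposed = contract_to_scalar(A, B, [[0, 1], [1, 0]])   # sum of A[i][j] * B[j][i]
```

The `[[0, 1], [0, 1]]` and `[[1, 0], [1, 0]]` cases pair matching axes and reduce to an elementwise product followed by a sum, while the two mixed cases pair each axis of `A` with the opposite axis of `B`.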

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145936
Approved by: https://github.com/albanD, https://github.com/ngimel
2025-05-13 10:57:30 +00:00
8d7dec6e92 Revert "[DSD] Don't pop tensors if they are on Meta device (#153185)"
This reverts commit 7243c69421cd0b868f3fa3b552c17e9c8b3023a1.

Reverted https://github.com/pytorch/pytorch/pull/153185 on behalf of https://github.com/jeanschmidt due to Seems to break internal signals, see [D74577069](https://www.internalfb.com/diff/D74577069) ([comment](https://github.com/pytorch/pytorch/pull/153185#issuecomment-2875662357))
2025-05-13 09:13:27 +00:00
cyy
9785b32189 Remove unused typing-extensions BUCK target (#153229)
This target is unused.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153229
Approved by: https://github.com/colesbury
2025-05-13 04:29:59 +00:00
cyy
15e08f9571 [submodule] Update ONNX to 1.18 (#152200)
Update ONNX to 1.18.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152200
Approved by: https://github.com/justinchuby, https://github.com/malfet
2025-05-13 04:18:45 +00:00
c4fb0b6f33 refresh expected results (#150166)
@huydhn when do you think we will have the APIs to access results on oss storage available so we do not
have to worry about this racing again?
Also is there a way to accelerate unstability in this after we land it?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150166
Approved by: https://github.com/bobrenjc93, https://github.com/eellison, https://github.com/anijain2305
2025-05-13 04:04:42 +00:00
483bbb639a [CI] Collect accuracy for MPS inductor benchmarks (#153443)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153443
Approved by: https://github.com/atalman
2025-05-13 03:49:28 +00:00
36722c287f [cutlass backend] make compile name independent of command (#153388)
Differential Revision: D74291603

The goal is to reuse the kernels as much as possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153388
Approved by: https://github.com/ColinPeppler
2025-05-13 03:49:24 +00:00
29c8ae825f [OpenReg] Move SDPA to OpenReg from open_registration_extension.cpp (#153309)
As the title stated.

**Next Changes**:
- Migrate remaining functionality to OpenReg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153309
Approved by: https://github.com/albanD
2025-05-13 03:49:19 +00:00
a6c5b59067 [MPSInductor] Fix multistage reduction suffixes (#153362)
By invalidating all variables created during the loop except for the context of iterator_cache, as storage can be done inside the reduction loop, and by clearing the `IteratorRangeEntry` codegen cache.

Which results in the following kernel for `x / x.sum()` if x size is 2048 and max thread group size is 1024
```metal
[[max_total_threads_per_threadgroup(1024)]]
kernel void generated_kernel(
    device half* out_ptr1,
    constant half* in_ptr0,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    threadgroup float tmp_acc_0[32];
    float tmp_acc_1 = 0;
    for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) {
        int r0_0 = 2 * r0_index + r0_0_cnt;
        auto tmp0 = static_cast<float>(in_ptr0[r0_0]);
        tmp_acc_1 += tmp0;
    }
    auto tmp1 = c10::metal::threadgroup_sum(tmp_acc_0, tmp_acc_1, r0_index * 1, 1024);
    for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) {
        int r0_0 = 2 * r0_index + r0_0_cnt;
        auto tmp2 = static_cast<float>(in_ptr0[r0_0]);
        auto tmp3 = tmp2 / tmp1;
        out_ptr1[r0_0] = static_cast<half>(tmp3);
    }
}
```

Fixes the compilation error reported while running `GPUTests.test_pattern_matcher_multi_user_mps` and `GPUTests.test_weight_norm_bwd_mps`

Fixes https://github.com/pytorch/pytorch/issues/152155

Inductor tests are still failing, though, so the variable invalidation needs further refinement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153362
Approved by: https://github.com/manuelcandales, https://github.com/dcci, https://github.com/jansel
2025-05-13 03:07:53 +00:00
27e9d9b103 [c10d][fr] Add try catch to update entry due to cuda error (#153414)
During the dump of FR, for some unknown reason, we see CUDA errors when querying events, and this causes the whole FR dump to fail (when trying to get entries). So we add a try-catch instead of letting it fail the whole process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153414
Approved by: https://github.com/d4l3k
2025-05-13 01:10:00 +00:00
8b507a9809 convert guard_size_oblivious to runtime check in infer_size_impl (#148872)
It's OK to check the requirement numel == newsize at runtime in the unbacked case, instead of checking it at compile time and assuming that it's true.
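
A simplified pure-Python sketch of the kind of check involved: a single `-1` dimension is inferred from `numel`, and the `numel == newsize` requirement is verified when the function runs. The function name `infer_size` and its error messages are assumptions for illustration, not the actual `infer_size_impl` code.

```python
from math import prod


def infer_size(shape, numel):
    """Resolve at most one -1 entry in shape so that the product equals numel."""
    unknown = [i for i, d in enumerate(shape) if d == -1]
    if len(unknown) > 1:
        raise ValueError("only one dimension can be inferred")
    known = prod(d for d in shape if d != -1)
    result = list(shape)
    if unknown:
        if known == 0 or numel % known != 0:
            raise ValueError(f"shape {shape} is invalid for input of size {numel}")
        result[unknown[0]] = numel // known
    # Runtime check of the requirement numel == newsize.
    if prod(result) != numel:
        raise ValueError(f"shape {shape} is invalid for input of size {numel}")
    return result
```

For example, `infer_size([2, -1], 6)` resolves the unknown dimension to 3, while `infer_size([4, -1], 6)` raises because 6 is not divisible by 4.
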
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148872
Approved by: https://github.com/bobrenjc93
2025-05-13 00:32:28 +00:00
0cf61ca7e4 make use_mem_pool threadlocal (#153356)
Partial fix for #152861, makes allocation to pool thread-local, but doesn't touch the second bug where multiple threads allocating to multiple pools error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153356
Approved by: https://github.com/Skylion007, https://github.com/eellison
2025-05-13 00:16:07 +00:00
d5d26ce436 Enable accelerator to perform streaming backward (#153412)
Also see https://github.com/pytorch/pytorch/pull/142097
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153412
Approved by: https://github.com/albanD
ghstack dependencies: #151079
2025-05-13 00:02:24 +00:00
71c8231742 fix bug with TORCHINDUCTOR_DUMP_LAUNCH_PARAMS (#153066)
Summary:
https://fb.workplace.com/groups/1028545332188949/posts/9503194033132340/?comment_id=9504669536318123&reply_comment_id=9506405459477864&notif_id=1746154132646897&notif_t=work_group_comment_mention

Aligns the arguments for the triton inputs

Differential Revision: D74085173

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153066
Approved by: https://github.com/jansel
2025-05-12 23:56:49 +00:00
641e4bee67 Revert "[export][cond] support merging constant ints as unbacked symint (#152742)"
This reverts commit a805911d15f0da0b3b07203d5cb727e84ef40cf0.

Reverted https://github.com/pytorch/pytorch/pull/152742 on behalf of https://github.com/ydwu4 due to breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/152742#issuecomment-2874410372))
2025-05-12 23:06:33 +00:00
a87e810980 add needs_contiguous_strides tag (#153399)
Summary:
The padding operations could lead to non-contiguous tensors, which will fail the test in `reduce_scatter_tensor`: https://fburl.com/code/5wt5xkig

The `needs_contiguous_strides` tag tells inductor that `reduce_scatter_tensor` needs contiguous inputs, so it will not execute padding operations.

Test Plan:
W/o the tag, job failed on the check:
https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-rebase_sanity_check_256bs_8t-fc398c39d3?job_attempt=0&version=0&tab=summary&env=PRODUCTION

With this tag, previously failed job succeeded:
https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-rebase_sanity_128bs_8t_i10_tag-2ed5b05276?job_attempt=11&version=0&tab=summary&env=PRODUCTION

Differential Revision: D74598810

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153399
Approved by: https://github.com/fmassa
2025-05-12 23:03:56 +00:00
f05b38aa26 [BE]: Improve decorator typing for Optimizer subclasses (#153374)
Improves typing so that the optimizer subclasses (all of which override step) do not have their type signature erased when this decorator is used. Now *kwarg values and returns will propagate.

This complements @tsunghsienlee PR #153367  as the type signature of step() was being erased on all the optimizer subclasses by this untyped decorator

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153374
Approved by: https://github.com/janeyx99, https://github.com/tsunghsienlee
2025-05-12 22:55:25 +00:00
b0f2891e43 [AOTInductor] Fix clang-tidy warnings in wrapper (#153197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153197
Approved by: https://github.com/desertfire
2025-05-12 22:35:59 +00:00
3ff22fe2df [BE]: Use shutil.which in inductor codegen (#153377)
Use shutil.which instead of subprocess. It is more secure, has better error handling, and is more cross-platform.
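
A minimal sketch of the pattern. `find_executable` is a hypothetical helper, and the tool names `clang` and `gcc` are picked only for illustration; neither is taken from the inductor codegen.

```python
import shutil


def find_executable(name):
    """Return the full path of name if it is found on PATH, else None."""
    # shutil.which searches PATH without spawning a subprocess.
    return shutil.which(name)


compiler = find_executable("clang") or find_executable("gcc")
if compiler is None:
    print("no C compiler found on PATH")
else:
    print(f"using {compiler}")
```

Because `shutil.which` returns `None` for a missing tool instead of raising or requiring output parsing, the caller can handle the missing-tool case with a plain `is None` check.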

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153377
Approved by: https://github.com/albanD
2025-05-12 22:11:26 +00:00
dbb4444ce3 [Memento] Add PT2 to Memory Snapshot (#152707)
Summary:
To add PT2 information to memory snapshot we piggyback off of the Kineto implementation using record_function similar to adding the user annotations. To do this we add the following:

1. Stack implementation that we instantiate to keep track of which compile context stack we are currently in (top element of the stack). The stack will be per device and thread-local since different threads of a process can be in different compile contexts at a given time. For this reason, we do not need to add mutexes to our stack impl since no two threads will touch a given stack
2. RecordFunction hooks to properly pipe the correct events to the compile context stack. These hooks are similar to the annotation ones in the fact that we just register them lazily and DO NOT unregister them. This is done out of convenience. In the future, we should save the handles and unregister them to minimize overhead after profiling is finished. As of now, we are registering this at the FUNCTION scope which is wide; however, we treat any function that does not start with "Torch-Compiled Region" as a no-op so we anticipate the difference in performance to be negligible during and after profiling. We also hide this feature behind a flag set to off on default so existing jobs will be unaffected
3. Piping for compile context to pickle output
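
The per-device, thread-local stack from item 1 can be sketched in Python as follows. This is only an illustration of the idea; all function names are hypothetical, the context label is made up, and this is not the code added by the PR.

```python
import threading
from collections import defaultdict

_state = threading.local()


def _stacks():
    # One dict of per-device stacks per thread, created lazily on first use.
    if not hasattr(_state, "stacks"):
        _state.stacks = defaultdict(list)
    return _state.stacks


def push_compile_context(device, name):
    _stacks()[device].append(name)


def pop_compile_context(device):
    stack = _stacks()[device]
    if stack:
        stack.pop()


def current_compile_context(device):
    # The top element is the compile context the thread is currently in.
    stack = _stacks()[device]
    return stack[-1] if stack else None


# The main thread enters a context; a second thread does not observe it.
push_compile_context("cuda:0", "Torch-Compiled Region: 0/0")  # illustrative label
seen_in_other_thread = []
worker = threading.Thread(
    target=lambda: seen_in_other_thread.append(current_compile_context("cuda:0"))
)
worker.start()
worker.join()
```

Since each thread only ever touches its own stacks, no mutex is needed, which matches the reasoning in item 1.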

Test Plan:
In D74039793, we add CompileContext to the visualizer and we see the following {F1977654658}

Differential Revision: D74028214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152707
Approved by: https://github.com/eqy
2025-05-12 21:12:51 +00:00
f78e4529a9 Rewrite autograd producer consumer stream sync logic (#151079)
Also see previous work https://github.com/pytorch/pytorch/pull/142097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151079
Approved by: https://github.com/albanD
2025-05-12 21:07:16 +00:00
f136046919 Clean up right nav (#153090)
- Move community and language binding links to the horizontal bar
- Add an intro to the community page.
- Fix the link in the ogp_image
- Fix the link in the version switcher
- Clean up unneeded links

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153090
Approved by: https://github.com/albanD
2025-05-12 21:00:45 +00:00
a805911d15 [export][cond] support merging constant ints as unbacked symint (#152742)
@pianpwk points out that this will be helpful to address several data dependent issues in huggingface [models](e23705e557/src/diffusers/schedulers/scheduling_euler_ancestral_discrete.py (L332)) with the following pattern:
```python
idx = 0 if u0 else 1
return x[idx]
```
We could preserve the conditional with a cond.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152742
Approved by: https://github.com/zou3519
2025-05-12 20:26:31 +00:00
88a068f33b [2/n][Optimus][Auto-AC] Support activation quantization with scaling (#151770)
Summary:
Previously, we only supported non-scaling quantization, which may lead to overflow. Here we add support for scaling quantization and set it as the default version.

Here, we quantize activation nodes based on the size_in_mb, the default value is 100, i.e., as long as the node has at least 100MB size, we will quantize it.
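
A rough sketch of the size threshold described above. `should_quantize` is a hypothetical helper, and treating 1 MB as 1024 * 1024 bytes is an assumption; the actual pass may compute the size differently.

```python
def should_quantize(numel, itemsize, size_in_mb=100.0):
    """Return True if a tensor of numel elements, itemsize bytes each, meets the threshold."""
    # Assumption: 1 MB == 1024 * 1024 bytes.
    size_mb = numel * itemsize / (1024 * 1024)
    return size_mb >= size_in_mb


# bfloat16 uses 2 bytes per element.
large = should_quantize(8192 * 8192, 2)  # 128 MB, at or above the default of 100
small = should_quantize(4096 * 4096, 2)  # 32 MB, below the default of 100
```

Lowering `size_in_mb` makes more activation nodes eligible, and a value of 0 makes every node eligible under this sketch.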

Test Plan:
### how to enable

```
torch._inductor.config.post_grad_fusion_options = {
    "activation_quantization_aten_pass": {
        # dtype to quantize to; this type is the default and can be changed
        "quant_type": "torch.float8_e5m2",
        # default is False; set to True to use the scaling version
        "use_scaling": False,
        # default is 100; tune the value as needed
        "size_in_mb": 0.0,
        # whether to exclude parameters from quantization; default is False
        "exclude_primals": False,
        # dtypes considered for quantization, separated by ";"; default is torch.bfloat16
        "allowed_dtypes": "torch.float16;torch.bfloat16;torch.float32",
    },
}
```

### toy model

```
buck2 run mode/opt //scripts/qyz/autoac:quantization
```

```
Epoch [80/200], Loss: 19227.2109
Epoch [100/200], Loss: 1353.5272
Epoch [120/200], Loss: 38630.6758
Epoch [140/200], Loss: 6239.9155
Epoch [160/200], Loss: 6039.1567
Epoch [180/200], Loss: 3994.3569
Epoch [200/200], Loss: 146.3966
```

Differential Revision: D73015996

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151770
Approved by: https://github.com/Mingming-Ding
2025-05-12 19:43:18 +00:00
45df18dcd0 [BE]: Enable ruff rule TC007 (#153394)
Enables [TC007](https://docs.astral.sh/ruff/rules/unquoted-type-alias/#unquoted-type-alias-tc007), which finds type aliases that should be quoted if they have to interact with `if TYPE_CHECKING` blocks.

It was disabled when we updated ruff, but only TC006 really needed to be disabled, as that is the rule that is going to cause codebase-wide changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153394
Approved by: https://github.com/albanD
2025-05-12 19:18:29 +00:00
fb85ebd710 [BE]: Use undocumented temp shim to restore setuptools compat (#153052)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153052
Approved by: https://github.com/albanD
2025-05-12 18:33:41 +00:00
3555ebb63d [BE]: Update ruff to 0.11.8 (#153249)
Fixes a ton of false negatives throughout the codebase. RUFF also properly validates NOQA comments now and most of the changes are fixing typos there or removing filewide flake8 suppressions that were also silencing ruff issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153249
Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/seemethere
2025-05-12 18:30:52 +00:00
5c3fddb9cc Revert "[Hierarchical Compilation] Track node mutations (#152389)"
This reverts commit c2936ebfd58be7a6519f51d165dfac8407020140.

Reverted https://github.com/pytorch/pytorch/pull/152389 on behalf of https://github.com/jeanschmidt due to Humm, interesting, there seems to be a bug in stack PRs, as it should be part of the stack and be reverted with the other ones ([comment](https://github.com/pytorch/pytorch/pull/152389#issuecomment-2873540451))
2025-05-12 18:18:44 +00:00
e1d03fa251 [Inductor] Optimize grid calculation by using // instead of FloorDiv (#153230)
https://github.com/pytorch/pytorch/pull/146942 introduced an 8.3% regression on the `benchmark_torchbench_run_bert_pytorch_training:defaults-speedup-x1000` perf metric. This was flagged by internal CI testing (task T223596372).

The root cause seems to be that `FloorDiv` is now used to calculate the launch grid in certain scenarios, which is slower than the previously-used `//`. Since launch grid calculations happen at runtime, they can have a significant performance impact on some models.

The reason for switching to `FloorDiv` in https://github.com/pytorch/pytorch/pull/146942 was to allow the FX backend to generate runnable Python code. `FloorDiv(x, y)` maps to `x // y` in Python, whereas `sympy.floor(sympy.Rational(x,y))` maps to `floor(x/y)`, which crashes as FX doesn't know what `floor` is.

To get the best of both worlds, this PR reverts to using `//` to calculate launch grids, but then post-processes the resulting sympy expressions in the FX converter, converting `floor(x / y)` to `FloorDiv(x, y)`. Since this sympy manipulation happens at compile time, the perf impact should be minimal, and should only affect the FX backend. This is similar to the approach previously explored in https://github.com/pytorch/pytorch/pull/151144, but the implementation is more minimal and self-contained.
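
As a plain-Python illustration of why the two forms are interchangeable for a launch grid, the sketch below computes a ceil-divided grid size both ways. It does not use sympy or the FX converter; `grid_floordiv` and `grid_float_floor` are hypothetical names, and the block size of 128 is an arbitrary choice.

```python
import math


def grid_floordiv(numel, block):
    # Integer floor division, the `//` form used for runtime grid math.
    return (numel + block - 1) // block


def grid_float_floor(numel, block):
    # The `floor(x / y)` form: same value for moderately sized integers,
    # but it goes through float division and needs a `floor` function.
    return math.floor((numel + block - 1) / block)


grid = grid_floordiv(1000, 128)
same = all(grid_floordiv(n, 128) == grid_float_floor(n, 128) for n in range(1, 10000))
```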

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153230
Approved by: https://github.com/jansel
2025-05-12 18:08:52 +00:00
498f364518 Fix test_fused_scaled_matmul_reduce_scatter when scatter_dim is 0 (#153286)
The function signature of fused_scaled_matmul_reduce_scatter was changed. This PR fixes the function signature. However when scatter_dim is 1, the two outputs are not close. We need a followup on this.

Another followup is to change fused_scaled_matmul_reduce_scatter to make those newly added arguments optional. Users shouldn't need to pass these arguments if they don't flatten the inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153286
Approved by: https://github.com/kwen2501
2025-05-12 17:38:49 +00:00
7e1790d86b [xla hash update] update the pinned xla hash (#153368)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153368
Approved by: https://github.com/pytorchbot
2025-05-12 17:11:23 +00:00
dc47295dc5 [Inductor UT][Break XPU] Generalize newly added device-bias code in Inductor UT. (#153355)
Fixes #153123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153355
Approved by: https://github.com/desertfire, https://github.com/Skylion007
2025-05-12 15:53:05 +00:00
ea4b65ab60 Fix the type hint of step() with default value (#153367)
Summary: Because the default value of `closure` is `None`, this fixes the type hint for calling `step()` with no arguments. The previous typing (https://github.com/pytorch/pytorch/pull/102593) only allowed `step(closure=None)` and `step(None)`.
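
The idea can be sketched with `typing.overload` on a toy class. This is an illustration under assumptions, not the actual signature in `torch.optim`: `ToyOptimizer` is hypothetical, and the overload with `closure: None = None` is what lets a type checker accept a bare `step()` call.

```python
from typing import Callable, Optional, overload


class ToyOptimizer:
    @overload
    def step(self, closure: None = None) -> None: ...

    @overload
    def step(self, closure: Callable[[], float]) -> float: ...

    def step(self, closure: Optional[Callable[[], float]] = None) -> Optional[float]:
        # Re-evaluate the loss only when a closure is supplied.
        return closure() if closure is not None else None


opt = ToyOptimizer()
no_arg = opt.step()                   # matches the first overload
with_closure = opt.step(lambda: 0.5)  # matches the second overload
```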

Test Plan: contbuild & OSS CI

Differential Revision: D74560785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153367
Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/janeyx99
2025-05-12 15:52:59 +00:00
de5c5f4fb7 Opt-out LF runners from of inductor jobs (#153151)
Opt-out of inductor jobs for the lf experiment configuration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153151
Approved by: https://github.com/seemethere
2025-05-12 15:52:53 +00:00
89aa6eb19b Stop codegen-ing post_grad_custom_pass in repros (#153243)
When codegen'ed, it looks like:
```py
post_grad_custom_pass = <object at 0x12345678>
```
Which is not runnable at all. Some logic is also trying to deepcopy the
object, and not all of these objects are deepcopy-able.

This PR skips codegenning of these passes.
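
Both problems can be reproduced in isolation. The sketch below uses a hypothetical `CustomPass` class as a stand-in for a user-supplied pass object; it holds a lock only to make it non-deepcopy-able.

```python
import copy
import threading


class CustomPass:
    """Hypothetical stand-in for a user-supplied post-grad pass object."""

    def __init__(self):
        self.lock = threading.Lock()  # a member that cannot be deep-copied


custom_pass = CustomPass()
# The default repr looks like "<... object at 0x...>", which is not valid Python.
line = f"post_grad_custom_pass = {custom_pass!r}"

try:
    compile(line, "<repro>", "exec")
    is_runnable = True
except SyntaxError:
    is_runnable = False

try:
    copy.deepcopy(custom_pass)
    is_copyable = True
except TypeError:
    is_copyable = False
```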

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153243
Approved by: https://github.com/houseroad
2025-05-12 15:21:11 +00:00
7657d80a58 [aoti] when generating example input shapes, use unbacked replacements (#153220)
## Context
Suppose we have a graph like this:
```
a: "[s1 + u2, 200]"
b: "[u0, 32]"
cat: "[s1 + u2, 232]" = torch.cat([a, b], dim=1)
```

NOTE: torch.cat assumes "all tensors must either have the same shape (except in the concatenating dimension) or be a 1-D empty tensor with size (0,)."

So, we would expect u0 = s1 + u2 which is guarded on today except it's a deferred runtime assertion since unbacked symints aren't replaced today as Pian.

Notice how a has a different symbolic shape than both b and cat. Today, this will create an unexpected shape mismatch when AOTI autotunes. Here's a rough illustration where 8192 is the unbacked symint fallback value.

```
# s1 is an arbitrary integer
a = generate_example_value(size=(s1 + 8192, 200))
b = generate_example_value(size=(8192, 32))
out = generate_example_value(size=(s1 + 8192, 232))
triton_cat.run(a, b, out ...)
```
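
The mismatch, and how a replacement removes it, can be sketched in plain Python. This is a toy model rather than AOTI code: symbolic sizes are tuples of symbol names to be summed, `example_size` is a hypothetical helper, and the replacement direction (`u0` rewritten as `s1 + u2`) is an assumption made for illustration.

```python
UNBACKED_FALLBACK = 8192  # fallback hint for unbacked symints, as in the illustration above


def example_size(expr, backed, replacements=None):
    # `expr` is a tuple of symbol names whose values are summed.
    # Backed symbols have known hints; unbacked ones use the fallback.
    replacements = replacements or {}
    total = 0
    for sym in expr:
        for s in replacements.get(sym, (sym,)):
            total += backed.get(s, UNBACKED_FALLBACK)
    return total


backed = {"s1": 16}          # s1 is an arbitrary backed integer
a_rows = ("s1", "u2")
b_rows = ("u0",)
repl = {"u0": ("s1", "u2")}  # assumed replacement: u0 -> s1 + u2

without_repl = (example_size(a_rows, backed), example_size(b_rows, backed))
with_repl = (example_size(a_rows, backed, repl), example_size(b_rows, backed, repl))
```

Without the replacement the two row counts differ, which is the shape mismatch described above; with it, both tensors get the same example row count.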

## Error
```
wrapper.py:1484: <module>: block: [443,0,0], thread: [53,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
...
wrapper.py:1484: <module>: block: [443,0,0], thread: [55,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.

RuntimeError: CUDA error: device-side assert triggered
```

Differential Revision: [D74485962](https://our.internmc.facebook.com/intern/diff/D74485962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153220
Approved by: https://github.com/desertfire
2025-05-12 15:20:57 +00:00
1c659b5bc0 [BE]: Use more portable shutil.which call for cpp_builder (#153325)
We should be using shutil.which instead of calling some binary subprocess here for portability and security.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153325
Approved by: https://github.com/xuhancn, https://github.com/cyyever, https://github.com/albanD
2025-05-12 15:15:21 +00:00
78d752e96a Revert "[Hierarchical Compilation] Use universal flatten APIs (#152505)"
This reverts commit f9e3a9058e80fde310e5f0919d3a21e28cd024a8.

Reverted https://github.com/pytorch/pytorch/pull/152505 on behalf of https://github.com/jeanschmidt due to [TENTATIVE] reverting to check if reverting this stack partially caused the introduction of https://github.com/pytorch/pytorch/actions/runs/14966121510/job/42049638969#step:22:875 ([comment](https://github.com/pytorch/pytorch/pull/152505#issuecomment-2872869990))
2025-05-12 14:48:08 +00:00
cb35a2b15d Add missing in-place on view check to custom autograd.Function (#153094)
Fixes https://github.com/pytorch/pytorch/issues/152773

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153094
Approved by: https://github.com/albanD
ghstack dependencies: #153005
2025-05-12 14:42:46 +00:00
a67dd2083c [dynamo] Guard serialization for SHAPE_ENV (#153258)
Differential Revision: [D74483150](https://our.internmc.facebook.com/intern/diff/D74483150/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153258
Approved by: https://github.com/jansel
ghstack dependencies: #153255, #153256, #153257
2025-05-12 14:42:01 +00:00
e2f6870c98 [dynamo] Guard serialization for DEFAULT_DEVICE (#153257)
Differential Revision: [D74483147](https://our.internmc.facebook.com/intern/diff/D74483147/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153257
Approved by: https://github.com/jansel
ghstack dependencies: #153255, #153256
2025-05-12 14:42:00 +00:00
ef1dcc21ee [dynamo] Guard serialization for global state guards (GRAD_MODE, DETERMINISTIC_ALGORITHMS, TORCH_FUNCTION_STATE, FSDP_TRAINING_STATE) (#153256)
serialization for global state guards.

Differential Revision: [D74483149](https://our.internmc.facebook.com/intern/diff/D74483149/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153256
Approved by: https://github.com/jansel
ghstack dependencies: #153255
2025-05-12 14:41:53 +00:00
0210986cc4 [dynamo] Guard serialization for EMPTY_NN_MODULE_HOOKS_DICT (#153255)
EMPTY_NN_MODULE_HOOKS_DICT

Differential Revision: [D74483148](https://our.internmc.facebook.com/intern/diff/D74483148/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153255
Approved by: https://github.com/jansel
2025-05-12 14:41:44 +00:00
daca611465 Revert "[ROCm] Maxpool forward NHWC Perf Improvement targeting Resnet scenarios (#151727)"
This reverts commit 5683965f02c4091a864484917f74e3a42c9c56ae.

Reverted https://github.com/pytorch/pytorch/pull/151727 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/151727#issuecomment-2872361816))
2025-05-12 12:29:28 +00:00
8511d21081 Revert "Forward fix #151727 (#153306)"
This reverts commit 64518ca7420271562c4920c13c44221c54e534df.

Reverted https://github.com/pytorch/pytorch/pull/153306 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/153306#issuecomment-2872339570))
2025-05-12 12:22:13 +00:00
23ecd35a96 Update slow tests (#151207)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151207
Approved by: https://github.com/pytorchbot
2025-05-12 12:05:58 +00:00
47df195065 Revert "[Hierarchical Compile] Add mutation dependencies to topological sorting (#152410)"
This reverts commit bc8b305eb816106de31602f8b7fd80d4113e6ee8.

Reverted https://github.com/pytorch/pytorch/pull/152410 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
0e36887209 Revert "[Hierarchical Compile] Take into account mutation deps in cycle detection (#152506)"
This reverts commit 779e647999645d19eebf01fa686fb792176f8940.

Reverted https://github.com/pytorch/pytorch/pull/152506 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
53ebcabb52 Revert "[Hierarchical Compile] Replace tracing alias and mutation check with dynamo impl (#152570)"
This reverts commit 50df08eb5e4d9276b72929fd859ad892880bab0f.

Reverted https://github.com/pytorch/pytorch/pull/152570 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
0071fdab9e Revert "[Dynamo] Fix typing in graph_deduplication.py (#152572)"
This reverts commit 15166be691454f8a0e626b54b6be0bea51938f86.

Reverted https://github.com/pytorch/pytorch/pull/152572 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
aa7fe6af41 Revert "[Dynamo] Optimize dedupe region ancestor tracking (#152589)"
This reverts commit b5f1345f72ec6d1b004b05284e9553e65ee03abc.

Reverted https://github.com/pytorch/pytorch/pull/152589 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
7243c69421 [DSD] Don't pop tensors if they are on Meta device (#153185)
DSD currently pops tensors if they are on the Meta device. This prevents use cases where users want DCP to directly initialize the tensors when loading.

This PR also removes test/distributed/checkpoint/e2e/test_pipeline.py which is based on the above feature that is not realistic and is not used anywhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153185
Approved by: https://github.com/mori360
2025-05-12 07:04:59 +00:00
032ef48725 [BE]: Add PEP621 project section to pyproject.toml (#153055)
Follow-up to @ezyang's PR #153020, but makes better use of PEP 621 to reduce redundant fields and pass metadata through to uv, setuptools, poetry and other tooling.

* Enables modern tooling like `uv sync` and better support for tools like poetry.
* Also allows us to set project-wide settings that are respected by linters and IDEs (in this example we are able to centralize the minimum supported Python version).
* Currently most of the values are dynamically fetched from setuptools; eventually we can migrate all the statically defined values to pyproject.toml and they will be autopopulated in the setuptools arguments.
* This controls what additional metadata shows up on PyPI. Special URL names are listed here for rendering on PyPI: https://packaging.python.org/en/latest/specifications/well-known-project-urls/#well-known-labels

This also clearly shows us which fields will need to be migrated from setup.py to pyproject.toml over time, per #152276. Static fields should be fairly easy to migrate; the dynamically built ones like requirements are a bit more challenging.

Without this, `uv sync` complains:
```
error: No `project` table found in: `pytorch/pyproject.toml`
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153055
Approved by: https://github.com/ezyang
2025-05-12 02:16:07 +00:00
ceb009baee [map] always turn on dynamo for map (#152041)
Summary:
X-link: https://github.com/pytorch/executorch/pull/10409

Reland D72896450

Make map consistent with other control flow ops. After the change, map is able to support accessing closures in the map fn.

Test Plan: See existing tests.

Reviewed By: zou3519

Differential Revision: D73138427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152041
Approved by: https://github.com/zou3519
2025-05-12 02:10:08 +00:00
c5b4dc9898 [executorch hash update] update the pinned executorch hash (#152238)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152238
Approved by: https://github.com/pytorchbot, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-05-12 01:50:12 +00:00
930de01861 [Typing] Apply torch.types.Device in torch/cuda/memory.py (#153027)
Part of: #152952

Here is the definition of `torch.types.Device`:

ab997d9ff5/torch/types.py (L74)

It contains `int`, so the `int` in `Union[Device, int]` is redundant.
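
The redundancy can be checked with the `typing` module alone, since nested unions are flattened and duplicates dropped. The sketch below uses a hypothetical `FakeDevice` class in place of `torch.device`, and the alias shape is an assumption modeled on the description above.

```python
from typing import Optional, Union


class FakeDevice:
    """Hypothetical stand-in for torch.device."""


# Assumed shape of the alias: an optional union that already includes `int`.
Device = Optional[Union[FakeDevice, str, int]]

redundant = Union[Device, int]
is_same = redundant == Device
```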

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153027
Approved by: https://github.com/Skylion007
2025-05-11 23:32:59 +00:00
0104ac0f6f [Ez][BE]: Fix click ImportError in torch/csrc/jit (#153323)
Fixes an unnecessary import for torch script. Unblocks #153020, which appears to fix the circular import linter so that it imports every Python file under torch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153323
Approved by: https://github.com/ngimel, https://github.com/cyyever
2025-05-11 19:16:01 +00:00
c51bdf5acf [export] Exporter API prototype. (#153205)
Summary: see inline code comments for documentation

Test Plan:
CI

buck2 test --flagfile fbcode//mode/opt fbcode//caffe2/test:test_export -- -r TestPackage

Differential Revision: D74426900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153205
Approved by: https://github.com/tugsbayasgalan
2025-05-11 14:20:09 +00:00
909ec495b8 [audio hash update] update the pinned audio hash (#153301)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153301
Approved by: https://github.com/pytorchbot
2025-05-11 03:47:56 +00:00
1f5cf19f56 [cutlass backend] Use src code to generate cutlass gemm name (#153006)
This shaves off 40s for at least small cases, since we don't have to recompile the kernel again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153006
Approved by: https://github.com/mlazos
2025-05-11 00:57:03 +00:00
64518ca742 Forward fix #151727 (#153306)
#151727 is failing internally with the following error `error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces]`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153306
Approved by: https://github.com/eqy, https://github.com/cyyever, https://github.com/wdvr
2025-05-11 00:39:59 +00:00
fdc387ec7c Revert "refine fp32 precision api (#125888)"
This reverts commit 4c11b26158691cfd9ad48338ddebd1ca9bded788.

Reverted https://github.com/pytorch/pytorch/pull/125888 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause some failures on ROCm ([comment](https://github.com/pytorch/pytorch/pull/125888#issuecomment-2869274791))
2025-05-11 00:35:46 +00:00
e4f22822cb Revert "Cleanup VS 2019 refs in pytorch (#145863)" (#152613)
This reverts commit b45e6fa707ced2adb68eaf1a2c1ccb389a6283d7.

revert PRs:
https://github.com/pytorch/pytorch/pull/145863
https://github.com/pytorch/pytorch/pull/145319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152613
Approved by: https://github.com/atalman, https://github.com/malfet
2025-05-10 19:33:26 +00:00
4f068598c4 [BE] Delete now unused mac-mps.yml (#153263)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153263
Approved by: https://github.com/Skylion007, https://github.com/cyyever
ghstack dependencies: #153013, #153057, #152719
2025-05-10 19:10:41 +00:00
d22c40373f [Ez][BE]: Fix KeyError LOGNAME (#153324)
Unblocks #153020, which accidentally improves the CircularImportLinter to check all Python files. The linter doesn't set LOGNAME, so the lookup fails with a KeyError. Another FSDP script already defaults LOGNAME to '' if not specified; this does the same.
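
A minimal sketch of the defaulting pattern, using a hypothetical helper `current_logname` (the actual script may read the variable differently):

```python
import os


def current_logname(env=os.environ):
    # .get() with a default avoids the KeyError that env["LOGNAME"]
    # raises when the variable is unset.
    return env.get("LOGNAME", "")


unset = current_logname({})
set_value = current_logname({"LOGNAME": "alice"})
```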

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153324
Approved by: https://github.com/awgu
2025-05-10 18:23:38 +00:00
6a84fe65ec Fix code portability when looking for Dot (#153259)
When trying to plot a trace graph, Inductor checks if "dot" is installed. Currently, the code runs a "which dot" command.

By default, Windows doesn't have the "which" command. This patch replaces it with the more portable alternative.
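
A minimal sketch of the portable check, assuming the alternative is `shutil.which` from the standard library; `has_dot` is a hypothetical helper name, not the function used in Inductor:

```python
import shutil


def has_dot():
    # shutil.which searches PATH in-process on every platform and
    # returns None when the executable is not found.
    return shutil.which("dot") is not None


dot_available = has_dot()
```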

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153259
Approved by: https://github.com/Skylion007
2025-05-10 16:12:44 +00:00
01cbf5a30a [AOTInductor] Add wrapper and kernel code to debug code logging (#153181)
This is a simple PR to make the AOTInductor wrapper and kernel code get output by `TORCH_COMPILE_DEBUG=1`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153181
Approved by: https://github.com/desertfire
2025-05-10 15:31:18 +00:00
01bb249978 Revert "has_triton: Use the device interface for detecting Triton availability (#139171)"
This reverts commit 48bfe9afc70a98addd5aa738bf501c029e4a9285.

Reverted https://github.com/pytorch/pytorch/pull/139171 on behalf of https://github.com/masnesral due to Performance regression for huggingface ([comment](https://github.com/pytorch/pytorch/pull/139171#issuecomment-2868939790))
2025-05-10 14:46:23 +00:00
70c8047c2d include user stacks with constraint violation error message (#152924)
Fixes #152918

Before:

```
File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 5588, in produce_guards_verbose
    raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['x'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - You marked L['x'].size()[0] as dynamic but your code specialized it to be a constant (5). Either remove the mark_dynamic or use a less strict API such as maybe_mark_dynamic or Dim.AUTO.
```

After:

```
File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 5588, in produce_guards_verbose
    raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['x'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - You marked L['x'].size()[0] as dynamic but your code specialized it to be a constant (5). Either remove the mark_dynamic or use a less strict API such as maybe_mark_dynamic or Dim.AUTO.

User stack:
  File "/home/bobren/local/a/pytorch/error.py", line 5, in foo
    return torch.randn(5) * x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152924
Approved by: https://github.com/pianpwk
2025-05-10 13:36:47 +00:00
4c11b26158 refine fp32 precision api (#125888)
Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop "highest, high, medium" as the way to represent fp32 internal computation data types. Instead, we will name the algorithm directly.

### Design Choice: Directly use algorithm names like "TF32", "BF16".
#### Pros
 - The names are more informative. 'tf32' is more informative than a simple "high".
 - Easier to extend with new algorithms like `tf32x3`
#### Cons
 - "HIGHEST, HIGH, MEDIUM" indicated the relative precision between different algorithms. However, we can cover this in additional documentation.

### We provide a layered structure for backends/operators.
('f32' is short for 'fp32_precision')
![image](https://github.com/user-attachments/assets/f89143e5-d6a1-4865-9351-9a50439f5067)

### We provide the following fp32 compute precision settings:
 - **"ieee"**: Not allowed to use any other internal computation data types.
 - **"tf32"**: Allowed to use tf32 as the internal computation data type.
 - **"bf16"**: Allowed to use bf16 as the internal computation data type.
 - **"none"**: Precision is not set. Can be overridden by its parent node.

### Overriding Precision Settings
A child node can be overridden by its parent node if it is set to default.
The current default settings are:
```
backend = generic, op = all, precision setting = none
    backend = cuda, op = all, precision setting = none
        backend = cuda, op = conv, precision setting = tf32
        backend = cuda, op = rnn, precision setting = tf32
        backend = cuda, op = matmul, precision setting = none
    backend = mkldnn, op = all, precision setting = none
        backend = mkldnn, op = conv, precision setting = none
        backend = mkldnn, op = rnn, precision setting = none
        backend = mkldnn, op = matmul, precision setting = none
```
 - If the user sets `torch.backends.mkldnn.fp32_precision="bf16"`, its child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` will also be overridden to "bf16".
 - If the user sets `torch.backends.fp32_precision="bf16"`, `torch.backends.mkldnn.fp32_precision` and its child nodes will also be overridden to "bf16".
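
The override rule can be sketched as a small tree lookup. This is a toy model, not the PyTorch implementation: `PrecisionNode` is hypothetical, and it assumes that "set to default" means the value "none", so a node with an explicit setting keeps it.

```python
class PrecisionNode:
    """Hypothetical node in the backend/operator precision tree."""

    def __init__(self, name, parent=None, precision="none"):
        self.name = name
        self.parent = parent
        self.precision = precision

    def resolve(self):
        # A node left at "none" defers to its nearest ancestor
        # that has an explicit setting.
        node = self
        while node is not None:
            if node.precision != "none":
                return node.precision
            node = node.parent
        return "none"


generic = PrecisionNode("generic")
mkldnn = PrecisionNode("mkldnn", parent=generic)
mkldnn_matmul = PrecisionNode("mkldnn.matmul", parent=mkldnn)
mkldnn_conv = PrecisionNode("mkldnn.conv", parent=mkldnn, precision="ieee")

before = mkldnn_matmul.resolve()
generic.precision = "bf16"  # analogous to setting the top-level precision
after = mkldnn_matmul.resolve()
explicit_child = mkldnn_conv.resolve()
```

Here `mkldnn.matmul` picks up "bf16" from the top-level node, while `mkldnn.conv` keeps the "ieee" value that was set on it directly.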

### Backward Compatibility
Since the new API gives users more fine-grained control, there will be some conflicts. For example, the previous `torch.backends.cudnn.allow_tf32` is not enough to represent the state `torch.backends.cudnn.rnn.fp32_precision="ieee"` together with `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goals for backward compatibility are:
 - If the user only uses the previous APIs, they will work as before.
 - If the user uses the **new** API to change the status to one that is **un-representable** by the old API, and then tries to access the status through the **old** API, we will raise a RuntimeError and point the user to the documentation.

### Test Plan
```
python test/test_cuda.py -k test_fp32_precision_with_tf32
python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision
python test/test_cuda.py -k test_invalid_status_for_legacy_api
python test/test_mkldnn.py -k test_mlkdnn_get_set
python test/test_mkldnn.py -k test_generic_precision
python test/test_mkldnn.py -k test_invalid
python test/test_mkldnn.py -k test_default_use_parent
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888
Approved by: https://github.com/jgong5, https://github.com/albanD

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-05-10 11:13:04 +00:00
b5f1345f72 [Dynamo] Optimize dedupe region ancestor tracking (#152589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152589
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506, #152570, #152572
2025-05-10 08:27:56 +00:00
15166be691 [Dynamo] Fix typing in graph_deduplication.py (#152572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152572
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506, #152570
2025-05-10 08:27:56 +00:00
50df08eb5e [Hierarchical Compile] Replace tracing alias and mutation check with dynamo impl (#152570)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152570
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506
2025-05-10 08:27:45 +00:00
779e647999 [Hierarchical Compile] Take into account mutation deps in cycle detection (#152506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152506
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410
2025-05-10 08:27:31 +00:00
bc8b305eb8 [Hierarchical Compile] Add mutation dependencies to topological sorting (#152410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152410
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505
2025-05-10 08:27:19 +00:00
f9e3a9058e [Hierarchical Compilation] Use universal flatten APIs (#152505)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152505
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389
2025-05-10 08:27:07 +00:00
c2936ebfd5 [Hierarchical Compilation] Track node mutations (#152389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152389
Approved by: https://github.com/anijain2305
2025-05-10 08:27:01 +00:00
bc4cf1c13a [BE] fix failing test_dp_state_dict_save_load on ROCm CI where world_size=7 (#153283)
**Summary**
I saw an unrelated CI failure `distributed/_composable/fsdp/test_fully_shard_state_dict.py::TestFullyShardStateDictMultiProcess::test_dp_state_dict_save_load` in one of my PRs: https://hud.pytorch.org/pr/pytorch/pytorch/153225#41930032096

This is caused by triggering uneven sharding in FSDP2 at cbb03e6971/torch/distributed/fsdp/_fully_shard/_fsdp_param.py (L353-L361)

This didn't show up because the CUDA CI has an even number of GPUs (e.g. 2/4/8), which is not the case on ROCm CI. In the failing CI case, the device count is 7.

**Solution**
Skip the test if `self.world_size` can not divide `mlp_dim` (i.e. 16).
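
The skip condition amounts to a divisibility check. A minimal sketch, with `should_skip` as a hypothetical helper rather than the code in the test file:

```python
def should_skip(world_size, mlp_dim=16):
    # Skip when mlp_dim is not a multiple of world_size, which is the
    # case that triggers uneven sharding.
    return mlp_dim % world_size != 0


skip_on_rocm_ci = should_skip(7)
skip_on_cuda_ci = [should_skip(n) for n in (2, 4, 8)]
```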

**Test**
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153283
Approved by: https://github.com/fegin, https://github.com/weifengpy
2025-05-10 04:46:32 +00:00
fc7d8c6808 [Pipelining] Fix _batch_p2p bug for non-NCCL backends (#132644) (#152938)
Fixes #132644

`_batch_p2p` incorrectly assumes that `dist.batch_isend_irecv` returns a single-element list of `dist.Work`, likely due to NCCL's coalescing behaviour.

For non-NCCL backends like Gloo, multiple `dist.Work` objects are returned, causing the code to discard some operations via `.pop()`. This leads to deadlocks during pipeline parallelism.

## Changes:

* Modified `_batch_p2p` to return `list[dist.Work]` instead of popping a single element.
* Added `_wait_batch_p2p` to call `wait()` on multiple `dist.Work` objects, consuming the result of `_batch_p2p`.
* Updated references from `dist.Work` to `list[dist.Work]`.
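
The difference between popping one handle and waiting on all of them can be sketched without `torch.distributed`. Everything here is a hypothetical stand-in: `FakeWork` mimics a work handle with a `wait()` method, and `fake_batch_isend_irecv` assumes a coalescing backend returns one handle while others return one per operation.

```python
class FakeWork:
    """Hypothetical stand-in for a dist.Work handle."""

    def __init__(self):
        self.done = False

    def wait(self):
        self.done = True


def fake_batch_isend_irecv(num_ops, coalesced):
    # Assumption for illustration: one handle when coalesced, else one per op.
    return [FakeWork()] if coalesced else [FakeWork() for _ in range(num_ops)]


def wait_all(works):
    # Wait on every handle instead of only the last one.
    for work in works:
        work.wait()


works = fake_batch_isend_irecv(num_ops=3, coalesced=False)

# Old behaviour: keep only the popped handle and wait on it.
list(works).pop().wait()
pending_after_pop = sum(not w.done for w in works)

# New behaviour: wait on the whole list.
wait_all(works)
pending_after_wait_all = sum(not w.done for w in works)
```

With three uncoalesced operations, popping leaves two handles never waited on, while waiting on the full list leaves none.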

## Testing:

* `pippy_bert.py` from #132644 now works with gloo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152938
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
2025-05-10 04:19:38 +00:00
b86d46ff21 [torch][ao] Properly strip tracking stats in _fold_conv_bn_qat for 1D (#152982)
Summary: _fold_conv_bn_qat has logic to remove the tracking stats. Currently, its check only covers torch.nn.modules.batchnorm.BatchNorm2d. As a result, the tracking stats are not properly removed when 1D is used. This diff updates the check to fix this.

Test Plan:
Run N7113483 without this fix.

{F1977726982}

```
bento kernel build sensorml
```

Re-run with local version of kernel, containing this diff:

{F1977727151}

Notice that now, num_batches is removed.

Differential Revision: D74269649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152982
Approved by: https://github.com/andrewor14, https://github.com/yushangdi
2025-05-10 01:20:18 +00:00
9c99ea2991 error out on negative offs or on K=0 in group gemm (#153226)
Error out if K=0 in one of the grouped gemms to avoid hangs in #152668
Also, adds meta function for _scaled_grouped_mm (TODO: do the same for _grouped_mm, unless it's done already)

One weird thing I'm seeing, when running all grouped_gemm tests, I'm erroring out with
```
  File "/data/users/ngimel/pytorch/torch/_inductor/graph.py", line 1246, in call_function
    out = lowerings[target](*args, **kwargs)  # type: ignore[index]
  File "/data/users/ngimel/pytorch/torch/_inductor/lowering.py", line 445, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/data/users/ngimel/pytorch/torch/_inductor/kernel/mm_scaled_grouped.py", line 444, in tuned_scaled_grouped_mm
    if is_nonzero and can_use_triton_kernel(mat_a, mat_b, offs, bias):
  File "/data/users/ngimel/pytorch/torch/_inductor/kernel/mm_scaled_grouped.py", line 375, in can_use_triton_kernel
    offs is not None
  File "/home/ngimel/.conda/envs/pytorch_monarch/lib/python3.10/site-packages/sympy/core/relational.py", line 516, in __bool__
    raise TypeError("cannot determine truth value of Relational")
torch._inductor.exc.InductorError: LoweringException: TypeError: cannot determine truth value of Relational
```
which is weird, there's no relational that sympy has to evaluate in `offs is not None`, and when running this test separately (`test_scaled_grouped_gemm_2d_3d_fast_accum_True_strided_False_use_torch_compile_True_cuda`) it passes. I suspect some autotuning cache has to be reset between runs, but don't know what to look for.
Edit: that error is "fixed" by setting `dynamic=False`; now, with the correct meta function, something's wrong with dynamic shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153226
Approved by: https://github.com/kwen2501
2025-05-10 01:13:18 +00:00
639793c17e [pytorch] Expose c10_retrieve_device_side_assertion_info() for use by external code (#153211)
Summary: - Expose `c10_retrieve_device_side_assertion_info()` for use by external code.  The motivating use case is FBGEMM kernel launcher utilities, which add FBGEMM-specific context to the errors coming out of Torch DSA

Test Plan: OSS CI

Differential Revision: D74432771

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153211
Approved by: https://github.com/Skylion007
2025-05-10 01:08:45 +00:00
658aea980c [inductor] Rename knobs > triton_knobs in static_cuda_launcher (#153189)
Summary: A follow up from https://github.com/pytorch/pytorch/pull/152457 since I didn't address the comment then

Test Plan: CI

Differential Revision: D74421432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153189
Approved by: https://github.com/jamesjwu
2025-05-10 00:26:21 +00:00
fbb6412fdb Stop uploading sccache stats to benchmark database (#153285)
This is not used for anything atm and potentially bloats the size of the database
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153285
Approved by: https://github.com/clee2000, https://github.com/malfet
2025-05-10 00:17:38 +00:00
e6dccb036e Revert "Fix fake tensor caching when output has unbacked (#153034)"
This reverts commit 4f425a0397eb0c63b8864bb9b168a519dcfbebbe.

Reverted https://github.com/pytorch/pytorch/pull/153034 on behalf of https://github.com/malfet due to Broke pr_time_benchmarks, see d07fbd41e3/1 ([comment](https://github.com/pytorch/pytorch/pull/153034#issuecomment-2868100487))
2025-05-09 23:43:56 +00:00
4e24ee7283 Move mps_linear forward to use MPS kernels directly instead of MPSGraph (#152210)
This PR moves `mps_linear` to use MPSNDArrays and call into the MPS kernel directly instead of going through MPSGraph. It also adds a caching mechanism for reusing MPS kernels as there is also a small overhead attached to creating the kernel object.

The impact of the improvement is relatively more significant for small input kernels where the MPSGraph overhead represents a larger portion of the overall execution time of the operation but the speedup shows for both small and large input sizes as expected.

`mps_linear` before the changes:
```
input shapes: f32:[1,1,20], f32:[1,20]
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x109d67110>
func(*args, **kwargs)
  Median: 199.29 us
  IQR:    9.56 us (196.71 to 206.27)
  979 measurements, 1 runs per measurement, 1 thread

input shapes: f32:[1,1,5120], f32:[13284,5120]
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x1063b4510>
func(*args, **kwargs)
  Median: 979.29 us
  IQR:    25.29 us (964.83 to 990.13)
  205 measurements, 1 runs per measurement, 1 thread
```

`mps_linear` after the changes:
```
input shapes: f32:[1,1,20], f32:[1,20]
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x10693a190>
func(*args, **kwargs)
  Median: 176.08 us
  IQR:    15.02 us (172.42 to 187.44)
  1103 measurements, 1 runs per measurement, 1 thread

input shapes: f32:[1,1,5120], f32:[13284,5120]
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x10d524dd0>
func(*args, **kwargs)
  Median: 952.56 us
  IQR:    15.63 us (945.47 to 961.10)
  210 measurements, 1 runs per measurement, 1 thread
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152210
Approved by: https://github.com/kulinseth, https://github.com/malfet

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2025-05-09 23:41:23 +00:00
d07fbd41e3 [BE][MPS] Use squeeze/unsqueeze in Linear (#153288)
Instead of views, to reshape weight to 2D tensor if necessary

Already tested by `test_linear_1d_weight`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153288
Approved by: https://github.com/wdvr
2025-05-09 23:34:54 +00:00
ee326137a9 [reland] Add graph module runtime asserts to AOTI (#153182)
Summary:
Solves https://github.com/pytorch/pytorch/issues/151925

A reland of https://github.com/pytorch/pytorch/pull/152125.

Added a try-except around the justknob internally. Also added more documentation.

Currently, AOTI only generates runtime asserts for unbacked symints. We should generate asserts for all `_assert_scalar` calls in the input graph.

Also factored out the run time assertion logic to a separate function.

We need to generate runtime asserts directly in Inductor instead of just re-using the asserts from input graphs because we reuse the same ShapeEnv as before. In particular, on subsequent graph passes, we would immediately turn all of these assertions into noops,
because when we evaluated their expressions, we would see that because we had a deferred runtime assert in the ShapeEnv, we know "oh, of course this expression is True" already.

One example is below:
```
        class Model(torch.nn.Module):
            def forward(self, a, b, c):
                nz = torch.nonzero(a)
                ones = a.new_ones([nz.size(0), b.size(0)])
                torch._check(ones.size(0) >= 1)
                equals = torch.add(ones, c)
                return equals
        torch._dynamo.mark_dynamic(c, 0)
```
When we re-use the ShapeEnv in Inductor lowering, the check that a and nonzero have the same shape would be evaluated to True after we resolve unbacked bindings using the ShapeEnv.
See `test_unbacked_equals_input_size_runtime_assertion` in test_aot_inductor.

In addition to the Inductor generated runtime asserts, we also need the runtime asserts from the input graph, because some derived runtime asserts are not generated in Inductor. One example is below:
```
        class Model(torch.nn.Module):
            def forward(self, x):
                y = x.reshape(100, -1).clone()
                y = y + 1
                return y

        dynamic_shapes = {
            "x": {0: torch.export.Dim.DYNAMIC},
        }
        # x.shape[0] needs to be a multiple of 100.
```
See `test_aoti_runtime_asserts_backed_symint` in test_aot_inductor.

Example:

```
    def forward(self):
        arg0_1: "f32[s35]";

        arg0_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
        sym_size_int: "Sym(s35)" = torch.ops.aten.sym_size.int(arg0_1, 0)

         #
        mod: "Sym(Mod(s35, 100))" = sym_size_int % 100;  sym_size_int = None
        eq_2: "Sym(Eq(Mod(s35, 100), 0))" = mod == 0;  mod = None
        _assert_scalar = torch.ops.aten._assert_scalar.default(eq_2, "Runtime assertion failed for expression Eq(Mod(s35, 100), 0) on node 'eq'");  eq_2 = _assert_scalar = None

         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
        view: "f32[100, (s35//100)]" = torch.ops.aten.reshape.default(arg0_1, [100, -1]);  arg0_1 = None
        clone: "f32[100, (s35//100)]" = torch.ops.aten.clone.default(view);  view = None

         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:12 in forward, code: y = y + 1
        add_6: "f32[100, 1]" = torch.ops.aten.add.Tensor(clone, 1);  clone = None
        return (add_6,)
```

Generated cpp code:

```
    auto inputs = steal_from_raw_handles_to_raii_handles(input_handles, 1);
    auto arg0_1 = std::move(inputs[0]);
    auto arg0_1_size = arg0_1.sizes();
    int64_t s35 = arg0_1_size[0];
    inputs.clear();
    auto& kernels = static_cast<AOTInductorModelKernels&>(*this->kernels_.get());
    if (!((s35 % 100L) == 0L)) { throw std::runtime_error("Expected Eq(Mod(s35, 100), 0) to be True but received " + std::to_string(s35)); }
```
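
For readers who prefer Python, the guard in the generated C++ above can be restated as a small function. This is only an illustrative sketch: `check_backed_symint` and its `rows` parameter are hypothetical names, and the error text simply copies the message shown above.

```python
def check_backed_symint(s35: int, rows: int = 100) -> int:
    """Mirror of the generated guard: s35 must be divisible by rows."""
    if s35 % rows != 0:
        raise RuntimeError(
            f"Expected Eq(Mod(s35, {rows}), 0) to be True but received {s35}"
        )
    # Size of the inferred dimension in x.reshape(rows, -1).
    return s35 // rows


inferred_dim = check_backed_symint(300)

try:
    check_backed_symint(250)
    rejected = False
except RuntimeError:
    rejected = True
```

An input whose first dimension is 250 is rejected before any kernel runs, which is the behaviour the runtime assert is meant to provide.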

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts_backed_symint
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchinductor_dynamic_shapes -- -r test_unbacked_floordiv_simplify
TORCHINDUCTOR_SCALAR_ASSERTS_FULL=1 buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r test_sym_i64_input_codegen_cuda
TORCHINDUCTOR_SCALAR_ASSERTS_FULL=1  buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r  test_unbacked_equals_input_size
```

Differential Revision: D74361799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153182
Approved by: https://github.com/henrylhtsang
2025-05-09 22:56:19 +00:00
298b43792b [RFC][inductor] Refactor AlgorithmSelectorCache to spit out make_precompile_fn (#153212)
Motivation is that `AlgorithmSelectorCache.__call__` is getting very long and hard to work with. There are nested layers of local functions in it. For example, we pass `precompile_fn`, a local variable, to `do_autotuning`, a local function, which already has a pointer to choices, a local variable, and then have `do_autotuning` call `choices` in `self.lookup`.

When I was trying to make changes to do_autotuning, I would get `UnboundLocalError: cannot access local variable 'choices' where it is not associated with a value`. I have no idea why it was even working in the first place.
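
As general Python background, and not a diagnosis of the actual `AlgorithmSelectorCache` code, the sketch below shows one common way nested local functions produce this error. A nested function may read a variable of the enclosing function that is assigned later, as long as the assignment happens before the call. Once the nested function itself assigns to that name anywhere in its body, the name becomes local to it and an earlier read fails. All function names here are hypothetical.

```python
def lookup_with_late_assignment():
    def do_autotuning():
        # `choices` is resolved in the enclosing scope when this runs.
        return len(choices)

    choices = ["a", "b"]  # assigned after the def, but before the call
    return do_autotuning()


def lookup_with_shadowing():
    choices = ["a", "b"]

    def do_autotuning():
        n = len(choices)       # fails: `choices` is local to do_autotuning
        choices = choices[:1]  # this assignment is what makes it local
        return n

    return do_autotuning()


late_result = lookup_with_late_assignment()

try:
    lookup_with_shadowing()
    raised = False
except UnboundLocalError:
    raised = True
```

Passing such values as explicit arguments avoids relying on these scoping rules.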

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153212
Approved by: https://github.com/eellison
2025-05-09 22:35:10 +00:00
37f92bbe0a [ROCm][CI] fix nightly build after rocm 6.4 upgrade (#153253)
rocm-smi now includes drm.h, and the libdrm-devel package was missing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153253
Approved by: https://github.com/jeffdaily, https://github.com/atalman

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-09 22:08:15 +00:00
9ae722cdb4 allocate cuMem memory with rdma flag (#153261)
to be able to register memory with ibverbs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153261
Approved by: https://github.com/kwen2501, https://github.com/eqy, https://github.com/Skylion007
2025-05-09 21:48:48 +00:00
f11d7a5978 [ROCm] Update spack includes (#152569)
* Cleans up code in `caffe2/CMakeLists.txt` to remove individual ROCm library include paths and use `ROCM_INCLUDE_DIRS` CMake var instead
* `ROCM_INCLUDE_DIRS` CMake var is set in `cmake/public/LoadHIP.cmake` by adding all the ROCm packages that PyTorch depends on
* `rocm_version.h` is provided by the `rocm-core` package, so use the include directory for that component to be compliant with Spack
* Move `find_package_and_print_version(hip REQUIRED CONFIG)` earlier so that `hip_version.h` can be located in the hip package include dir for Spack
* `list(REMOVE_DUPLICATES ROCM_INCLUDE_DIRS)` to remove duplicate `/opt/rocm/include` entries in the non-Spack case
* Remove user-provided env var `ROCM_INCLUDE_DIRS` since `ROCM_PATH` already exists as a user-provided env var, which should be sufficient to locate the include directories for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152569
Approved by: https://github.com/renjithravindrankannath, https://github.com/jeffdaily

Co-authored-by: Renjith Ravindran <Renjith.RavindranKannath@amd.com>
2025-05-09 21:36:38 +00:00
4f425a0397 Fix fake tensor caching when output has unbacked (#153034)
We handle fake tensor caching in two ways:
1. If the inputs have no symbols (SymInt, etc) then we cache on the FakeTensorMode.
2. If the inputs have symbols then we cache on the ShapeEnv.

This way the symbols in the inputs and outputs are associated with the guards in place at the time of the call.

However - it's possible to have an op where there are no symbols in the inputs but there is an unbacked symbol in the output.  In this case we shouldn't cache at all because what would that really mean?

So this PR changes the caching behavior so that if there's a symbol in the output which doesn't come in some way from the input then we refuse to cache that op.

Added a test which checks for this case.

While in there I also did a couple other related changes:
1. Added negative caching - if we see that an (op, args) failed to cache previously we don't even bother trying to cache it again.
2. Reworked the inner behavior of _cached_dispatch_impl a little to make it more clear which bits we expect to be able to throw _BypassDispatchCache and add some comments.
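
The negative-caching idea can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the `FakeTensorMode` implementation: `DispatchCache`, `BypassCache`, and the `fresh_symbol` flag are hypothetical, with the flag standing in for "the output contains a symbol that does not come from the inputs".

```python
class BypassCache(Exception):
    """Hypothetical signal that a result must not be cached."""


class DispatchCache:
    def __init__(self):
        self.hits = {}            # key -> cached result
        self.uncacheable = set()  # negative cache of keys that failed before
        self.validations = 0      # how often cacheability was checked

    def _validate(self, key, result):
        self.validations += 1
        if result["fresh_symbol"]:
            raise BypassCache(key)

    def dispatch(self, key, compute):
        if key in self.uncacheable:
            # Known to be uncacheable: do not even try again.
            return compute()
        if key in self.hits:
            return self.hits[key]
        result = compute()
        try:
            self._validate(key, result)
        except BypassCache:
            self.uncacheable.add(key)
            return result
        self.hits[key] = result
        return result


calls = []


def nonzero_like():
    calls.append("nonzero")
    return {"shape": ("u0",), "fresh_symbol": True}


def add_like():
    calls.append("add")
    return {"shape": (4,), "fresh_symbol": False}


cache = DispatchCache()
for _ in range(2):
    cache.dispatch(("nonzero", "arg"), nonzero_like)
    cache.dispatch(("add", "arg"), add_like)
```

In this sketch the uncacheable op is recomputed on every call but validated only once, while the cacheable op is computed once and then served from the cache.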

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153034
Approved by: https://github.com/masnesral, https://github.com/tugsbayasgalan
2025-05-09 21:17:54 +00:00
cbb03e6971 [BE][DTensor] move torch.distributed._tensor import to torch.distributed.tensor in test files (#153225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153225
Approved by: https://github.com/kwen2501, https://github.com/fegin
2025-05-09 20:40:54 +00:00
3976e52264 Fix torch.isin decomposition for scalar inputs (#153216)
This patch fixes a corner case of `torch.isin` decomposition when both
inputs are scalars. This pattern showed up from #141196.

Fixes #141196.

Error stack before this patch:
```
  File "/home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py", line 12503, in test_scalar_isin_decomposition
    res = opt_f()
          ^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/eval_frame.py", line 691, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/output_graph.py", line 1618, in _call_user_compiler
    raise BackendCompilerFailed(
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/output_graph.py", line 1593, in _call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/repro/after_dynamo.py", line 150, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/__init__.py", line 2365, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_inductor/compile_fx.py", line 2317, in compile_fx
    return aot_autograd(
           ^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/backends/common.py", line 106, in __call__
    cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 1179, in aot_module_simplified
    compiled_fn = AOTAutogradCache.load(
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 923, in load
    compiled_fn = dispatch_and_compile()
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 1164, in dispatch_and_compile
    compiled_fn, _ = create_aot_dispatcher_function(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 576, in create_aot_dispatcher_function
    return _create_aot_dispatcher_function(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 826, in _create_aot_dispatcher_function
    compiled_fn, fw_metadata = compiler_fn(
                               ^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 180, in aot_dispatch_base
    fw_module, updated_flat_args, maybe_subclass_meta = aot_dispatch_base_graph(  # type: ignore[misc]

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 2199, in _trace_inner
    t = dispatch_trace(
        ^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_compile.py", line 51, in inner
    return disable_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/eval_frame.py", line 872, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1223, in dispatch_trace
    graph = tracer.trace(root, concrete_args)  # type: ignore[arg-type]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/eval_frame.py", line 872, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/_symbolic_trace.py", line 850, in trace
    (self.create_arg(fn(*args)),),
                     ^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1278, in wrapped
    out = f(*tensors)  # type:ignore[call-arg]
          ^^^^^^^^^^^
  File "<string>", line 1, in <lambda>
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 720, in inner_fn
    outs = fn(*args)
           ^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 419, in _functionalized_f_helper
    f_outs = fn(*f_args)
             ^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 81, in inner_fn
    outs = fn(*args)
           ^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 902, in functional_call
    out = PropagateUnbackedSymInts(mod).run(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/interpreter.py", line 171, in run
    self.env[node] = self.run_node(node)
                     ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7387, in run_node
    result = super().run_node(n)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/interpreter.py", line 240, in run_node
    return getattr(self, n.op)(n.target, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/interpreter.py", line 320, in call_function
    return target(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1326, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_subclasses/functional_tensor.py", line 511, in __torch_dispatch__
    outs_unwrapped = func._op_dk(
                     ^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1428, in __torch_dispatch__
    return proxy_call(self, func, self.pre_dispatch, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 797, in proxy_call
    r = maybe_handle_decomp(proxy_mode, func, args, kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 2358, in maybe_handle_decomp
    out = CURRENT_DECOMPOSITION_TABLE[op](*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_prims_common/wrappers.py", line 309, in _fn
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_decomp/decompositions.py", line 5108, in isin
    return isin_default(elements, test_elements, invert=invert)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_decomp/decompositions.py", line 5137, in isin_default
    x = elements.view(*elements.shape, *((1,) * test_elements.ndim))
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
TypeError: view() received an invalid combination of arguments - got (), but expected one of:
 * (torch.dtype dtype)
 * (tuple of ints size)

While executing %isin : [num_users=1] = call_function[target=torch.isin](args = (%x, %x), kwargs = {})
GraphModule: class GraphModule(torch.nn.Module):
    def forward(self):
         # File: /home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py:12498 in f, code: x = torch.tensor(0)
        x: "i64[][]" = torch.tensor(0)

         # File: /home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py:12499 in f, code: return torch.isin(x, x)
        isin: "b8[][]" = torch.isin(x, x);  x = None
        return (isin,)

Original traceback:
  File "/home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py", line 12499, in f
    return torch.isin(x, x)
```
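
The `got ()` in the error message follows directly from the expression on the failing line. The sketch below evaluates the same argument expression with plain tuples and integers standing in for the tensor's `shape` and `ndim`; `view_args` is a hypothetical helper, and the sketch does not show how the patch resolves the problem.

```python
def view_args(elements_shape: tuple, test_elements_ndim: int) -> tuple:
    # Same argument expression as the failing line in the traceback.
    return (*elements_shape, *((1,) * test_elements_ndim))


vector_vs_vector = view_args((3,), 1)  # a size, padded with one trailing 1
scalar_vs_scalar = view_args((), 0)    # nothing left to pass to view()
```

With two 0-dim inputs both unpacked parts are empty, so `view()` is called with no arguments at all, which matches the `TypeError` above.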

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153216
Approved by: https://github.com/williamwen42, https://github.com/peterbell10
2025-05-09 20:26:25 +00:00
180cbf46f2 Fix 'TensorBox' object has no attribute 'is_input_buffer' (#152980)
Summary: Fix for https://fb.workplace.com/groups/1075192433118967/permalink/1664491270855744/

Test Plan: Used reproducer from D74262030

Differential Revision: D74270090

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152980
Approved by: https://github.com/Skylion007, https://github.com/eellison
2025-05-09 19:58:32 +00:00
d808a3e203 [dynamic shapes] guard_or_false for computeStorageNbytes (#150483)
removes fast path for computing storage, fixes some adjacent tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150483
Approved by: https://github.com/laithsakka
2025-05-09 19:31:19 +00:00
fe11d300ac [nativert] Improve MPMCQueue tests. (#153154)
Summary:
- Use std::this_thread::yield and stop busy waiting.
- Sort test file orders.

Following up @swolchok's comment from https://github.com/pytorch/pytorch/pull/152837
Test Plan: CI

Differential Revision: D74402536

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153154
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-09 19:25:42 +00:00
287b1ca30c [Ez][BE]: Ensure matplotlib remains optional dependency via fake_quantize (#153244)
Unblocks #153055 and ensures that matplotlib always remains optional in PyTorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153244
Approved by: https://github.com/albanD
2025-05-09 19:19:30 +00:00
90fde0dc09 [ONNX] Support sym_float (#153200)
Fixes #153115

Note: torch.sym_int is not supported in this PR because it does not appear in the exported program; instead, it's `torch.ops.aten.sym_size.int()`.

```
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, x: "f32[s35, s16]"):
             #
            sym_size_int_1: "Sym(s35)" = torch.ops.aten.sym_size.int(x, 0);  x = None
            return (sym_size_int_1,)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153200
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-05-09 19:10:17 +00:00
da0b89bcbf Scheduler Flops refactor (#152708)
This refactors `estimate_flops` and `get_estimated_runtime` on scheduler nodes:
1. New function on BaseSchedulerNode: `estimate_flops`. Works with all types of ir nodes now, not just `ExternalKernels`.
1. Extends `get_estimated_runtime` to work with non-`ExternalKernels`.

Prelude to: https://github.com/pytorch/pytorch/pull/149697

Testing:
New unit tests cover functionality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152708
Approved by: https://github.com/xmfan, https://github.com/eellison
2025-05-09 19:01:43 +00:00
073b0257ba [Graph Partition] Maintain relative order within partition during reordering (#153111)
PR #151968 adds `reorder_for_minimizing_partition` for the minimal number of partitions. If reordering two nodes cannot reduce the number of partitions, `reorder_for_minimizing_partition` should maintain the relative order of these two nodes and rely on other reorder passes for some nice features, such as shorter liveness duration or less peak memory. In an extreme case, when all nodes are on gpu and can be cudagraphed, `reorder_for_minimizing_partition` should not reorder any nodes.

This PR improves `reorder_for_minimizing_partition` to uphold the invariant that the relative order of nodes within the same graph partition is maintained. To do so, we record the index of each node in the input `nodes: list[BaseSchedulerNode]` and use a heap to pop the node with the smallest index. So we always schedule a node with a smaller index first within the same graph partition, which respects the invariant. The previous implementation tried to use a queue to achieve that but failed, because node_N at the end may rely on node_1 at the start, such that node_N is added to the queue once node_1 is scheduled.
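
The heap idea can be illustrated in isolation. The sketch below is a generic index-ordered topological sort, not the actual `reorder_for_minimizing_partition` pass: it ignores partitions entirely, and `stable_topological_order` with its `nodes` / `deps` arguments is a hypothetical helper. Among all nodes whose dependencies are satisfied, it always pops the one with the smallest original index.

```python
import heapq


def stable_topological_order(nodes, deps):
    """nodes: names in original order; deps: name -> set of names it needs."""
    index = {name: i for i, name in enumerate(nodes)}
    remaining = {name: len(deps.get(name, ())) for name in nodes}
    users = {name: [] for name in nodes}
    for name in nodes:
        for dep in deps.get(name, ()):
            users[dep].append(name)

    # Heap of original indices of nodes that are ready to be scheduled.
    ready = [index[name] for name in nodes if remaining[name] == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        name = nodes[heapq.heappop(ready)]  # smallest original index first
        order.append(name)
        for user in users[name]:
            remaining[user] -= 1
            if remaining[user] == 0:
                heapq.heappush(ready, index[user])
    return order


# "c" depends on "a"; nothing forces a reordering, so the input order is kept.
order = stable_topological_order(["a", "b", "c", "d"], {"c": {"a"}})
```

With a FIFO queue instead of the heap, `c` would only be appended after `a` is scheduled, behind `b` and `d`, giving `a, b, d, c` even though no dependency requires that change.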

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153111
Approved by: https://github.com/eellison
2025-05-09 18:49:53 +00:00
ec24f8f58a Format all headers under ATen/cpu/vec, not just top-level (#152364)
not formatting these seems like an oversight. Had to add a few clang-format suppressions to keep includes in the same order to avoid breaking builds.

This PR was generated using `lintrunner --paths-cmd "rg --files -g '*.h' aten/src/ATen/cpu/vec/" format`

Differential Revision: [D73802128](https://our.internmc.facebook.com/intern/diff/D73802128/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152364
Approved by: https://github.com/malfet, https://github.com/Skylion007, https://github.com/CaoE
2025-05-09 18:46:07 +00:00
76e34e3850 [Kineto] Upgrade the kineto commit to fb36cce (#152007)
XPU intends to upgrade the oneAPI version (https://github.com/pytorch/pytorch/issues/151097) to support torch Distributed. However, the PTI within the oneAPI to be upgraded introduces breaking changes. It changed the signature of the following APIs.
- ptiViewEnableRuntimeApi
- ptiViewGetApiIdName

To avoid the breaks due to the PTI upcoming non-backward-compatible changes, we refined the XPU PTI integration with the kineto. We check the PTI version and then invoke the PTI API accordingly. It means that the kineto of this PR can overcome the non-backward-compatible issue for the sake of the upcoming oneAPI 2025.1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152007
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/sraikund16, https://github.com/malfet
2025-05-09 18:38:41 +00:00
192f7140d1 [fbgemm_gpu] Replace C10_CUDA_KERNEL_LAUNCH_CHECK() in the KernelLauncher (#153178)
Summary:
- Replace `C10_CUDA_KERNEL_LAUNCH_CHECK()` in the `KernelLauncher`, as the
  latter does not print __FILE__ and __LINE__

The existing `C10_CUDA_KERNEL_LAUNCH_CHECK()` implementation does not print the source file and line number when a CUDA kernel launch throws an error, leaving users confused with a context-less message like `CUDA error: invalid arguments`.  This new check is a slimmed re-implementation of the macro with extra context information added to the error (beyond just file and line number) so that we can at least locate the FBGEMM source file or template where the error first surfaces.

Test Plan:
```
buck2 run 'fbcode//mode/opt' fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:kernel_launcher

buck2 run 'fbcode//mode/opt-amd-gpu' fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:kernel_launcher
```

Reviewed By: sryap

Differential Revision: D74364031

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153178
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-05-09 17:43:16 +00:00
595e21a9dd [cutlass-3] Add cutlass key for fbcode and OSS (#153081)
Differential Revision: [D74337959](https://our.internmc.facebook.com/intern/diff/D74337959/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153081
Approved by: https://github.com/drisspg
2025-05-09 17:38:31 +00:00
ffda46e3be [Graph Partition] remove weak dep from partition_input_names (#152863)
Graph partition analyzes read_writes to get partition input names. However, a weak dep is a fake dependency and is not actually read or written. So we should not include weak deps in graph partition input names.

The following test failure is fixed by removing weak dependency from partition_input_names:
`PYTORCH_TEST_WITH_INDUCTOR=1 python test/test_torch.py TestTorchDeviceTypeCUDA.test_params_invalidated_with_grads_invalidated_between_unscale_and_step_Adam_cuda_float32`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152863
Approved by: https://github.com/eellison
2025-05-09 17:20:04 +00:00
286de0d601 [CI] Enable XCCL in XPU CI build (#150927)
As XCCL has been enabled for torch xpu, enable it in CI build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150927
Approved by: https://github.com/EikanWang, https://github.com/cyyever, https://github.com/atalman
2025-05-09 17:12:34 +00:00
e73a4c3643 [BE][CI] Merge regular and MPS test config shards (#152719)
Unsure why they were separate to begin with
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152719
Approved by: https://github.com/seemethere, https://github.com/atalman
ghstack dependencies: #153013, #153057
2025-05-09 17:01:35 +00:00
309ecb2277 [CI] Add opt-in h100 tests (#153170)
So far only run:
 - inductor/test_fp8.py
 - test_matmul_cuda.py
 - inductor/test_max_autotune.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153170
Approved by: https://github.com/drisspg, https://github.com/eellison
2025-05-09 17:01:05 +00:00
8ea95d2e73 [inductor] dtype promotion error in cat decomp (#152995)
cloning single tensor wasn't following dtype promotion rules
for SAM model: https://github.com/pytorch/pytorch/issues/152606

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152995
Approved by: https://github.com/yushangdi, https://github.com/eellison
2025-05-09 16:58:58 +00:00
e21ff9c3be Add logging for guard miss failure (#153125)
Differential Revision: [D74371381](https://our.internmc.facebook.com/intern/diff/D74371381/)

This PR adds some logging for guard misses to tlparse, so that we know when AOTAutogradCache and FxGraphCache miss due to guards.

Example tlparse result:
https://gist.github.com/jamesjwu/afa19335c0aee85b24546b13c1cf6427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153125
Approved by: https://github.com/oulgen, https://github.com/jingsh
2025-05-09 16:51:04 +00:00
9d00f2b375 [autograd][docs] Add more details on why save_for_backward is important in extending autograd note (#153005)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153005
Approved by: https://github.com/albanD
2025-05-09 16:36:57 +00:00
50657120a0 Allow workflows to opt-out of experiments (#153085)
This change adds support to allow workflows to opt-out of experiments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153085
Approved by: https://github.com/ZainRizvi

Co-authored-by: Zain Rizvi <ZainRizvi@users.noreply.github.com>
2025-05-09 16:34:46 +00:00
18e13a67ce [dynamo] Harden torch function dispatchability check for attributes and methods access (#153082)
See more details in
https://github.com/pytorch/pytorch/issues/151771#issuecomment-2836372110.

Fixes #151771.

Differential Revision: [D74342291](https://our.internmc.facebook.com/intern/diff/D74342291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153082
Approved by: https://github.com/mlazos
2025-05-09 16:14:23 +00:00
c227865720 [AOTInductor] Fix state of ConstantFolding (#153152)
Summary:
Bug fix for constant folding states. We are not setting the correct state for each update.
One race condition would be:
(1) All threads obtain the model_exec_lock from main run.
(2) In the second round of updating the constant buffer, we should have set secondary as INITIALIZED, but primary is mistakenly set instead.
(3) run_const_fold gets called and a model_exec_lock is obtained, waiting for availability at this time.
(4) main run enters INITIALIZED, waiting for unique_lock (while a shared_lock is being held by (3) at this moment)

Test Plan:
TBD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153152
Approved by: https://github.com/jingsh, https://github.com/chenyang78
2025-05-09 16:03:05 +00:00
f2ea63658f Refactor nested benchmarking functions in select_algorithm.py (#153084)
Summary: I'll need some of the benchmark-related functions surfaced so I can use them for remote autotuning. This PR just lifts the main in-process benchmarking helpers to classmethods. It wasn't strictly necessary to also move the sub-process benchmarking helper, but I think it improves readability. Also added some missing types.

Test Plan: Existing unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153084
Approved by: https://github.com/aorenste, https://github.com/eellison
2025-05-09 15:09:51 +00:00
916f6bafe7 Fix HF loading when there's no metadata file to work with fsspec (#152856)
Summary: HF loading when there is no metadata is an edge case for some users. We were previously calling safe_open(filename) to get the keys in the safetensors file, but this doesn't work with fsspec, when models have a different backend than local fs (ie. hf, s3 etc). This diff updates to open the file with fsspec.open() and then safetensors.deserialize() to get the keys

Test Plan: unit test and e2e test reading from hf

Differential Revision: D74181513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152856
Approved by: https://github.com/joecummings
2025-05-09 13:32:01 +00:00
e06a08059a Add device guard for xpu conv on multi device (#153067)
# Motivation
fixes https://github.com/pytorch/pytorch/issues/153022
The root cause is that the XPU backend registers the convolution op using `m.impl`, which bypasses the device guard logic typically added by the code generation system. This can lead to unexpected behavior if the current device isn't explicitly set.

# Additional Context
run the following script
```python
import torch
import torchvision.models as models

torch.manual_seed(0)

model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
model.eval()
data = torch.rand(1, 3, 224, 224)

device = torch.device('xpu:1')  # 'xpu:0'
model = model.to(device=device, dtype=torch.float16)
data = data.to(device, dtype=torch.float16)

with torch.no_grad():
    ret = model(data)
    print(ret)

print("Execution finished")
```
The output is
```bash
         -9.2102e-02, -7.7588e-01, -1.4111e+00, -9.2383e-01,  6.4551e-01,
         -6.0730e-03, -7.8271e-01, -1.1904e+00, -4.1602e-01,  3.2715e-02,
         -4.9854e-01, -6.3623e-01, -8.5107e-01, -6.8555e-01, -9.4434e-01,
         -8.8672e-01, -6.7969e-01, -6.9824e-01, -2.8882e-01,  2.0312e+00]],
       device='xpu:1', dtype=torch.float16)
Execution finished

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153067
Approved by: https://github.com/albanD, https://github.com/EikanWang
2025-05-09 09:41:51 +00:00
aca2c99a65 xpu: get xpu arch flags at runtime in cpp_extensions (#152192)
This commit moves the query for xpu arch flags to runtime when building SYCL extensions, which allows adjusting `TORCH_XPU_ARCH_LIST` at the Python script level. That's handy, for example, in a CI test that tries a few variants of the list.

CC: @malfet, @jingxu10, @EikanWang, @guangyey

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152192
Approved by: https://github.com/guangyey, https://github.com/gujinghui, https://github.com/albanD
2025-05-09 05:43:50 +00:00
9fa07340fd [Cutlass] Implement memory planning for EVT (#153177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153177
Approved by: https://github.com/henrylhtsang
ghstack dependencies: #153196, #150907
2025-05-09 05:39:05 +00:00
a3154ca34a [Cutlass] Changes to gemm template for EVT (#150907)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150907
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
ghstack dependencies: #153196
2025-05-09 05:39:05 +00:00
c54aa0da01 [Cutlass] Fix tests (#153196)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153196
Approved by: https://github.com/BoyuanFeng
2025-05-09 05:39:05 +00:00
34196301d5 Revert "[CI] Add opt-in h100 tests (#153170)"
This reverts commit f87a0fe2cae5be82ffd845fa7e6053396c8222d1.

Reverted https://github.com/pytorch/pytorch/pull/153170 on behalf of https://github.com/clee2000 due to workflow doesnt have right concurrency group? ([comment](https://github.com/pytorch/pytorch/pull/153170#issuecomment-2864951319))
2025-05-09 03:04:50 +00:00
eqy
b30d276abc [CUDA][cuBLASLt] Fix scale setting for allowFP16AccumulationCuBLAS true case (#153083)
Also add some missing `@onlyCUDA` / support check decorators in `test_matmul_cuda.py`
Should help resolve #151890

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153083
Approved by: https://github.com/janeyx99
2025-05-09 02:27:17 +00:00
10234ccefe xpu: rely on sycl/sycl.hpp to include bfloat16.hpp (#152562)
Fixes: https://github.com/intel/torch-xpu-ops/issues/1503

`sycl/ext/oneapi/bfloat16.hpp` header file is a DPC++ compiler internal header. It's not documented for usage (see extension specification linked below) and is not guaranteed to exist. Instead, the documented usage of the extension suggests relying on including `sycl/sycl.hpp`, which in turn includes the `bfloat16.hpp` header (which is an implementation detail).

We ran into issues by explicitly including the `bfloat16.hpp` sycl header within a user-facing production environment when the `intel-sycl-rt` wheel is installed (which is a dependency of the `torch` wheel package built and publicly available for xpu). The compiler includes this file from `intel-sycl-rt` and, due to `#pragma once` usage, its content is included as well, giving redefinitions of symbols in this file (the previous inclusion comes from `sycl/sycl.hpp`):
```
In file included from /workspace/lib/python3.12/site-packages/torch/include/c10/util/BFloat16.h:23:
/opt/intel/oneapi/compiler/2025.0/bin/compiler/../../include/sycl/ext/oneapi/bfloat16.hpp:60:23: error: redefinition of 'BF16VecToFloatVec'
   60 | template <int N> void BF16VecToFloatVec(const bfloat16 src[N], float dst[N]) {
      |                       ^
/workspace/include/sycl/ext/oneapi/bfloat16.hpp:60:23: note: previous definition is here
   60 | template <int N> void BF16VecToFloatVec(const bfloat16 src[N], float dst[N]) {
      |
```
While SYCL header files themselves can be improved (`#pragma once` dropped), we still must correct usage of sycl `bfloat16.hpp` header in pytorch, i.e. drop it. This fortunately helps to address the reported issue of redefinitions though follow up on compiler side is still required.

Also, `SYCL_EXT_ONEAPI_BFLOAT16_MATH_FUNCTIONS` used to cover inclusion of `sycl/sycl.hpp` does not make sense since it's defined in this very header. Thus, we should use `SYCL_LANGUAGE_VERSION` instead which is defined on compiler level.

See: f958dce280/sycl/doc/extensions/experimental/sycl_ext_oneapi_bfloat16_math_functions.asciidoc

CC: @EikanWang, @guangyey, @gujinghui

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152562
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD
2025-05-09 02:25:44 +00:00
faff387bfd Mini tutorial for provenance tracking (#152211)
as title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152211
Approved by: https://github.com/svekars, https://github.com/eellison, https://github.com/desertfire
2025-05-09 01:41:04 +00:00
f87a0fe2ca [CI] Add opt-in h100 tests (#153170)
So far only run:
 - inductor/test_fp8.py
 - test_matmul_cuda.py
 - inductor/test_max_autotune.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153170
Approved by: https://github.com/drisspg
2025-05-09 01:03:12 +00:00
ab829ec629 [dynamo][pr_time_benchmark] Add dynamo benchmark to stress test inlining (#153159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153159
Approved by: https://github.com/laithsakka
ghstack dependencies: #152883, #153105
2025-05-09 00:09:19 +00:00
cbcb57d09d [CI] Use sccache installed in docker image in xla build (#153002)
The edited comment should have the info.  The code change looks large, but it's copied from the install_cache script that our docker images use 6a8006472e/.ci/docker/common/install_cache.sh (L42)

Sccache stopped working on xla at some point near dec 17 2023.  I am not sure what commit caused it.  I think it was having trouble writing to the cache.

Either way, there is an sccache already installed on the docker image, so we should use that instead of a binary from s3 which we're probably no longer sure where it came from/what commit it was built from

The one in the docker image is installed here 69d438ee65/.github/upstream/Dockerfile (L61) and is also very old, so I have https://github.com/pytorch/xla/pull/9102 to update it

sccache still not writing properly, i will investigate, but xla build currently broken after the above xla pr, and this should fix it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153002
Approved by: https://github.com/malfet
2025-05-08 23:22:20 +00:00
0203f89cc1 Revert "[BE]: Add PEP621 project section to pyproject.toml (#153055)"
This reverts commit 5976419c6939207834492a1f5fba4a62f2c91b0d.

Reverted https://github.com/pytorch/pytorch/pull/153055 on behalf of https://github.com/malfet due to And failures seems related to this change, but I don't know how, see for example 7cb5c751c3/1 ([comment](https://github.com/pytorch/pytorch/pull/153055#issuecomment-2864664725))
2025-05-08 23:17:58 +00:00
7cb5c751c3 Fix the basic description of torch.min(), torch.max(), torch.all(), torch.any() (#152658)
Fixes #152176

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152658
Approved by: https://github.com/malfet
2025-05-08 22:59:14 +00:00
5683965f02 [ROCm] Maxpool forward NHWC Perf Improvement targeting Resnet scenarios (#151727)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151727
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/eqy
2025-05-08 22:38:23 +00:00
5dd746b4b5 [c10d] Reduce test verbosity (#153116)
We have been seeing a lot of `Starting event listener thread for rank` in test print-outs recently. Moving them to `logger.debug`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153116
Approved by: https://github.com/fduwjj
2025-05-08 22:22:22 +00:00
5a8c9c3ab0 [FSDP2][Doc] add pointer to torchtitan (#153079)
<img width="838" alt="Screenshot 2025-05-08 at 10 51 05 AM" src="https://github.com/user-attachments/assets/4cf43a16-3801-424b-a74f-ede1d41ff052" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153079
Approved by: https://github.com/mori360
2025-05-08 22:22:07 +00:00
88b56774bd At least one of ROCM_HOME or CUDA_HOME must be None (#152236)
Copied description by @hj-wei from
https://github.com/ROCm/pytorch/pull/1809

> Hi all, I manually generating nvcc to bypass NVIDIA component
checks(Megatron-LM),
see
2da43ef4c1/megatron/legacy/fused_kernels/__init__.py (L57)

> but it can lead to incorrect CUDA_HOME configurations. This can cause
initialization anomalies in downstream libraries like DeepSpeed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152236
Approved by: https://github.com/jeffdaily
2025-05-08 22:20:25 +00:00
4064062e18 [c10d] Test multiple CUDA Graph captures (#150040)
1. Do multiple captures
2. Perform multiple collectives in one capture
3. Multiple replays (existing)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150040
Approved by: https://github.com/fduwjj
2025-05-08 22:14:03 +00:00
d9dc6b56ec Support using SymInt shapes for torch.baddbmm no-broadcast case (#153112)
A typical `bmm` kernel in Helion needs to pass in symint shapes to `torch.baddbmm`. Currently `self.expand((dim1, dim2, dim3))` in baddbmm runs unconditionally and it doesn't work with symint shapes (it raises the following error):
```
Traceback (most recent call last):
  File "/home/willfeng/local/helion_yf225/helion/_compiler/type_propagation.py", line 699, in propagate_call
    CheckForIndexCalls.retry_call(self.value, proxy_args, proxy_kwargs),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/helion_yf225/helion/_compiler/tile_index_proxy.py", line 104, in retry_call
    return fn(*proxy_args, **proxy_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1338, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1986, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1450, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 2645, in _dispatch_impl
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_ops.py", line 806, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_prims_common/wrappers.py", line 309, in _fn
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_meta_registrations.py", line 2172, in meta_baddbmm
    self = self.expand((dim1, dim2, dim3))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: /home/willfeng/local/pytorch/build/aten/src/ATen/RegisterCompositeExplicitAutograd_0.cpp:5025: SymIntArrayRef expected to contain only concrete integers
```
This PR changes it so that we don't run `expand()` when not necessary, which makes the Helion use case (i.e. no broadcasting) work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153112
Approved by: https://github.com/jansel
2025-05-08 21:34:24 +00:00
4166373908 [dynamic shapes] guard_or_false for infer_size (#152146)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152146
Approved by: https://github.com/laithsakka
2025-05-08 21:27:22 +00:00
5976419c69 [BE]: Add PEP621 project section to pyproject.toml (#153055)
Follow up to @ezyang's PR #153020 , but better uses PEP621 to reduce redundant fields and pass through metadata better to uv, setuptools, poetry and other tooling.

* Enables modern tooling like uv sync and better support for tools like poetry.
* Also allows us to set project-wide settings that are respected by linters and IDEs (in this example we are able to centralize the minimum supported Python version).
* Currently most of the values are dynamically fetched from setuptools, eventually we can migrate all the statically defined values to pyproject.toml and they will be autopopulated in the setuptool arguments.
* This controls what additional metadata shows up on PyPi . Special URL Names are listed here for rendering on pypi: https://packaging.python.org/en/latest/specifications/well-known-project-urls/#well-known-labels

This also clearly shows us which fields will need to be migrated to pyproject.toml over time from setup.py per #152276. Static fields should be fairly easy to migrate; the dynamically built ones like requirements are a bit more challenging.

Without this, `uv sync` complains:
```
error: No `project` table found in: `pytorch/pyproject.toml`
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153055
Approved by: https://github.com/ezyang
2025-05-08 21:27:19 +00:00
9608e7fee9 [nativert] Address tooling setup for torch/nativert/ (#153164)
Summary:
As discussed with @malfet , we're porting nativert code to torch/nativert/.
Following up on some concerns over the new directory, I'm trying to set up the tooling on OSS so various things (like linters) can run on torch/nativert/ properly.

Test Plan: CI

Differential Revision: D74407808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153164
Approved by: https://github.com/dolpm, https://github.com/Skylion007
2025-05-08 21:11:33 +00:00
e820b05cab [inductor] Generate synthetic offsets appropriately for autotuning _scaled_grouped_mm (#152968)
Summary: The autotuner is using zero-filled tensors to autotune
_scaled_grouped_mm and that's not appropriate for the offsets tensor, since it
essentially corresponds to "no input" and thus yields invalid perf results.

We can't really use the actual input tensors, since we might be compiling this
op in the context of an entire graph.

So instead, I decided to create a synthetic offsets tensor assuming that each
group is (roughly) the same size.  I don't have data but I'd guess this
approach is OK for MoE since we're generally hoping to load-balance the
experts; I'm not sure how well it applies to other scenarios that might be more
heavy-tailed.
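
The equal-size assumption can be sketched as follows. This is an illustration only: `synthetic_offsets` is a hypothetical helper, and it assumes the offsets tensor holds cumulative group end positions along the grouped dimension; the actual Inductor code may differ.

```python
from itertools import accumulate


def synthetic_offsets(total_rows: int, num_groups: int) -> list[int]:
    """Hypothetical helper: split total_rows into roughly equal groups
    and return the cumulative end offset of each group."""
    base, remainder = divmod(total_rows, num_groups)
    # The first `remainder` groups get one extra row so the sizes sum to total_rows.
    sizes = [base + (1 if i < remainder else 0) for i in range(num_groups)]
    return list(accumulate(sizes))


offsets = synthetic_offsets(10, 4)
print(offsets)
```

Unlike an all-zero offsets tensor, every group here is non-empty, so a benchmark run over such input exercises real work per group.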

Test Plan:
```
pytest test_matmul_cuda.py -k test_scaled_grouped_gemm_
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152968
Approved by: https://github.com/ngimel
2025-05-08 21:07:04 +00:00
590965f92f [Graph Partition][Flex Attention] analyze symints from subgraph inputs and outputs (#152878)
Flex Attention may have symints in subgraph inputs and outputs. Existing code implicitly captures these symints but does not explicitly store them in TritonTemplateBuffer. This leads to an error when analyzing symints used in Flex Attention as a TritonTemplateBuffer. This PR fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152878
Approved by: https://github.com/drisspg
2025-05-08 20:25:35 +00:00
6ae7730eeb Use gcc13 in Manylinux 2.28 images (#152825)
Related to: https://github.com/pytorch/pytorch/issues/152426
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152825
Approved by: https://github.com/malfet
2025-05-08 20:04:48 +00:00
8b8051f6ed [Minimizer] Fix the path naming (#153130)
Summary:
Added some logging and captured the indexing. See below image.
{F1977773416}

This is why the saved module path is called `/tmp/jimwan/minimizer_a_acc.pt`

Now the updated module paths are `/tmp/jimwan/minimizer_addmm_default_103_acc.pt`.

Test Plan:
```
MTIAC_USE_DIST_REF_KERNELS=all  buck2 run @//mode/opt mtia/accuracy/minimizer:mtia_minimizer_runner --  --mode sequential  --compare_fn allclose  --pt_save_dir  /tmp/debug3  --atol 1e-4 --rtol 1e-4 --all_outputs --start_idx native_layer_norm_default_80 --end_idx getitem_272 2>&1 | tee ~/test.log
```
{F1977773610}

Reviewed By: qcyuan

Differential Revision: D74369107

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153130
Approved by: https://github.com/Skylion007
2025-05-08 19:59:52 +00:00
086e2c2399 [TEST][ATen][CUDA] Skip row-wise scaled matrix multiplication tests on sm_120+ (#152814)
The float8 row-wise scaled matmuls are not supported on Blackwell yet. This PR adds skips to those tests to decrease the noise on `sm_120+` machines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152814
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-05-08 19:34:20 +00:00
4b8b7c7fb9 [CI] Use cmake from pip instead of conda in CI docker images (#152537)
As in title

idk how the install_cmake script is used because I see it being called with 3.18 but when I look at the build jobs some say 3.18 and others 3.31

Just make everything install cmake via the requirements-ci.txt.  I don't know if the comment at 5d36485b4a/.ci/docker/common/install_conda.sh (L78) still holds, but pretty much every build has CONDA_CMAKE set to true, so I'm just defaulting to installing through pip

Also defaulting to 4.0.0 everywhere except the executorch docker build because executorch reinstalls 3.31.something
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152537
Approved by: https://github.com/cyyever, https://github.com/atalman, https://github.com/malfet
2025-05-08 18:58:10 +00:00
b3524080dc [AOTInductor] Generate kernels separately for const graph and main graph (#153040)
Summary:
We should generate the kernel for const graph and main graph separately.
The reason is that when we run autotuning, we would create separate
kernel calls and we should make sure that main graph also contains the
runner.

Test Plan:
python test/inductor/test_aot_inductor.py -k test_autotune_with_constant_folding

Differential Revision: [D74347765](https://our.internmc.facebook.com/intern/diff/D74347765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153040
Approved by: https://github.com/angelayi
2025-05-08 18:45:45 +00:00
e5f869999c [inductor] Fix ModularIndexing assumptions (#152993)
Fixes https://github.com/pytorch/pytorch/issues/151198.

Since the result of ModularIndexing can be zero due to the modulo
operation, we should not make any assumption about ModularIndexing
being positive
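
A minimal sketch of why the assumption is unsafe, using a plain-Python stand-in. `modular_indexing` is a hypothetical helper defined here and assumed to mirror the `(x // divisor) % modulus` meaning of ModularIndexing; it is not the Inductor implementation.

```python
def modular_indexing(x: int, divisor: int, modulus: int) -> int:
    """Hypothetical stand-in: (x // divisor) % modulus."""
    return (x // divisor) % modulus


# Even for strictly positive inputs the result can be zero, so the
# result is only known to be non-negative, not strictly positive.
values = [modular_indexing(x, 4, 8) for x in (1, 4, 32, 35)]
print(values)
```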

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152993
Approved by: https://github.com/yf225
2025-05-08 18:26:45 +00:00
d900c68ea6 c10d/gloo: add ibverbs backend (#153015)
Summary:
X-link: https://github.com/pytorch/gloo/pull/437

This provides a new "UnboundBuffer" implementation for Gloo ibverbs backend so it can be used with PyTorch.

This currently is passing basic tests such as `reduce_test` and `send_recv_test` but there are a number of failures. Putting this up for review so the follow up fixes are less of a mega PR and also so we can start doing some initial testing with this E2E with PyTorch.

Known issues:

* recv from any is not supported
* AllreduceBcubeBase2 is failing

Test Plan:
```
buck2 run mode/dbgo //gloo/test:send_recv_test_ibverbs
buck2 test //gloo/test:

GLOO_DEVICE_TRANSPORT=IBVERBS buck2 run @//mode/opt //caffe2/test/distributed:c10d -- -r '.*gloo.*' -f
```

We can't run any of the gloo tests in CI since none of our CI machines have ibverbs so they're disabled by default and need to be manually run.

Differential Revision: D73291471

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153015
Approved by: https://github.com/fduwjj
2025-05-08 18:26:29 +00:00
7cdf5048ea Fix evaluate_expr to include suppress_guards_tls in cache key (#152661)
ShapeEnv.evaluate_expr() behaves differently based on the (tls) global "suppress_guards" - so its cache key needs to include that value.

This came up because #152662 triggered it in the test `test/dynamo/test_exc.py::ExcTests::test_trigger_bisect_on_error` - fixing this caused that test to work again.
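
The caching issue can be illustrated with a toy model. Everything below (`_tls`, `suppress_guards`, `TinyShapeEnv`) is a hypothetical stand-in written for this sketch and is not the ShapeEnv code; it only shows why a thread-local flag that changes behavior has to be part of the cache key.

```python
import threading

_tls = threading.local()  # hypothetical stand-in for the thread-local state


def suppress_guards() -> bool:
    return getattr(_tls, "suppress_guards", False)


class TinyShapeEnv:
    """Toy model: evaluating an expression records a guard unless suppressed."""

    def __init__(self) -> None:
        self.guards: list[str] = []
        self._cache: dict[tuple[str, bool], bool] = {}

    def evaluate_expr(self, expr: str, result: bool) -> bool:
        # The flag is part of the key, so a result cached while guards were
        # suppressed is not reused when they are enabled again.
        key = (expr, suppress_guards())
        if key in self._cache:
            return self._cache[key]
        if not suppress_guards():
            self.guards.append(expr)
        self._cache[key] = result
        return result


env = TinyShapeEnv()
_tls.suppress_guards = True
env.evaluate_expr("s0 > 1", True)   # cached under ("s0 > 1", True), no guard
_tls.suppress_guards = False
env.evaluate_expr("s0 > 1", True)   # different key, so the guard is recorded
print(env.guards)
```

If the key were only `expr`, the second call would hit the cache and the guard would never be recorded.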

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152661
Approved by: https://github.com/laithsakka
2025-05-08 18:25:34 +00:00
30a3c5d970 Skip lintchecks for now (#153156)
As devs have been complaining that it's failing. Completely remove them from lint.yml, as https://github.com/pytorch/pytorch/pull/153157 moved them to nightly

See https://github.com/pytorch/pytorch/issues/152439  as well as https://github.com/pytorch/pytorch/issues/152884 and https://github.com/pytorch/pytorch/issues/152489 for more details

Was introduced in https://github.com/pytorch/pytorch/pull/152377
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153156
Approved by: https://github.com/albanD, https://github.com/ZainRizvi
2025-05-08 17:58:05 +00:00
e86b6b2a19 Add tests to check pretty print when padding is a string in C++ API (#153126)
Currently there are no tests to verify the behaviour of pretty print when padding is `torch::kSame` or `torch::kValid`. This PR just adds these tests to check for future regressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153126
Approved by: https://github.com/Skylion007
2025-05-08 17:55:25 +00:00
d36261d2e6 Revert "[dynamo] Avoid running torch.nn.Module.__call__ twice under torch.compile(mod) (#152740)"
This reverts commit 0886d402f155e0b34760a2906f4bd71c878fd98f.

Reverted https://github.com/pytorch/pytorch/pull/152740 on behalf of https://github.com/huydhn due to Discuss with the author to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/152740#issuecomment-2863779028))
2025-05-08 17:31:21 +00:00
34d424d813 Revert "[dynamo] Support delattr on result of torch.compile(module) (#152741)"
This reverts commit 6c025b5a8270e456405eccc26db1344ddd016d7b.

Reverted https://github.com/pytorch/pytorch/pull/152741 on behalf of https://github.com/huydhn due to Discuss with the author to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/152740#issuecomment-2863779028))
2025-05-08 17:31:21 +00:00
6a8006472e Fix doc cosineannealinglr 152081 (#152936)
## Summary

This PR updates the docstring for `CosineAnnealingLR` to accurately reflect its recursive learning rate schedule. The previous docstring displayed only the SGDR closed-form expression, which doesn't match the actual recursive implementation in code.

Changes:

- Added the recursive update formula used in `get_lr()`
- Retained the original closed-form SGDR expression for reference
- Clarified that warm restarts are not implemented in this scheduler
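
As a rough numerical check of how the two forms relate, the sketch below steps a recursive cosine update and compares it with the closed-form expression. The function names are assumptions made for this illustration, and the recursion is written from the general shape of such an update; it is not a copy of `get_lr()`.

```python
import math


def closed_form(t, base_lr, eta_min, t_max):
    # SGDR-style closed form, evaluated directly at step t.
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t / t_max))


def recursive_schedule(steps, base_lr, eta_min, t_max):
    lrs = [base_lr]
    for t in range(1, steps + 1):
        prev = lrs[-1]
        if (t - 1 - t_max) % (2 * t_max) == 0:
            # The previous step sat at the minimum, where the ratio below
            # would divide by zero, so an additive update is used instead.
            lr = prev + 0.5 * (base_lr - eta_min) * (1 - math.cos(math.pi / t_max))
        else:
            ratio = (1 + math.cos(math.pi * t / t_max)) / (
                1 + math.cos(math.pi * (t - 1) / t_max)
            )
            lr = eta_min + ratio * (prev - eta_min)
        lrs.append(lr)
    return lrs


t_max, base_lr, eta_min = 10, 0.1, 0.001
lrs = recursive_schedule(2 * t_max, base_lr, eta_min, t_max)
max_gap = max(
    abs(lr - closed_form(t, base_lr, eta_min, t_max)) for t, lr in enumerate(lrs)
)
print(max_gap)
```

The two agree only while nothing else modifies the learning rate between steps; the recursive form depends on the previous value, the closed form does not.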

This addresses confusion raised in issue #152081.

## Related issue

[#152081](https://github.com/pytorch/pytorch/issues/152081)

## Testing

Doc-only change. Ran pre-commit to verify formatting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152936
Approved by: https://github.com/janeyx99
2025-05-08 17:25:30 +00:00
3cd69350ed [export] Unflatten None (#153000)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153000
Approved by: https://github.com/pianpwk
2025-05-08 16:40:13 +00:00
7b806a8cb1 Revert "[inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)"
This reverts commit 93576351270383ca37deaec6b2417a33dc045a93.

Reverted https://github.com/pytorch/pytorch/pull/152353 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail an inductor test in trunk ([comment](https://github.com/pytorch/pytorch/pull/152353#issuecomment-2863657185))
2025-05-08 16:39:28 +00:00
cyy
d291fa8ecc Avoid std::chrono::system_clock (#153135)
This PR replaces most `std::chrono::system_clock` with `std::chrono::steady_clock` if the duration is used in condition variables. Ideally system clocks should be used only to log wall-clock times.

Some `high_resolution_clock` are also changed to `steady_clock` because its resolution is not required in the context.
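
The same principle in a Python analogue (an illustration only; the PR itself changes C++ code): deadlines for waits are computed from a monotonic clock, which is not affected by wall-clock adjustments, while the wall clock is kept for timestamps. `wait_for` is a hypothetical helper written for this sketch.

```python
import threading
import time


def wait_for(predicate, timeout_s: float, cond: threading.Condition) -> bool:
    """Wait until predicate() is true or timeout_s elapses.

    The deadline uses time.monotonic(), so a wall-clock jump (NTP sync,
    manual change) cannot shorten or extend the wait.
    """
    deadline = time.monotonic() + timeout_s
    with cond:
        while not predicate():
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            cond.wait(remaining)
        return True


cond = threading.Condition()
ready = wait_for(lambda: True, 0.01, cond)
timed_out = not wait_for(lambda: False, 0.01, cond)
# Wall-clock time is still the right choice for a log timestamp.
print(time.time(), ready, timed_out)
```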

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153135
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/malfet
2025-05-08 16:30:29 +00:00
fe8ebacee4 [ROCm] Upgrade ROCm CI to ROCm6.4 (#151368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151368
Approved by: https://github.com/jeffdaily, https://github.com/malfet

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-08 16:12:16 +00:00
05326b7e49 Revert "Add runtime asserts to AOTI (#152125)"
This reverts commit 834bc5e4148538b7544aafdf5b090d007600fbd6.

Reverted https://github.com/pytorch/pytorch/pull/152125 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/152125#issuecomment-2863554139))
2025-05-08 15:58:18 +00:00
1d3e8f326a [CI] Increase shards number for XPU ci UT tests (#149113)
The XPU CI tests hit a timeout issue (see https://github.com/pytorch/pytorch/actions/runs/14897047392/job/41842336828); this PR reduces the CI time cost
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149113
Approved by: https://github.com/etaf, https://github.com/EikanWang
2025-05-08 15:42:33 +00:00
8141b146ca Run URL linter on nightly only (#153157)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153157
Approved by: https://github.com/malfet
2025-05-08 15:32:42 +00:00
efa07df257 [c10d] Remove unordered PG destroy test (#153110)
torch.distributed does not support unordered ProcessGroup destroy. Removing the test.

Resolves #137507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153110
Approved by: https://github.com/fduwjj, https://github.com/fegin
2025-05-08 15:29:44 +00:00
500cbeee4e [dynamo][ca] support dynamic annotations on tensors in ListVariables/TupleVariables (#152119)
Together with https://github.com/pytorch/pytorch/pull/151962, FIXES https://github.com/pytorch/pytorch/issues/133575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152119
Approved by: https://github.com/jansel
ghstack dependencies: #151731, #151962
2025-05-08 15:12:16 +00:00
6dea8ef555 [ca] hide unused scalar int sizes from dynamo (#151962)
together with https://github.com/pytorch/pytorch/pull/151731, FIXES https://github.com/pytorch/pytorch/issues/113129 https://github.com/pytorch/pytorch/issues/146168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151962
Approved by: https://github.com/jansel
ghstack dependencies: #151731
2025-05-08 15:12:16 +00:00
8f380b239f [ca] mark scalar int sizes as dynamic via tensor wrapping (#151731)
This is the only way to support dynamic shapes on scalars right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151731
Approved by: https://github.com/jansel
2025-05-08 15:12:08 +00:00
a7ea115494 Revert "[CI] Use cmake from pip instead of conda in CI docker images (#152537)"
This reverts commit 941062894a1accfd472d0acd2716493e1f173bd7.

Reverted https://github.com/pytorch/pytorch/pull/152537 on behalf of https://github.com/malfet due to Sorry to revert this PR, but it broke doc builds, see 4976b1a3a8/1 ([comment](https://github.com/pytorch/pytorch/pull/152537#issuecomment-2863337268))
2025-05-08 14:53:34 +00:00
4976b1a3a8 Keep raw cubin file around in case it gets deleted underneath us (#153064)
This diff hardens StaticCudaLauncher in the event a cubin file gets deleted under us. We store the raw cubin on the static cuda launcher, and reload it as needed. On cold start, this can happen if the cubin file is created by triton, and gets deleted before we can load the kernel on the parent process.

We don't want to store the entire cubin both in file format and in memory for caching purposes, so we delete it before caching the data. In the unfortunate/unlikely event where we can't load/find the necessary file on warm start, skip the stored triton launcher, falling back to regular triton.

This comes at a cost to worker memory, but it's not more memory than regular triton workers already take, so it should be okay.
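
A minimal sketch of the "keep the raw bytes and reload on demand" idea, assuming a hypothetical `CubinHolder` class and a fake file payload; it is not the actual StaticCudaLauncher code, only an illustration of the fallback described above:

```python
import os
import tempfile


class CubinHolder:
    """Hypothetical stand-in for a launcher that keeps the raw cubin bytes."""

    def __init__(self, path):
        self.path = path
        # Read the bytes eagerly so a later deletion of the file is harmless.
        with open(path, "rb") as f:
            self.raw = f.read()

    def ensure_on_disk(self):
        # Rewrite the file from memory if it disappeared underneath us.
        if not os.path.exists(self.path):
            with open(self.path, "wb") as f:
                f.write(self.raw)
        return self.path


tmpdir = tempfile.mkdtemp()
cubin_path = os.path.join(tmpdir, "kernel.cubin")
with open(cubin_path, "wb") as f:
    f.write(b"\x7fELF-fake-cubin")  # placeholder payload, not a real cubin

holder = CubinHolder(cubin_path)
os.remove(cubin_path)               # simulate the file being deleted
restored = holder.ensure_on_disk()
with open(restored, "rb") as f:
    restored_bytes = f.read()
```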

Tests:
- Make test_static_cuda_launcher always delete the cubin path and reload it

Fixes #153030

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153064
Approved by: https://github.com/oulgen, https://github.com/jansel
2025-05-08 14:29:19 +00:00
13bdfe6577 get right function declaration on windows inductor (#152939)
Fixes #152251

`get_export_declaration` introduced one extra ')' on Windows, which makes the generated function declaration differ from the one on Linux.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152939
Approved by: https://github.com/xuhancn, https://github.com/jansel
2025-05-08 14:28:33 +00:00
0f9821d0e3 [BE][lint] fix PYFMT for PT-D code under torch.testing._internal, add them to the lint list (#153114)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153114
Approved by: https://github.com/cyyever, https://github.com/fegin, https://github.com/H-Huang, https://github.com/Skylion007
2025-05-08 14:01:49 +00:00
2926dd4d8e Stop proxy-ing autograd.Function.ctx into the graph (#152621)
We did this before because that's how our older
autograd.Function x Dynamo interaction worked, but we've since adopted
newer designs that don't actually need the autograd.Function.ctx proxied
into the graph.

We still need an fx.Proxy for the autograd.Function.ctx object, so
whenever we need one, I create it via discard_graph_changes.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152621
Approved by: https://github.com/oulgen
2025-05-08 13:32:54 +00:00
22c31046d4 Fixed rerr computation in lobpcg (#152789)
Fixes #101075

This PR fixes an issue with the computation of residuals in the LOBPCG algorithm.

**Bug**: [Line 788](8f54e56e62/torch/_lobpcg.py (L788)) is supposed to compute the denominator in Equation 9 of [Duersch et al., 2018](https://arxiv.org/abs/1704.07458), as also suggested in [line 776](8f54e56e62/torch/_lobpcg.py (L776)), but it uses the raw eigenvalue-estimates instead of their absolute values.

**Consequence**: This made the algorithm's success sensitive to initialization of eigenvectors.
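
To illustrate why the sign matters, here is a small arithmetic sketch. It assumes the relative residual has the form `||r|| / (||x|| * (||A|| + |lambda| * ||B||))`; the function name and the numbers are made up for illustration and are not taken from `torch/_lobpcg.py`:

```python
def relative_residual(r_norm, x_norm, a_norm, b_norm, eigenvalue, use_abs=True):
    # With use_abs=False the raw eigenvalue estimate enters the denominator,
    # which can shrink it or even flip its sign for negative eigenvalues.
    lam = abs(eigenvalue) if use_abs else eigenvalue
    return r_norm / (x_norm * (a_norm + lam * b_norm))


# Illustrative values: a negative eigenvalue estimate of -3.
buggy = relative_residual(1e-3, 1.0, 2.0, 1.0, -3.0, use_abs=False)
fixed = relative_residual(1e-3, 1.0, 2.0, 1.0, -3.0, use_abs=True)
```

With these numbers the unfixed variant yields a negative "residual", while the variant using the absolute value stays positive.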

**Tests**:
- I have tested @jtorde's [script](https://github.com/pytorch/pytorch/issues/101075#issuecomment-1545349559), and I did NOT run into any assertion errors for a few minutes (as opposed to the original implementation, which fails after a few seconds).
- I have also tried @pearu's specific [test case](https://github.com/pytorch/pytorch/issues/101075#issuecomment-1548483685), which also executes successfully - the residuals remain positive, and the final output is the same as one returned by SciPy (with and without enforcing the use of LOBPCG).
- I extracted the relevant test cases from [test/test_autograd.py](https://github.com/pytorch/pytorch/blob/main/test/test_autograd.py) and [test/test_linalg.py](https://github.com/pytorch/pytorch/blob/main/test/test_linalg.py), and they ran successfully.

Let me know if further test cases or benchmarks are needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152789
Approved by: https://github.com/pearu, https://github.com/lezcano
2025-05-08 12:22:31 +00:00
34d4363e6d [dynamo] Fix super and classmethod binding of cls object (#153105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153105
Approved by: https://github.com/jansel
ghstack dependencies: #152883
2025-05-08 12:07:08 +00:00
941062894a [CI] Use cmake from pip instead of conda in CI docker images (#152537)
As in title

idk how the install_cmake script is used because I see it being called with 3.18 but when I look at the build jobs some say 3.18 and others 3.31

Just make everything install cmake via the requirements-ci.txt.  I don't know if the comment at 5d36485b4a/.ci/docker/common/install_conda.sh (L78) still holds, but pretty much every build has CONDA_CMAKE set to true, so I'm just defaulting to installing through pip

Also defaulting to 4.0.0 everywhere except the executorch docker build because executorch reinstalls 3.31.something
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152537
Approved by: https://github.com/cyyever, https://github.com/atalman, https://github.com/malfet
2025-05-08 10:10:27 +00:00
bfc0920d95 [C10D] Move getNcclDataType into NCCLUtils (#153113)
Differential Revision: D74365214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153113
Approved by: https://github.com/ngimel
2025-05-08 08:54:05 +00:00
dfb91a627f Clean up of CUTLASS_VERSION (#152947)
Fixes #152847

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152947
Approved by: https://github.com/eqy, https://github.com/cyyever
2025-05-08 08:32:34 +00:00
9357635127 [inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)
Fixes #151930

This PR updates the `assert_size_stride` and `assert_alignment` functions in [guards.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp) to accept an optional `op_name` argument and includes it in the error messages.

The corresponding type stubs in [guards.pyi](https://github.com/pytorch/pytorch/blob/main/torch/_C/_dynamo/guards.pyi) are updated to match the new function arg.

[inductor/ir.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py) now extracts the operator name from the FX graph and passes it into the `codegen_size_asserts` and `codegen_alignment_asserts` functions, so that generated assertions in Triton code include the op name for better debugging.

Added unit tests inside [test_torchinductor.py](https://github.com/pytorch/pytorch/blob/main/test/inductor/test_torchinductor.py).
- Verified both successful and failing assertion cases include the operator name.
- Verified that generated Triton code contains the op name inside the asserts.
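
As a rough pure-Python analogue of the behavior described above (the real checks live in C++), the sketch below uses a hypothetical `check_size_stride` helper and an example op name to show how an optional `op_name` can be appended to the failure message:

```python
def check_size_stride(actual_size, actual_stride,
                      expected_size, expected_stride, op_name=None):
    """Hypothetical analogue of a size/stride assertion with an op name."""
    mismatch = (tuple(actual_size) != tuple(expected_size)
                or tuple(actual_stride) != tuple(expected_stride))
    if mismatch:
        msg = (
            f"expected size {tuple(expected_size)} and stride "
            f"{tuple(expected_stride)}, got size {tuple(actual_size)} "
            f"and stride {tuple(actual_stride)}"
        )
        if op_name is not None:
            # The op name is optional, so callers that omit it still work.
            msg += f" (op: {op_name})"
        raise AssertionError(msg)


message = ""
try:
    # Strides deliberately disagree so the assertion fires.
    check_size_stride((2, 3), (3, 1), (2, 3), (1, 2), op_name="aten.mm.default")
except AssertionError as e:
    message = str(e)
```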

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152353
Approved by: https://github.com/jansel, https://github.com/shunting314
2025-05-08 08:28:05 +00:00
4f9dd3c3e5 [cutlass backend] Fix EVT test for fbcode post cutlass 3.9.2 upgrade (#153106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153106
Approved by: https://github.com/mlazos
2025-05-08 08:20:40 +00:00
f9df09da08 [mm sampling] extract more triton information (#153099)
Summary:
# Why

capture more triton config information that was not being captured

# What

capture and extract

- group_m
- allow_tf32
- acc_type
- matrix_instr_nonkdim
- waves_per_eu
- kpack

to achieve this, add

- matrix_instr_nonkdim
- waves_per_eu
- kpack

to the info_dict of the TritonTemplateCaller

Test Plan:
with D74342290

```
buck2 run -c fbcode.rocm_arch=mi300 -m rocm621 mode/opt-amd-gpu  fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0 2>&1 | tee /tmp/tmp.52Igj8lthj/15.txt
```

(edited for clarity and brevity)

```
AutotuneMetrics03LogEntry(
    backend='Triton',
    exectime_ms=0.007449999917298555,
    perf_model_name='scripts.vandrei.pytorch_experiments.matmul_estimator_lib.estimate_matmul_time_new',
    perf_model_exectime_ms=0.009558684365573179,
    config_triton_block_m=16,
    config_triton_block_n=256,
    config_triton_block_k=128,
    config_triton_num_stages=2,
    config_triton_num_warps=8,
    config_triton_group_m=16,
    config_triton_allow_tf32='False',
    config_triton_acc_type='tl.float32',
    config_triton_matrix_instr_nonkdim=16,
    config_triton_waves_per_eu=1,
    config_triton_kpack=2,
    x_batch_dim=0,
    x_row_dim=8,
    x_col_dim=96,
    x_batch_stride=0,
    x_row_stride=96,
    x_col_stride=1,
    x_dtype='torch.float16',
    x_dtype_size=16,
    w_batch_dim=0,
    w_row_dim=96,
    w_col_dim=512,
    w_batch_stride=0,
    w_row_stride=512,
    w_col_stride=1,
    w_dtype='torch.float16',
    w_dtype_size=16,
    vendor='AMD',
    model='gfx942:sramecc+:xnack-',
    major=9,
    minor=4,
    sms=304,
    l2_cache=4194304,
    warp_size=64,
    regs_per_sm=65536,
    max_threads_per_sm=2048,
    total_mem=206141652992,
    hip_version='6.2.41134',
    triton_upstream_hash='3889f3f3b97b817741e308c173409927b7c4536f',
    environment='experiment-xzy-default',
    session_id='8a7001bd-652c-440c-bc56-4cb1e25146ea',
    [...]
)
```

Reviewed By: exclamaforte

Differential Revision: D74342286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153099
Approved by: https://github.com/exclamaforte, https://github.com/eellison
2025-05-08 07:24:28 +00:00
3c87529d23 Make device check error message more descriptive (#150750)
Fixes #122757

## Test Result

```python
import torch

model_output = torch.randn(10, 5).cuda()
labels = torch.randint(0, 5, (10,)).cuda()
weights = torch.randn(5)

loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
loss = loss_fn(input=model_output, target=labels)
print(loss)

Traceback (most recent call last):
  File "/home/zong/code/pytorch/../loss2.py", line 17, in <module>
    loss = loss_fn(input=model_output, target=labels)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/loss.py", line 1297, in forward
    return F.cross_entropy(
           ^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/functional.py", line 3494, in cross_entropy
    return torch._C._nn.cross_entropy_loss(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got weight is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_nll_loss_forward)

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150750
Approved by: https://github.com/malfet
2025-05-08 06:19:44 +00:00
c73bd990cf fix shard tensor gather when a local tensor on certain ranks has zero elements (#150914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150914
Approved by: https://github.com/fduwjj
2025-05-08 05:06:22 +00:00
94ca3a4666 Add torch._C.Tag.needs_contiguous_strides (#152859)
This tag forces Inductor to make the inputs contiguous.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152859
Approved by: https://github.com/eellison
2025-05-08 04:49:59 +00:00
2d25e4d478 [1/n][Optimus][Auto-AC] Support activation quantization without scaling (#148380)
Summary: We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize.

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten
```

Buck UI: https://www.internalfb.com/buck2/776d3911-bb86-4ac8-a527-540cf1510b9d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4785074873051017
Network: Up: 4.3MiB  Down: 42MiB  (reSessionID-fef7e727-68b1-4645-a519-5652854df38d)
Executing actions. Remaining     0/4                                                                                 6.7s exec time total
Command: test.     Finished 2 local
Time elapsed: 3:11.5s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

### how to enable (you can override the dtype; if none is given, the default is fp8)

```
post_grad_fusion_options={
            "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}
        },
```

Differential Revision: D70522237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148380
Approved by: https://github.com/Mingming-Ding, https://github.com/Hahu803
2025-05-08 04:44:15 +00:00
6f6fac6a41 [dynamo] Fix bug in hasattr(tensor, "size") (#152883)
Fixes https://github.com/pytorch/pytorch/issues/135696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152883
Approved by: https://github.com/StrongerXi
2025-05-08 01:16:01 +00:00
834bc5e414 Add runtime asserts to AOTI (#152125)
Summary:
Solves https://github.com/pytorch/pytorch/issues/151925

Currently, AOTI only generates runtime asserts for unbacked symints. We should generate asserts for all `_assert_scalar` calls in the input graph.

Also factored out the run time assertion logic to a separate function.

        We need to generate runtime asserts directly in Inductor instead
        of just re-using the asserts from input graphs because we reuse the
        same ShapeEnv as before. In particular, on subsequent graph passes,
        we would immediately turn all of these assertions into noops,
        because when we evaluated their expressions, we would see that
        because we had a deferred runtime assert in the ShapeEnv, we
        know "oh, of course this expression is True" already.
        One example is below:
```
        class Model(torch.nn.Module):
            def forward(self, a, b, c):
                nz = torch.nonzero(a)
                ones = a.new_ones([nz.size(0), b.size(0)])
                torch._check(ones.size(0) >= 1)
                equals = torch.add(ones, c)
                return equals
        torch._dynamo.mark_dynamic(c, 0)
```
        When we re-use the ShapeEnv in Inductor lowering, the check that checks
        a and nonzero have the same shape would be evaluated to True after we resolve
        unbacked bindings using the ShapeEnv.
        See test_unbacked_equals_input_size_runtime_assertion in test_aot_inductor.

        In addition to the Inductor generated runtime asserts, we also
        need the runtime asserts from the input graph, because some derived
        runtime asserts are not generated in Inductor. One example is
        below:
```
        class Model(torch.nn.Module):
            def forward(self, x):
                y = x.reshape(100, -1).clone()
                y = y + 1
                return y

        dynamic_shapes = {
            "x": {0: torch.export.Dim.DYNAMIC},
        }
        # x.shape[0] needs to be a multiple of 100.
```
        See test_aoti_runtime_asserts_backed_symint in test_aot_inductor.

Example:

```
    def forward(self):
        arg0_1: "f32[s35]";

        arg0_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
        sym_size_int: "Sym(s35)" = torch.ops.aten.sym_size.int(arg0_1, 0)

         #
        mod: "Sym(Mod(s35, 100))" = sym_size_int % 100;  sym_size_int = None
        eq_2: "Sym(Eq(Mod(s35, 100), 0))" = mod == 0;  mod = None
        _assert_scalar = torch.ops.aten._assert_scalar.default(eq_2, "Runtime assertion failed for expression Eq(Mod(s35, 100), 0) on node 'eq'");  eq_2 = _assert_scalar = None

         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
        view: "f32[100, (s35//100)]" = torch.ops.aten.reshape.default(arg0_1, [100, -1]);  arg0_1 = None
        clone: "f32[100, (s35//100)]" = torch.ops.aten.clone.default(view);  view = None

         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:12 in forward, code: y = y + 1
        add_6: "f32[100, 1]" = torch.ops.aten.add.Tensor(clone, 1);  clone = None
        return (add_6,)
```

Generated cpp code:

```
    auto inputs = steal_from_raw_handles_to_raii_handles(input_handles, 1);
    auto arg0_1 = std::move(inputs[0]);
    auto arg0_1_size = arg0_1.sizes();
    int64_t s35 = arg0_1_size[0];
    inputs.clear();
    auto& kernels = static_cast<AOTInductorModelKernels&>(*this->kernels_.get());
    if (!((s35 % 100L) == 0L)) { throw std::runtime_error("Expected Eq(Mod(s35, 100), 0) to be True but received " + std::to_string(s35)); }
```
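
For readers less familiar with the generated C++, the same guard can be sketched in plain Python. The helper name `check_divisible` is hypothetical; only the `Eq(Mod(s35, 100), 0)` condition and the error text mirror the snippet above:

```python
def check_divisible(s35, divisor=100):
    # Same shape as the generated guard: fail unless Mod(s35, divisor) == 0.
    if s35 % divisor != 0:
        raise RuntimeError(
            f"Expected Eq(Mod(s35, {divisor}), 0) to be True "
            f"but received {s35}"
        )
    # Size of the inferred (-1) dimension after reshape(100, -1).
    return s35 // divisor


cols = check_divisible(300)   # passes; the second dimension would be 3
try:
    check_divisible(250)      # 250 is not a multiple of 100
    failed = False
except RuntimeError:
    failed = True
```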

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts_backed_symint
```

Differential Revision: D73596786

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152125
Approved by: https://github.com/henrylhtsang, https://github.com/jingsh
2025-05-08 00:27:24 +00:00
20e2ca3e29 [Dynamo] Allow inlining into AO quantization modules (#152934)
This adds dynamo inlining into `torch.ao.quantization.fake_quantize`.

This is needed for QAT compatibility w/ an RL training model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152934
Approved by: https://github.com/williamwen42
2025-05-07 23:58:11 +00:00
5bf0c3518c Detect NVSHMEM location (#153010)
### Changes
- Detect NVSHMEM install location via `sysconfig.get_path("purelib")`, which typically resolves to `<conda_env>/lib/python/site-packages`, and NVSHMEM include and lib live under `nvidia/nvshmem`
- Added link dir via `target_link_directories`
- Removed direct dependency on mlx5
- Added preload rule (following other NVIDIA libs)
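
The lookup described in the first bullet can be sketched in Python as follows. The `include` and `lib` subdirectory names are an assumption based on the description above, and the variable names are illustrative only:

```python
import os
import sysconfig

# site-packages directory of the running interpreter
purelib = sysconfig.get_path("purelib")

# NVSHMEM wheel contents are expected under nvidia/nvshmem (per the text above).
nvshmem_home = os.path.join(purelib, "nvidia", "nvshmem")
include_dir = os.path.join(nvshmem_home, "include")   # assumed layout
lib_dir = os.path.join(nvshmem_home, "lib")           # assumed layout

# True only when the wheel is actually installed in this environment.
nvshmem_available = os.path.isdir(include_dir) and os.path.isdir(lib_dir)
```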

### Plan of Record
1. End user experience: link against NVSHMEM dynamically (NVSHMEM lib size is 100M, similar to NCCL, thus we'd like users to `pip install nvshmem` rather than torch carrying the bits)
2. Developer experience: at compile time, prefer a wheel dependency over using a Git submodule
General rule: submodule for small lib that torch can statically link with
If user pip install a lib, our CI build process should do the same, rather than building from Git submodule (just for its header, for example)
3. Keep `USE_NVSHMEM` to gate non-Linux platforms, like Windows, Mac
4. At configuration time, we should be able to detect whether nvshmem is available, if not, we don't build `NVSHMEMSymmetricMemory` at all.

For now, we have symbol dependency on two particular libs from NVSHMEM:
- libnvshmem_host.so: contains host side APIs;
- libnvshmem_device.a: contains device-side global variables AND device function impls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153010
Approved by: https://github.com/ngimel, https://github.com/fduwjj, https://github.com/Skylion007
2025-05-07 23:35:04 +00:00
df1ec045b5 [Cutlass] Add epilogue inputs/outputs to def_kernel (#151406)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151406
Approved by: https://github.com/eellison
ghstack dependencies: #152733, #150906
2025-05-07 23:09:02 +00:00
d483aefafa [Cutlass] Integrate EVT into CUDACPPScheduling (#150906)
Previously merged:
* #151713
* #151405
* #150905
* #152306
* #152305

Allow epilogue nodes in cuda combined scheduling

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150906
Approved by: https://github.com/eellison
ghstack dependencies: #152733
2025-05-07 23:09:02 +00:00
6b9d741e1c [Cutlass] Handle broadcasting in EVT python codegen (#152733)
Previously merged:
* #151713
* #151405
* #150905
* #152306
* #152305

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152733
Approved by: https://github.com/eellison
2025-05-07 23:09:02 +00:00
4270517cbf Fix test/test_optim.py error message. (#153076)
Fixes an error message in test/test_optim.py

Current behavior: If running the test with Adagrad, the error message reads: "SGD does not currently support capturable".

Fix: The error message now says correctly: "Adagrad does not currently support capturable".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153076
Approved by: https://github.com/janeyx99
2025-05-07 22:46:05 +00:00
7706074ece Fix TORCH_CHECK error message in FusedSgdKernel (#153074)
This fixes an issue in the TORCH_CHECK error message in the FusedSgdKernel.

Current behavior: If the LR tensor is not on the same device as the parameters, the error message reads: "found_inf must be on the same GPU device as the params".

Fix: The error message now correctly points out "lr must be on the same GPU device as the params".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153074
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
2025-05-07 22:10:09 +00:00
cecfc7dc53 [CUDA][cuDNN] Fix handling of CPU side input and target length tensors in CTCLoss (#152745)
https://github.com/pytorch/pytorch/pull/128271 migrated to cuDNN V8 CTCLoss which expects input and target length tensors to be on `CUDA` rather than `CPU` without adding the logic to account for the edge case of them being on `CPU`

see also #152421

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152745
Approved by: https://github.com/Skylion007
2025-05-07 22:01:18 +00:00
773a91c775 [ONNX] dynamic_shapes uses DYNAMIC (#153065)
Although Dim.AUTO covers the cases where a user sets more axes to be dynamic than the model actually needs, it silently falls back to STATIC when DYNAMIC fails. This increases the difficulty of debugging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153065
Approved by: https://github.com/justinchuby
2025-05-07 21:48:41 +00:00
a2891cba2f [cutlass backend] Skip cuda lib path if it is torch/lib (#153003)
Differential Revision: [D74284808](https://our.internmc.facebook.com/intern/diff/D74284808/)

This is a bit risky for cutlass backend, so decided to separate it out. Tested offline.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153003
Approved by: https://github.com/chenyang78
2025-05-07 21:28:15 +00:00
5bb154e6fd [nativert] Move MPMCQueue to torch/nativert. (#152837)
Summary:
Torch Native Runtime RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed.

This diff adds a small library implementing a multi producer multi consumer queue which will be used to synchronize tasks for Torch Native Runtime.
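
As an illustration of the multi producer multi consumer pattern only (the library in this diff is C++ and is not shown here), a minimal Python sketch using the standard `queue.Queue` might look like this; the sentinel-based shutdown is an assumption of the sketch:

```python
import queue
import threading

task_queue = queue.Queue(maxsize=8)   # bounded, thread-safe queue
results = []
results_lock = threading.Lock()
SENTINEL = None                       # shutdown marker for consumers


def producer(start):
    for i in range(start, start + 5):
        task_queue.put(i)             # blocks while the queue is full


def consumer():
    while True:
        item = task_queue.get()       # blocks while the queue is empty
        if item is SENTINEL:
            break
        with results_lock:
            results.append(item * 2)


producers = [threading.Thread(target=producer, args=(s,)) for s in (0, 100)]
consumers = [threading.Thread(target=consumer) for _ in range(2)]
for t in producers + consumers:
    t.start()
for t in producers:
    t.join()
for _ in consumers:
    task_queue.put(SENTINEL)          # one marker per consumer
for t in consumers:
    t.join()
```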

Differential Revision: D74184245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152837
Approved by: https://github.com/albanD, https://github.com/dolpm, https://github.com/swolchok
2025-05-07 21:17:42 +00:00
d2ee606e9b [Inductor] Set correct baseline for decomposek test (#152897)
Differential Revision: D74218923

Running on A100 seems to result in precision loss from decompose_k. This was root caused to the fp16/bf16 reduction setting, which establishes a less precise baseline than decompose_k, as decompose_k uses the bmm.dtype overload for fp32 output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152897
Approved by: https://github.com/eellison
2025-05-07 21:02:47 +00:00
1ff3c223d2 [c10d][fr] Make FR vendor neutral so that other backends can use it (#152563)
The current FR code is built under `USE_C10D_NCCL`; we should remove that guard to make it generic. We keep the existing API used by NCCL so that we retain some backward compatibility, because lots of use cases are around FR with NCCL. The generic version with c10::Event can then be used for other backends like Gloo, etc.

The current unit tests should cover the change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152563
Approved by: https://github.com/kwen2501, https://github.com/d4l3k
ghstack dependencies: #152585
2025-05-07 20:37:40 +00:00
642e9305eb Fixes detection of ArmPL on Linux platform (#150031)
On Linux, detection failed to find the bin directory because it wasn't looking for armpl-info, which is the only file in that directory on Linux. This change also links against the math library, which is required when checking for LAPACK functions.

Fixes #149610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150031
Approved by: https://github.com/fadara01, https://github.com/malfet
2025-05-07 19:47:21 +00:00
f5f8f637a5 [Typing] Improve device typing for torch.set_default_device() (#153028)
Part of: #152952

Here is the definition of `torch.types.Device`:

ab997d9ff5/torch/types.py (L74)

So `_Optional[_Union["torch.device", str, builtins.int]]` is equivalent to it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153028
Approved by: https://github.com/Skylion007
2025-05-07 19:31:43 +00:00
dd7d231ed3 [cutlass backend][test] re-enable test_cuda_compile_command for fbcode (#153001)
Differential Revision: [D74284047](https://our.internmc.facebook.com/intern/diff/D74284047/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153001
Approved by: https://github.com/ColinPeppler
2025-05-07 19:06:24 +00:00
62b7ef06cc [Dynamo] Remove unused guard PYMODULE_MATCH (#152961)
Not used anywhere: https://www.internalfb.com/code/search?q=repo%3Afbcode%20PYMODULE_MATCH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152961
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727, #152728, #152730, #152865, #152872
2025-05-07 18:58:18 +00:00
d9b8473b59 [Dynamo] Guard serialization for RANGE_ITERATOR_MATCH (#152872)
Tests serialization for RANGE_ITERATOR_MATCH; includes no non-test changes.

This PR handles iterator exhaustion issues by utilizing the janky solution from #152865; it passes a function to generate kwargs and `frame_state.f_locals` is updated with fresh iterators through a second kwarg generation pass after initial tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152872
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727, #152728, #152730, #152865
2025-05-07 18:58:18 +00:00
52f7106c00 [Dynamo] Guard serialization for TUPLE_ITERATOR_LEN (#152865)
Tests serialization for TUPLE_ITERATOR_LEN; includes no non-test changes.

Passing a tuple iterator as input results in the iterator being exhausted during testing. I threw together a super janky workaround via accepting a func for kwarg generation and replacing `frame_state.f_locals` with newly-generated kwargs to get fresh iterators, but insights into a better approach are welcome!
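
The exhaustion problem and the "generate kwargs from a function" workaround can be reproduced in plain Python. The names `consume` and `make_kwargs` are hypothetical and unrelated to the actual test helpers:

```python
def consume(it):
    # Draining the iterator stands in for tracing a function that reads it.
    return list(it)


shared_kwargs = {"it": iter((1, 2, 3))}
first = consume(**shared_kwargs)
second = consume(**shared_kwargs)   # same iterator object, already exhausted


def make_kwargs():
    # Hypothetical kwarg factory: each call builds a fresh tuple iterator.
    return {"it": iter((1, 2, 3))}


fresh_first = consume(**make_kwargs())
fresh_second = consume(**make_kwargs())
```

Reusing one kwargs dict gives an empty second pass, while calling the factory twice yields the full sequence both times.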

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152865
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727, #152728, #152730
2025-05-07 18:58:18 +00:00
fb500d0b1c [Dynamo] Guard serialization for SEQUENCE_LENGTH (#152730)
Tests only; no other changes needed. Test logic uses a tuple function input to trigger installation of a SEQUENCE_LENGTH guard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152730
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727, #152728
2025-05-07 18:58:18 +00:00
42954ab28e [Dynamo] Guard serialization for CLOSURE_MATCH (#152728)
Unsupported because it uses unsupported FUNCTION_MATCH.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152728
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727
2025-05-07 18:58:18 +00:00
a9186ec723 [Dynamo] Guard serialization for FUNCTION_MATCH (#152727)
Unsupported because it uses unsupported ID_MATCH.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152727
Approved by: https://github.com/jansel
ghstack dependencies: #152725
2025-05-07 18:58:18 +00:00
a6f51be2fd [Dynamo] Guard serialization for NN_MODULE (#152725)
Throws an error when attempting to serialize an NN_MODULE guard. It is not supported because it uses the unsupported ID_MATCH guard (#152330):

a6dd1c2208/torch/_dynamo/guards.py (L1738-L1739)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152725
Approved by: https://github.com/jansel
2025-05-07 18:58:17 +00:00
2cf7fd0d2b Update docs of saved_tensors_hooks to avoid ref cycle (#153049)
Fixes #115255

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153049
Approved by: https://github.com/Skylion007, https://github.com/soulitzer
2025-05-07 18:54:56 +00:00
7cf8049d63 [BE] Update ruamel to 0.18.10 (#153057)
To address the feedback from https://github.com/pytorch/pytorch/pull/153013
Previously it was pinned to 0.17.4, that was released in 2021
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153057
Approved by: https://github.com/Skylion007
ghstack dependencies: #153013
2025-05-07 18:11:14 +00:00
d042ec856b Use gather in index_select (#151715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151715
Approved by: https://github.com/ngimel
2025-05-07 17:55:34 +00:00
eqy
172e641529 [CUDA] Reset peak memory stats before running test_set_per_process_memory_fraction (#152540)
Otherwise previous tests can cause `application = int(total_memory * 0.499) - torch.cuda.max_memory_reserved()` to go negative

Hopefully abates current flakiness (see also https://github.com/pytorch/pytorch/issues/135115#:~:text=TestCuda.test_set_per_process_memory_fraction)
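
A quick arithmetic sketch of how the expression above can go negative. The device size and peak values are illustrative assumptions, with plain integers standing in for `torch.cuda.max_memory_reserved()`:

```python
GiB = 1024 ** 3
total_memory = 16 * GiB              # illustrative device size
budget = int(total_memory * 0.499)   # just under half of the device

stale_peak = 10 * GiB                # peak left behind by an earlier test
reset_peak = 1 * GiB                 # peak once the stats have been reset

application_stale = budget - stale_peak   # negative: unusable allocation size
application_reset = budget - reset_peak   # positive: test can proceed
```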

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152540
Approved by: https://github.com/Skylion007
2025-05-07 17:02:39 +00:00
8b9c9a327f [cutlass backend] cache filtered ops based on layouts (#152580)
Differential Revision: [D73972687](https://our.internmc.facebook.com/intern/diff/D73972687/)

Add cache to store the list of filtered ops for a specific shape + layout + dtype (aka hash on input_nodes).
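
A minimal sketch of such a cache, assuming a plain dict keyed on `(shape, layout, dtype)` and a made-up op list; the real key is a hash of `input_nodes`, which is not reproduced here:

```python
filtered_ops_cache = {}
filter_calls = 0


def expensive_filter(ops, key):
    # Stand-in for the costly filtering step; counts how often it runs.
    global filter_calls
    filter_calls += 1
    _, layout, dtype = key
    return [op for op in ops if op["layout"] == layout and op["dtype"] == dtype]


def get_filtered_ops(ops, shape, layout, dtype):
    key = (tuple(shape), layout, dtype)
    if key not in filtered_ops_cache:
        filtered_ops_cache[key] = expensive_filter(ops, key)
    return filtered_ops_cache[key]


all_ops = [
    {"name": "gemm_a", "layout": "row", "dtype": "fp16"},
    {"name": "gemm_b", "layout": "col", "dtype": "fp16"},
]
first = get_filtered_ops(all_ops, (128, 64), "row", "fp16")
second = get_filtered_ops(all_ops, (128, 64), "row", "fp16")  # cache hit
```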

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152580
Approved by: https://github.com/eellison
2025-05-07 16:38:22 +00:00
61dd2a0cc3 Revert "[BE] Update numba versions (#152557)"
This reverts commit 80d2116405367e1dd11648ab4225d4207d5e6132.

Reverted https://github.com/pytorch/pytorch/pull/152557 on behalf of https://github.com/malfet due to This time it breaks torchbench tests, see 9c114934f7/1(inductor_torc&mergeEphemeralLF=true ([comment](https://github.com/pytorch/pytorch/pull/152557#issuecomment-2858945427))
2025-05-07 15:03:41 +00:00
9c114934f7 [Lint] Add install command for GHA step (#153013)
Otherwise, it fails to run the script
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153013
Approved by: https://github.com/wdvr, https://github.com/cyyever
2025-05-07 14:55:00 +00:00
42b3e560ee Thread through options so GraphPickler can allow all ops (#152801)
Fixes #151904

In #151904 we discussed the feasibility of including all ops in the GraphPickler. This PR changes it so we can filter which ops are allowed and which are blocked.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152801
Approved by: https://github.com/masnesral
2025-05-07 14:36:50 +00:00
f393ee5ab5 Use torch.types.Device in device_interface.py (#152935)
This is just a clean-up change that I noticed was possible; it removes the duplicate `_device_t` type which had the same semantics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152935
Approved by: https://github.com/Skylion007
2025-05-07 13:20:10 +00:00
cyy
2f09e79142 Fix Codegen.cmake warning (#153023)
Fix
```
CMake Warning (dev) in cmake/Codegen.cmake:
  A logical block opening on the line

    /var/lib/jenkins/workspace/cmake/Codegen.cmake:393 (if)

  closes on the line

    /var/lib/jenkins/workspace/cmake/Codegen.cmake:401 (endif)

  with mis-matching arguments.
```
by removing the condition in `endif`.

We could instead fix the condition, but that is not best practice: cmake_lint warns about it, and CMake says
```
The optional <condition> argument is supported for backward compatibility only.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153023
Approved by: https://github.com/aditew01, https://github.com/Skylion007
2025-05-07 12:45:20 +00:00
48bfe9afc7 has_triton: Use the device interface for detecting Triton availability (#139171)
This PR replaces the `has_triton()` global method which was previously used for this task.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139171
Approved by: https://github.com/jansel, https://github.com/shink
2025-05-07 12:23:10 +00:00
56879f64a8 [Break XPU] Fix XPU UT failures introduced by community. (#152945)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152945
Approved by: https://github.com/Skylion007, https://github.com/EikanWang
2025-05-07 08:01:31 +00:00
5c878d4b04 [c10d][fr] Decouple the core logic of FR with the entry and event type (#152585)
We want to make FR generic enough, so the first step is to make FR a template struct so that most of the common logic can be reused. The reason is that CudaEvent does not inherit from c10::Event, and we just want to swap the event part so that NCCL uses CudaEvent and the other backends use c10::Event.

Differential Revision: [D74262695](https://our.internmc.facebook.com/intern/diff/D74262695)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152585
Approved by: https://github.com/kwen2501, https://github.com/d4l3k
2025-05-07 06:21:33 +00:00
93a0a7a0bf Fix bug visualizing 1D Tensor using rich (#152871)
Fixes https://github.com/pytorch/pytorch/issues/152848

I didn't fix the bug earlier because the example script didn't exhaustively present all combinations of 1D/2D tensor, 1D/2D mesh, and all possible sharding specs. Therefore, in this PR, I enriched the example script to cover all possible combinations.

<img width="1008" alt="f" src="https://github.com/user-attachments/assets/1745a804-a004-4f98-8332-d7498453f397" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152871
Approved by: https://github.com/wanchaol
2025-05-07 06:04:22 +00:00
bb9fbb294a [Testing] Add logic for running MPS tests (#153012)
Prep change for getting rid of `_mac-test-mps.yml`
A complete no-op for now, but it will be used by the PR above this one in the stack. The two should land a few days apart to avoid forcing lots of people to rebase their PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153012
Approved by: https://github.com/wdvr
2025-05-07 04:27:31 +00:00
ae1e51b6ad Add infra to run CPython tests under Dynamo (#150787)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150787
Approved by: https://github.com/zou3519
2025-05-07 04:03:14 +00:00
13fbf21a76 [nativert] Port string join and split to c10/util (#152873)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
Port string utils functions join and split to c10/util

Test Plan:
Added tests in `string_util_test.cpp`
buck2 run mode/opt caffe2/c10/test:util_base_tests

Differential Revision: D74202473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152873
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-07 03:58:11 +00:00
5796212d48 [Dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/misc.py [1/2] (#152274)
Part of #147913

Replace `unimplemented` with `unimplemented_v2` in `torch/_dynamo/variables/misc.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152274
Approved by: https://github.com/williamwen42

Co-authored-by: William Wen <william.wen42@gmail.com>
2025-05-07 03:37:24 +00:00
cyy
ab997d9ff5 Pass UNINSTALL_DILL to docker build (#152792)
`UNINSTALL_DILL` was not really passed to docker before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152792
Approved by: https://github.com/wdvr
2025-05-07 03:17:45 +00:00
dfcfad2112 [c10d] Fix unused group input argument in new_subgroups() (#152765)
Summary: This diff fixes an unused input argument [`group`](8faa225695/torch/distributed/distributed_c10d.py (L5341)) in the `new_subgroups()` function.

Test Plan: contbuild & OSS CI, see

Differential Revision: D74132537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152765
Approved by: https://github.com/wz337
2025-05-07 02:37:51 +00:00
ecd74c953f [dynamo] Recursively realize the stack_values (#152853)
Might also fix - https://github.com/pytorch/pytorch/issues/135696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152853
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/jansel
2025-05-07 02:36:44 +00:00
1965a2ca1e [dynamo][ez] Remove unused guard OBJECT_MUTATION. (#152855)
Summary: seems not used anywhere https://www.internalfb.com/code/search?q=case%3Ayes%20filepath%3Acaffe2%20OBJECT_MUTATION

Test Plan: CI

Differential Revision: D74196559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152855
Approved by: https://github.com/jansel, https://github.com/jbschlosser
2025-05-07 02:32:32 +00:00
81b6920c68 [aoti] skip input symbol codegen for sympy expr w/ many symbols (#152579)
Issue was that
- symbol-ids appeared out-of-order w.r.t. the order of the forward inputs
```
def forward(arg0 # [(s3 - 1) + s4, 32], arg1 #[(s3 - 1)] ..)
```
- this causes codegen to fail because it expects all the base symbols `s4,s3` to have been codegen-ed already.
- well, we can skip codegen-ing sympy expr with many symbols e.g. `(s3 - 1) + s4` because `s3` and `s4` will be codegen-ed by other inputs.

```
# for example
s3 = arg1.size(0) + 1
s4 = argN.size(0)
```
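
The skip rule above can be sketched in plain Python. This is an illustration only, not the AOTInductor implementation: size expressions are modelled as strings, and `symbols_in` and `plan_symbol_codegen` are hypothetical helpers.

```python
import re


def symbols_in(expr: str) -> list[str]:
    """Return the distinct symbols (s0, s1, ...) in a size expression, in order."""
    return list(dict.fromkeys(re.findall(r"\bs\d+\b", expr)))


def plan_symbol_codegen(input_sizes: dict[str, list[str]]) -> list[str]:
    """Emit one binding per symbol, using only dims that mention a single symbol."""
    bound: list[str] = []
    lines: list[str] = []
    for arg, dims in input_sizes.items():
        for i, dim in enumerate(dims):
            syms = symbols_in(dim)
            if len(syms) != 1 or syms[0] in bound:
                # Constants, multi-symbol expressions, and already-bound
                # symbols are skipped; another input will bind them.
                continue
            bound.append(syms[0])
            lines.append(f"{syms[0]} solved from {arg}.size({i}) == {dim}")
    return lines


plan = plan_symbol_codegen(
    {"arg0": ["(s3 - 1) + s4", "32"], "arg1": ["s3 - 1"], "argN": ["s4"]}
)
```

Here `arg0` contributes nothing, because its first dimension mentions two symbols and its second is a constant; `s3` and `s4` are bound from `arg1` and `argN`.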

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152579
Approved by: https://github.com/jingsh, https://github.com/desertfire
2025-05-07 01:18:09 +00:00
60ecc560af [export] Add draft-export docs (#152637)
Sample page: https://docs-preview.pytorch.org/pytorch/pytorch/152637/draft_export.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152637
Approved by: https://github.com/zou3519, https://github.com/svekars
2025-05-07 01:12:45 +00:00
a28dcdba2c Revert "[aot][ca] save bw_module in AOTAutogradCache (#151860)"
This reverts commit 613bd462721f3246888030de0a3f6932d52f515a.

Reverted https://github.com/pytorch/pytorch/pull/151860 on behalf of https://github.com/huydhn due to Chatting with @xmfan and decide to revert and reland this instead ([comment](https://github.com/pytorch/pytorch/pull/151860#issuecomment-2856709646))
2025-05-07 00:56:54 +00:00
f6db749e60 Revert "[ca] mark scalar int sizes as dynamic via tensor wrapping (#151731)"
This reverts commit 18229a5300a61b2d76ca95bee8ae8d4f4d5fa938.

Reverted https://github.com/pytorch/pytorch/pull/151731 on behalf of https://github.com/huydhn due to Chatting with @xmfan and decide to revert and reland this instead ([comment](https://github.com/pytorch/pytorch/pull/151860#issuecomment-2856709646))
2025-05-07 00:56:54 +00:00
8f208dc75a Revert "[ca] hide unused scalar int sizes from dynamo (#151962)"
This reverts commit 4555ed8c83b47c450e31f1192e1f0fc4147d435f.

Reverted https://github.com/pytorch/pytorch/pull/151962 on behalf of https://github.com/huydhn due to Chatting with @xmfan and decide to revert and reland this instead ([comment](https://github.com/pytorch/pytorch/pull/151860#issuecomment-2856709646))
2025-05-07 00:56:53 +00:00
64bbf58fb4 Revert "[dynamo][ca] support dynamic annotations on tensors in ListVariables/TupleVariables (#152119)"
This reverts commit 7aebb127bf309658770be93b264d4009c20a7f40.

Reverted https://github.com/pytorch/pytorch/pull/152119 on behalf of https://github.com/huydhn due to Chatting with @xmfan and decide to revert and reland this instead ([comment](https://github.com/pytorch/pytorch/pull/151860#issuecomment-2856709646))
2025-05-07 00:56:53 +00:00
56492bfcb9 [MPS] SDPA specialized kernels (#152781)
Partially fixes #139668 and #152550

Still work in progress. The following needs to be addressed:
- [x] Some tests are failing; need to check why and fix the bugs
- [x] Benchmark the new kernels and add the results to this PR for varying sequence lengths and head dimensions (the ones that get dispatched to kernels)
- [x] Add tests to cover the specialized paths (if applicable)
- [x] Code cleanup

**Tested on Macbook M1 Pro**
### Vector Fast Path (q_len=1, k_len=256)
- Old: 0.378 ms
- New: 0.260 ms
- **31.2% speed improvement**

### Vector 2-pass (q_len=1, k_len=4096)
- Old: 0.627 ms
- New: 0.370 ms
- **41.0% speed improvement**

### Vector Fast Path (q_len=8, k_len=256)
- Old: 0.545 ms
- New: 0.322 ms
- **40.9% speed improvement**

### Vector 2-pass (q_len=8, k_len=4096)
- Old: 1.318 ms
- New: 1.057 ms
- **19.8% speed improvement**
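
The percentages above are consistent with reading "speed improvement" as the relative reduction in latency, `(old - new) / old`. A small sketch that recomputes them from the listed timings (the helper name `latency_reduction_percent` is made up for illustration):

```python
def latency_reduction_percent(old_ms: float, new_ms: float) -> float:
    """Relative reduction in latency, expressed as a percentage of the old time."""
    return (old_ms - new_ms) / old_ms * 100.0


# (old ms, new ms) pairs copied from the results listed above.
timings = {
    "Vector Fast Path (q_len=1, k_len=256)": (0.378, 0.260),
    "Vector 2-pass (q_len=1, k_len=4096)": (0.627, 0.370),
    "Vector Fast Path (q_len=8, k_len=256)": (0.545, 0.322),
    "Vector 2-pass (q_len=8, k_len=4096)": (1.318, 1.057),
}

for name, (old, new) in timings.items():
    print(f"{name}: {latency_reduction_percent(old, new):.1f}% lower latency")
```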

Script to get perf:
```
import torch
import time

def benchmark_sdpa(config, iterations=100):
    device = config.get("device", "cpu")
    batch = config["batch"]
    heads = config["heads"]
    q_len = config["q_len"]
    k_len = config["k_len"]
    head_dim = config["head_dim"]

    q = torch.randn(batch, heads, q_len, head_dim, device=device, dtype=torch.float32)
    k = torch.randn(batch, heads, k_len, head_dim, device=device, dtype=torch.float32)
    v = torch.randn(batch, heads, k_len, head_dim, device=device, dtype=torch.float32)

    for _ in range(5):
        _ = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        if device == "mps":
            torch.mps.synchronize()

    total_time = 0.0
    for i in range(iterations):
        start = time.perf_counter()
        _ = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        if device == "mps":
            torch.mps.synchronize()
        end = time.perf_counter()
        total_time += end - start

    avg_time = total_time / iterations
    print(f"[{config['name']}] Avg time per run: {avg_time * 1000:.3f} ms over {iterations} iterations")
    return avg_time

def main():
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    print(f"Running benchmarks on device: {device}")

    benchmarks = [
        {
            "name": "Vector Fast - Small q_len & moderate k_len",
            "batch": 1,
            "heads": 8,
            "q_len": 1,      # small query sequence length triggers vector fast path
            "k_len": 256,    # moderate key length
            "head_dim": 64,
            "device": device,
        },
        {
            "name": "Vector 2-pass - Small q_len & long k_len",
            "batch": 1,
            "heads": 8,
            "q_len": 1,      # small query sequence length
            "k_len": 4096,   # long key length triggers the 2-pass variant
            "head_dim": 64,
            "device": device,
        },
        # {
        #     "name": "Full Attention - Moderate q_len/k_len",
        #     "batch": 1,
        #     "heads": 8,
        #     "q_len": 128,    # longer query sequence length
        #     "k_len": 8192,    # matching key length for full attention paths
        #     "head_dim": 64,
        #     "device": device,
        # },
        # {
        #     "name": "Full Attention - Longer q_len/k_len",
        #     "batch": 1,
        #     "heads": 8,
        #     "q_len": 128,    # very long sequence length
        #     "k_len": 8192,
        #     "head_dim": 64,
        #     "device": device,
        # },
    ]

    iterations = 100
    for config in benchmarks:
        benchmark_sdpa(config, iterations=iterations)

if __name__ == "__main__":
    main()

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152781
Approved by: https://github.com/malfet
2025-05-07 00:40:11 +00:00
2b2b790908 [Dynamo] Guard serialization for CONSTANT_MATCH (#152724)
This PR adds testing only; no non-test changes were needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152724
Approved by: https://github.com/jansel
ghstack dependencies: #152704
2025-05-07 00:36:39 +00:00
d2935a9f85 [CI] Upgrade sccache to 0.10.0 (#152957)
Newest release handles cuda better, and I think this fixes the cases I saw where some cuda related builds weren't being cached correctly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152957
Approved by: https://github.com/malfet
2025-05-07 00:33:43 +00:00
6d1e8994d3 [Dynamo] Guard serialization for EQUALS_MATCH (#152704)
This PR:
* Makes no changes to non-test code to support serialization for EQUALS_MATCH
* Adds test logic involving a custom-defined constant type to trigger the guard installation here:

72337bdcf2/torch/_dynamo/variables/user_defined.py (L792)

Q: Is there a better way to trigger installation of this guard or is this sufficient?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152704
Approved by: https://github.com/jansel
2025-05-07 00:28:31 +00:00
9919d6b872 [Testing] Add copysign from scalar regression test (#152997)
But instead of adding it just for MPS backend, add it to OpInfo

Fixes https://github.com/pytorch/pytorch/issues/152582
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152997
Approved by: https://github.com/wdvr
2025-05-07 00:19:42 +00:00
327d1b6ef0 Move additional MPS Unary ops to Iterator (#152876)
Noticed some of these ops were contributing to a big chunk of the runtime for OpenLLama as well as a few other benchmarks

At the op level, moving to a TensorIterator-based Metal kernel gives a 20x speedup. Will migrate the inverse trigonometric functions & log ops in a follow-up PR, as this one is already a bit large
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152876
Approved by: https://github.com/malfet
2025-05-07 00:06:54 +00:00
61aa77e216 [cutlass backend][BE][clean-up] refactor to remove use of autotune_fallback_to_aten=True in cutlass backend tests (#152850)
Differential Revision: [D74192001](https://our.internmc.facebook.com/intern/diff/D74192001/)

Motivation: clean up post https://github.com/pytorch/pytorch/issues/147479. I plan to leave the rest of the clean-up as a first-time issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152850
Approved by: https://github.com/chenyang78
2025-05-06 23:48:57 +00:00
5fa5017479 [ONNX] Suggest users setting dynamo=True when exporting (#152478)
Fixes #152025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152478
Approved by: https://github.com/justinchuby
2025-05-06 23:18:11 +00:00
80d2116405 [BE] Update numba versions (#152557)
Let's see if PyTorch is compatible with latest
`test_unary_funcs` are no longer failing due to https://github.com/pytorch/pytorch/pull/148024
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152557
Approved by: https://github.com/Skylion007
2025-05-06 23:15:21 +00:00
911b838aae [Memory Viz] Add Compile Context to Visualizer (#152862)
Summary: Adds PT2 info to visualizer. Also makes sure we have a case when compile context is not in pickle file.

Test Plan: {F1977637362}

Differential Revision: D74202811

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152862
Approved by: https://github.com/aaronenyeshi
2025-05-06 23:09:59 +00:00
6c025b5a82 [dynamo] Support delattr on result of torch.compile(module) (#152741)
This is essentially a follow-up on #122098, where we added support for
`getattr` and `setattr` on result of `torch.compile(module)`, but didn't
add support for `delattr`.
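
A toy sketch of why the third hook matters, using plain Python classes rather than the real `OptimizedModule` (the names `Inner` and `ForwardingWrapper` are hypothetical). A wrapper that forwards reads and writes to the wrapped object also has to forward deletes, otherwise `del wrapper.attr` looks for the attribute on the wrapper itself and fails.

```python
class Inner:
    """Stand-in for the user's original module."""

    def __init__(self):
        self.scale = 2


class ForwardingWrapper:
    """Stand-in for the object returned by compiling a module."""

    def __init__(self, orig):
        # Bypass our own __setattr__ so the reference lives on the wrapper.
        object.__setattr__(self, "_orig_mod", orig)

    def __getattr__(self, name):
        # Only called when normal lookup on the wrapper fails.
        return getattr(self._orig_mod, name)

    def __setattr__(self, name, value):
        setattr(self._orig_mod, name, value)

    def __delattr__(self, name):
        # Without this method, `del wrapper.scale` raises AttributeError.
        delattr(self._orig_mod, name)
```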

Fixes #150711.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152741
Approved by: https://github.com/anijain2305
ghstack dependencies: #152740
2025-05-06 22:30:37 +00:00
0886d402f1 [dynamo] Avoid running torch.nn.Module.__call__ twice under torch.compile(mod) (#152740)
When we do `torch.compile(mod)`, we eventually end up returning a new
module instance, whose `forward` method is the result of
`torch.compile(mod.__call__)`, meaning it already captures all the extra
logic (e.g., hook firing) from the default `torch.nn.Module.__call__`.
As a result we can't reuse the inherited default `__call__` as is,
because we'd end up running the logic twice.

This patch makes the returned `OptimizedModule` override the default
`__call__`, and directly calls into its compiled `forward` method.
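
The double execution can be reproduced with a toy model that does not use PyTorch at all. In this sketch, `ToyModule.__call__` stands in for `torch.nn.Module.__call__`, and assigning `mod.__call__` to `forward` stands in for `torch.compile(mod.__call__)`; all class names are hypothetical.

```python
calls = []  # records every time the "__call__ logic" (e.g. hooks) runs


class ToyModule:
    def forward(self, x):
        return x * 2

    def __call__(self, x):
        calls.append("call-logic")  # stands in for hook firing
        return self.forward(x)


class NaiveOptimized(ToyModule):
    """Inherits the default __call__, so the call logic runs twice."""

    def __init__(self, mod):
        # forward already contains the original module's __call__ logic.
        self.forward = mod.__call__


class FixedOptimized(NaiveOptimized):
    """Overrides __call__ to go straight to the wrapped forward."""

    def __call__(self, x):
        return self.forward(x)
```

Calling `NaiveOptimized(ToyModule())(3)` records the call logic twice, while `FixedOptimized(ToyModule())(3)` records it once; both return the same result.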

Fixes #149502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152740
Approved by: https://github.com/anijain2305
2025-05-06 22:30:37 +00:00
1c30862d8f Partially revert https://github.com/pytorch/pytorch/pull/152288 (#152909)
Summary: The original change results in build failures for some internal targets that are stuck on an older compiler. The platform update is tracked in [T223408150](https://www.internalfb.com/tasks?t=223408150)

Test Plan: CI

Differential Revision: D74220384

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152909
Approved by: https://github.com/cyyever, https://github.com/wdvr
2025-05-06 22:02:42 +00:00
5fe58ab5bd Devcontainer: Optimize apt-get commands to reduce Docker image size (#152882)
## Summary
- Added --no-install-recommends flag to all apt-get install commands to reduce unnecessary dependencies
- Added apt-get clean after package installations to remove package cache and reduce image size
- Combined multiple apt commands into single instructions to reduce Docker image layers

## Test plan
Test by building the devcontainer and verifying functionality while ensuring reduced image size
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152882
Approved by: https://github.com/cyyever, https://github.com/atalman, https://github.com/Skylion007
2025-05-06 20:33:02 +00:00
ed63cb20ec [ROCm] Fix SymmetricMemory build error on NAVI arch (#152838)
NAVI arch doesn't support `__builtin_amdgcn_s_memtime()`, using `clock64()` instead which works for both NAVI and MI archs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152838
Approved by: https://github.com/jeffdaily
2025-05-06 19:37:58 +00:00
8faa0b18c3 [ROCm] opportunistic fastatomics - fix build error with newer compilers (#152841)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152841
Approved by: https://github.com/jeffdaily
2025-05-06 19:37:48 +00:00
1f4f4a61c2 Devcontainer: Replace conda with apt-based setup (#152881)
## Summary
- Replaced miniconda base image with base Ubuntu 22.04 image
- Installed Python and required dependencies using apt
- Replaced conda-based CUDA installation with apt-based version
- Updated paths in install-dev-tools.sh to reflect the new non-conda environment
- Removed conda-specific files and added requirements.txt for Python dependencies

## Test plan
Test by building and running the devcontainer in VS Code with both CPU and CUDA configurations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152881
Approved by: https://github.com/atalman
2025-05-06 19:23:58 +00:00
200df50c05 Devcontainer: Fix context path and workspace mount (#152880)
## Summary
- Changed the devcontainer context path from '../..' to './' for both CPU and CUDA configurations
- Added workspace mount configuration to properly mount the repository in the container
- Added containerEnv to disable implicit --user pip flag

## Test plan
Test by building and running the devcontainer in VS Code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152880
Approved by: https://github.com/atalman
2025-05-06 19:22:29 +00:00
08f5371571 [float16]: Fix the accumulation type for dot and gemv (#152676)
Fixes #147860

Also, partially address: https://github.com/pytorch/pytorch/issues/125438

Use float32 for accumulation with float16 and bfloat16 types
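
To illustrate why the accumulator type matters, here is a standard-library sketch that emulates float16 and float32 rounding with `struct`. It is not the ATen kernel, and the helper names are made up. float16 has an 11-bit significand, so a running sum of ones kept in float16 stops growing at 2048, while a float32 accumulator reaches the exact answer.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE float16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_f32(x: float) -> float:
    """Round a Python float to the nearest IEEE float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot(a, b, round_acc):
    """Dot product of float16 inputs; round_acc sets the accumulator precision."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_acc(acc + to_half(x) * to_half(y))
    return to_half(acc)  # the result is stored back as float16


ones = [1.0] * 4096
half_acc = dot(ones, ones, to_half)  # accumulate in float16
f32_acc = dot(ones, ones, to_f32)    # accumulate in float32
```

In this emulation `half_acc` stalls at 2048.0 because 2048 + 1 rounds back to 2048 in float16, whereas `f32_acc` is 4096.0.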

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152676
Approved by: https://github.com/malfet
2025-05-06 18:10:08 +00:00
7a0781eaad Improve cache key graph printing performance (#151928)
Teach the graph printer how to allow overriding printing SymTypes (`SymInt`, `SymFloat`, `SymBool`) and then use that to reuse the fast SymNode printing from `torch._inductor.utils.sympy_str()` to make computing the cache key faster.

On my computer the repro from #151823 goes from 480s -> 80s (still terrible... but better).

Fixes #151823

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151928
Approved by: https://github.com/laithsakka
2025-05-06 17:39:53 +00:00
7dd9d514d2 [Graph Partition] remove PRECOMPUTED_SIZE from partition symbol inputs (#152864)
PRECOMPUTED_SIZE is computed during runtime and should not be included in graph_partition_inputs. See the following example for a PRECOMPUTED_SIZE `ps0`.

![image](https://github.com/user-attachments/assets/5aa949a9-b8e0-4b77-8702-95b96b58694e)

full output code: [P1803820480](https://www.internalfb.com/phabricator/paste/view/P1803820480)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152864
Approved by: https://github.com/eellison
2025-05-06 17:35:29 +00:00
5d36485b4a Log aot and idx waitcounters. (#152444)
Summary:
Added for create_aot_dispatcher_function and compile_fx_inner.

Note:
Log wait counters flag is already set for:
1. async_compile.precompile
2. remote_fx_graph_cache_get
3. remote_fx_graph_cache_put

Test Plan: contbuild

Differential Revision: D73866124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152444
Approved by: https://github.com/ppanchalia, https://github.com/masnesral
2025-05-06 16:21:58 +00:00
07a29dbe81 [BE]: Update cutlass submodule to 3.9.2 (#152779)
A lot of last minute bugfixes for CUTLASS blackwell that we should upstream. It's a header only library and a minor release so this should strictly improve compiler support and fix some bugs. Needed to update some instruction numbers in torch compile baselines for the new kernels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152779
Approved by: https://github.com/henrylhtsang
2025-05-06 16:08:24 +00:00
f56bcd2408 [precompile] [easy] Refactor FxGraphCache to add cache_hit_post_compile function (#152839)
This PR refactors CompiledFxGraph by adding a new post_compile step that only runs on cache hit. This refactors a bunch of code in _lookup_graph to its own function so that we can use it in BundledAOTAutogradCacheEntry. No difference in behavior here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152839
Approved by: https://github.com/oulgen
ghstack dependencies: #152836
2025-05-06 15:33:24 +00:00
a8f727c439 [c10d] Fix extra CUDA context created by barrier (#149144)
Fixes #149119.

In ProcessGroup.hpp, we create a dummy tensor for dispatching. This
requires a correct device index. This PR uses `device_id` given by user
when calling `init_process_group`.

This PR also uses `torch._C._get_accelerator()` to determine the device
type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149144
Approved by: https://github.com/XilunWu, https://github.com/fduwjj, https://github.com/cyyever
2025-05-06 15:27:30 +00:00
12a8b70247 [precompile] Refactor AOTAutogradCacheEntry to be generic (#152836)
The purpose of this stack is to create a new BundledAOTAutogradCacheEntry, which is an AOTAutogradCacheEntry that is self contained, i.e. it contains all of the CompiledFxGraph directly in the entry, instead of relying on FxGraphCache._lookup_graph.

Because doing this would balloon the size of the actual cache entry, our goal is not to use BundledAOTAutogradCacheEntry in cache scenarios, only for precompile use cases. Thus, it's important that we make this whole setup generic, so that it can support these two workflows clearly.

This PR genericizes AOTAutogradCacheEntry considerably, so that it can take in different types of Forwards and Backwards.

Each GenericAOTAutogradCacheEntry is composed of two parts, a TForward and a TBackward. The forward and backward can be loaded in multiple ways, either via FxGraphCache._lookup_graph, or by saving the entire CompiledFxGraph.

For simplicity, this PR only implements the generic code refactors needed, but does not fully implement BundledAOTAutogradCacheEntry, which is an AOTAutogradCacheEntry that takes a full CompiledForward. We'll handle and implement BundledAOTAutogradCacheEntry in the PR above this, for easier review.
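
The shape of such a generic entry can be sketched with `typing.Generic`. Everything below is hypothetical and simplified: the class names, the `load` method, and the string payloads are illustrative stand-ins, not the actual AOTAutogradCache types.

```python
from dataclasses import dataclass
from typing import Generic, Protocol, TypeVar


class Loadable(Protocol):
    def load(self) -> str: ...


@dataclass
class LookupGraph:
    """Holds only a key; the graph is fetched from a separate cache on load."""
    cache_key: str

    def load(self) -> str:
        return f"graph looked up by key {self.cache_key}"


@dataclass
class BundledGraph:
    """Holds the whole compiled graph inline, so the entry is self-contained."""
    compiled_graph: str

    def load(self) -> str:
        return self.compiled_graph


TForward = TypeVar("TForward", bound=Loadable)
TBackward = TypeVar("TBackward", bound=Loadable)


@dataclass
class GenericCacheEntry(Generic[TForward, TBackward]):
    forward: TForward
    backward: TBackward

    def load_all(self) -> tuple[str, str]:
        # The entry does not care how each half is materialized.
        return self.forward.load(), self.backward.load()
```

The same entry type then serves both workflows: `GenericCacheEntry(LookupGraph("k1"), LookupGraph("k2"))` for caching, and `GenericCacheEntry(BundledGraph("fw"), BundledGraph("bw"))` for precompile.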

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152836
Approved by: https://github.com/oulgen
2025-05-06 15:19:17 +00:00
fcd5e49138 Revert "[dynamo] Recursively realize the stack_values (#152853)"
This reverts commit 460888f908ea4b634ecc863a6da6b2132108bc79.

Reverted https://github.com/pytorch/pytorch/pull/152853 on behalf of https://github.com/malfet due to Looks like it broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/152853#issuecomment-2854897485))
2025-05-06 15:02:57 +00:00
f47bf38e30 [float16]: Fast path for torch.dot with float16/bfloat16 (#152799)
Fixes #152798

Add the fast path for dot with contiguous tensors for float16/bfloat16 types.

Performance with patch (see issue for benchmark and current performance):

![Improved dot performance](https://github.com/user-attachments/assets/57f64e90-8191-4710-adb0-f430644827de)

**We see up to 10x+ improvement in performance.**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152799
Approved by: https://github.com/malfet
2025-05-06 14:59:27 +00:00
b06cbd49f1 [Dynamo] Guard serialization for TENSOR_SUBCLASS_METADATA_MATCH (#152626)
This PR updates `GuardsStatePickler.reducer_override()` in `torch/_dynamo/guards.py` to handle reconstruction of traceable wrapper subclasses. It's intended to work recursively and handle any level of subclass instance nesting (e.g. subclass instances that contain subclass instances, etc.)

This PR tests the guard on several traceable wrapper tensor subclasses:
* `LocalSubclass`: used to ensure the correct error message is thrown when the subclass is not defined globally
* `torch.testing._internal.two_tensor.TwoTensor`: defines None for its extra metadata
* `SubclassWithMeta`: stores non-trivial extra metadata
* `SubclassWithCustomMetadataGuard`: stores non-trivial extra metadata and defines a custom `__metadata_guard__` classmethod
* `SubclassWithSubclassInnerTensors`: used to test recursiveness; this subclass contains subclass inner tensor components

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152626
Approved by: https://github.com/jansel
2025-05-06 14:06:36 +00:00
199d5a408a [partitioner] Fix argument to _broadcast_on_rank0 (#152846)
Summary:
There was a bug when I refactored my original implementation.

This should fix it

Test Plan: Run on some internal workloads

Differential Revision: D74190485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152846
Approved by: https://github.com/danthe3rd
2025-05-06 13:45:59 +00:00
bc11afd41f [Inductor] FX backend via Wrapper IR (#146942)
# Sub-PRs

These PRs contain refactors from the main one. They should be reviewed and merged first.

- https://github.com/pytorch/pytorch/pull/150458
- https://github.com/pytorch/pytorch/pull/152391
- https://github.com/pytorch/pytorch/pull/152587

# Feature

The goals of this PR are twofold.

## Goal 1: Introduce Wrapper IR as an intermediate step in wrapper codegen.

In addition to Triton/C++/Halide kernels, Inductor also generates "wrapper" code which allocates memory and calls the kernels. Originally, this wrapper code was fairly standard Python which resembled a user-written PyTorch program. Over time, various wrapper code generators have been added to accommodate things like AOTInductor, which prefers C++ code for static compilation. This complexity has bled into other parts of the codebase, as we now need if/else statements to choose between Python and C++ macros. (See an example [here](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py#L5515-L5522).) Since most of these code generation steps are conceptually identical across target languages, it seems reasonable to refactor them into some kind of intermediate representation which can be shared between the various backends. This might also make it easier to develop out-of-tree backends which cannot put their own macros in core Inductor components.

This PR takes some initial steps to formalize Inductor's wrapper codegen by generalizing the existing Memory Planning IR into a fully fledged Wrapper IR. This is pretty much identical to the existing Memory Planning IR, but it supports a richer set of ops for things like kernel definitions and calls. This refactor could help encapsulate wrapper codegen. Ideally, we don't need to worry about direct Python/C++ codegen in the main compiler files such as `ir.py`, and can instead defer these to classes like `PythonWrapperCodegen` and `CppWrapperCpu`, which operate on the Wrapper IR.
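
As a rough illustration of the idea (not Inductor's actual classes), wrapper code can be modelled as a list of small IR lines that each backend lowers to its own syntax. The line types `Allocate`, `KernelCall`, and `Free`, and the emitted strings, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Allocate:
    name: str
    shape: tuple


@dataclass
class KernelCall:
    kernel: str
    args: tuple


@dataclass
class Free:
    name: str


def emit_python(lines):
    """Lower IR lines to Python-like wrapper code (illustrative strings only)."""
    out = []
    for line in lines:
        if isinstance(line, Allocate):
            out.append(f"{line.name} = empty({line.shape})")
        elif isinstance(line, KernelCall):
            out.append(f"{line.kernel}({', '.join(line.args)})")
        elif isinstance(line, Free):
            out.append(f"del {line.name}")
    return out


def emit_cpp(lines):
    """Lower the same IR lines to C++-like wrapper code (illustrative strings only)."""
    out = []
    for line in lines:
        if isinstance(line, Allocate):
            dims = ", ".join(str(d) for d in line.shape)
            out.append(f"auto {line.name} = empty({{{dims}}});")
        elif isinstance(line, KernelCall):
            out.append(f"{line.kernel}({', '.join(line.args)});")
        elif isinstance(line, Free):
            out.append(f"{line.name}.reset();")
    return out


program = [
    Allocate("buf0", (8,)),
    KernelCall("triton_add", ("arg0", "arg1", "buf0")),
    Free("arg0"),
]
```

The point of the design is that the code producing `program` no longer needs if/else branches per target language; only the emitters differ.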

## Goal 2: Convert Wrapper IR into FX IR.

One of the main benefits of Wrapper IR is to enable more diverse Inductor backends. This PR introduces a converter from Wrapper IR into [FX IR](https://pytorch.org/docs/stable/fx.html), which is the intermediate representation most commonly used in PyTorch graph compilers. The purpose of this is to enable out-of-tree backends to consume Inductor's output in FX IR, which would hopefully make Inductor easier to leverage in novel compilers, hardware accelerators, etc.

It's not trivial to generate Python or C++ code which Inductor can compile and run, and doing so may require changes to other core Inductor files, for the reasons outlined in the previous section. The goal of supporting FX output is to enable something like `torch.compile`'s [custom backend](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html) system, in which an out-of-tree backend can receive an optimized FX graph from Inductor, and compile and run it however it likes.

The typical users of this feature would likely not be part of PyTorch, and may or may not support running a kernel in eager mode. However, they can understand what `torch.empty_strided` means, compile and run Triton kernels, etc. So we just need to present them with an FX graph saying what code Inductor wants to run, which should be easier to analyze and transform in a third party system than Python or C++ source.

Since FX IR is fairly stable, this mechanism should hopefully isolate third-party backends, hardware accelerators, etc. from the implementation details of Inductor, and vice versa.

# Current status

Things that seem to work:

- Converted a lot of the most common Python codegen lines to Wrapper IR lines.
     - Handled the following cases, in addition to what was already in the Memory Planning IR:
         - Comments
         - Triton kernels
         - Extern/fallback kernels
         - Freeing tensors (`del buf0`)
         - MultiOutput
         - Graph outputs
         - ReinterpretView / StorageBox, for both call args and outputs.
     - FX conversion asserts that the program only contains Wrapper IR lines, and not strings of Python/C++ code.
- Prototype FX converter which can handle some of the most common use cases.
   - Defining Triton kernels, and putting them in a side table using TorchDynamo's existing [utilities](https://dev-discuss.pytorch.org/t/higher-order-operators-2023-10/1565).
   - Calling wrapped Triton kernels.
   - Calling extern kernels and certain types of fallback kernels.
       - Support both `extern_kernels.*` and `aten.*`.
       - Support multi-output kernels like `torch.topk`.
   - Graphs with multiple inputs/outputs.
   - Training i.e. calling `Tensor.backward()` in a compiled function.
   - Graph breaks (training).
- Run the `torch.fx.GraphModule` on GPU using the standard `__call__` method. This makes it easy to test the correctness of FX codegen.

Things that don't work:
- Both Wrapper IR and Wrapper -> FX coverage are currently best effort. There are still features which aren't captured as Wrapper IR lines, and fall back to plain strings. This representation is functionally correct but probably not rich enough to achieve the goals outlined in the previous sections.
         - Fallback kernels seem like the most difficult thing to fully cover, since they each define their own Python/C++ macros that would need to be converted to FX.
         - Size/alignment asserts are currently disabled via the config file. It's possible to generate FX IR for these, but it seems reasonable to defer these sanity checks to a later PR.
         - CommBuffers and distributed communication are not yet supported. An earlier version of this PR attempted to implement this by calling `empty_strided_p2p`. However, building and testing distributed support seems non-trivial, so it's probably better to defer this.

# Out-of-tree compilers

With this PR, out-of-tree backends will be able to do further compilation on the FX graphs by subclassing `WrapperFxCodegen` and overriding the `compile_graph` function. This follows the same API as `torch.compile`'s [custom backends](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html), where the user simply returns a callable running the graph. The callable need not be a method of `GraphModule` or any other PyTorch class. See an example below.

```
from torch._inductor.codegen.wrapper_fxir import WrapperFxCodegen

class MyCustomBackend(WrapperFxCodegen):
     def compile_graph(self, gm):
         # Add 1 to the graph's outputs
         def compiled_fn(*args):
             return [x + 1 for x in gm.forward(*args)]
         return compiled_fn
```

# Example FX graphs

This section contains some example FX graphs generated by Inductor. The correctness of these graphs was verified against eager mode by calling the corresponding `GraphModule`.

Here's an FX graph calling a basic Triton kernel. Notice how outputs are allocated with `torch.empty_strided`, and the Triton kernel is called by reference to Dynamo's Triton side table.
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((8,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, in_ptr1: %arg0_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
    return (buf0,)
```

Here's a more complicated graph that calls a `torch.addmm` extern kernel.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=2] = placeholder[target=arg1_1]
    %buf0 : [num_users=3] = call_function[target=torch.empty_strided](args = ((), ()), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(1,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, out_ptr0: %buf0, xnumel: 1, r0_numel: 129, XBLOCK: 1}})
    %buf2 : [num_users=2] = call_function[target=torch.empty_strided](args = ((129, 1), (1, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
    %addmm : [num_users=0] = call_function[target=torch.addmm](args = (%buf0, %arg0_1, %arg1_1), kwargs = {alpha: 1, beta: 1, out: %buf2})
    %delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
    return (buf2,)
```

Here's a graph which indexes into a tuple using `operator.getitem`. This is necessary to use the output of the `torch.topk` operation.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %buf0 : [num_users=3] = call_function[target=torch.ops.aten.topk.default](args = (%arg0_1, 2), kwargs = {})
    %buf1 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 0), kwargs = {})
    %buf2 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 1), kwargs = {})
    %delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf1, xnumel: 2, XBLOCK: 2}})
    %triton_kernel_wrapper_mutation_1 : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 1, constant_args_idx: 1, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf2, xnumel: 2, XBLOCK: 2}})
    return (buf1, buf2)
```
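
The tuple-indexing pattern above can be reproduced in plain Python. This is a minimal sketch: the `topk` helper below is a hypothetical pure-Python stand-in for a multi-output op, not the `aten.topk` kernel. Only `operator.getitem` is the same callable the graph uses as a node target.

```python
import operator

def topk(values, k):
    """Hypothetical stand-in for a multi-output op: returns (top values, their indices)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in order], order

buf0 = topk([3.0, 9.0, 1.0, 7.0], 2)   # one node producing a tuple of outputs
buf1 = operator.getitem(buf0, 0)        # first output: the values
buf2 = operator.getitem(buf0, 1)        # second output: the indices
print(buf1, buf2)                       # prints [9.0, 7.0] [1, 3]
```

Each output gets its own node, so later nodes can consume `buf1` and `buf2` independently of the tuple that produced them.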

Here's a graph that reinterprets an output tensor using `torch.as_strided`. This is one way to handle Inductor's `ReinterpretView` op.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((2, 4), (4, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg0_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
    %buf0_view_buf0_0 : [num_users=1] = call_function[target=torch.as_strided](args = (%buf0, (8,), (1,), 0), kwargs = {})
    return (buf0_view_buf0_0,)
```
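
To make the reinterpretation concrete, here is a small pure-Python model of strided indexing over flat storage. It is only a sketch: the `as_strided` helper below is written for illustration and copies elements into nested lists, whereas `torch.as_strided` returns a view of the same memory.

```python
def as_strided(storage, size, stride, offset=0):
    """Read a flat storage list through a (size, stride, offset) description."""
    def build(dim, base):
        if dim == len(size):
            return storage[base]
        return [build(dim + 1, base + i * stride[dim]) for i in range(size[dim])]
    return build(0, offset)

storage = list(range(8))                       # stands in for buf0's memory
as_2d = as_strided(storage, (2, 4), (4, 1))    # how buf0 was allocated
as_1d = as_strided(storage, (8,), (1,))        # the reinterpreted graph output
print(as_2d)   # prints [[0, 1, 2, 3], [4, 5, 6, 7]]
print(as_1d)   # prints [0, 1, 2, 3, 4, 5, 6, 7]
```

Both descriptions walk the same eight elements in the same order, which is why the view can be expressed as a single `as_strided` node rather than a copy.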

Here's a graph with dynamic shapes. This one is a little bit funky. Inductor provides a graph input for each shape symbol, which we map to a placeholder, in this example `s6`. Then, shape expressions in the generated code can refer to the symbol `s6`. The size hint for `s6` is stored in `node.meta["val"]` where `node` is the placeholder defining it. This works out in the generated Python code because the placeholder defines a Python variable with the name `s6`.
```
graph():
    %s6 : [num_users=0] = placeholder[target=s6]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((s6,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((-s6)//8)), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s6, XBLOCK: 8}})
    return buf0
```
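
The grid expression `-(((-s6)//8))` in the graph above is the usual floor-division idiom for ceiling division: the number of blocks of `XBLOCK` elements needed to cover `s6` elements. A minimal sketch in plain Python, with the helper name `ceil_div` chosen here for illustration:

```python
import math

def ceil_div(numel: int, block: int) -> int:
    """Ceiling division written with floor division, as in the grid expression."""
    return -((-numel) // block)

# Check the idiom against math.ceil for a range of sizes.
for numel in range(1, 100):
    assert ceil_div(numel, 8) == math.ceil(numel / 8)

print(ceil_div(8, 8), ceil_div(9, 8))   # prints 1 2
```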

Here's another graph, this time with dynamic shapes and strides. The grid expression is more complex since the numel is a product of dimensions.
```
graph():
    %s10 : [num_users=0] = placeholder[target=s10]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ([s10, s10], [s10, 1]), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((s10**2)//(-64))), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s10**2, XBLOCK: 64}})
    return buf0
```
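
The sizes `[s10, s10]` and strides `[s10, 1]` above describe a contiguous row-major buffer, and the grid is again a ceiling division, this time of the product of the dimensions. A rough sketch with a concrete value substituted for the symbol (the value `10` and the helper name `contiguous_strides` are assumptions made for illustration):

```python
from math import prod

def contiguous_strides(shape):
    """Row-major strides: each stride is the product of the sizes to its right."""
    strides = []
    running = 1
    for size in reversed(shape):
        strides.append(running)
        running *= size
    return list(reversed(strides))

s10 = 10                       # hypothetical runtime value of the shape symbol
shape = [s10, s10]
assert contiguous_strides(shape) == [s10, 1]

numel = prod(shape)            # s10**2
grid_x = -(numel // (-64))     # ceiling division, same form as in the graph
print(numel, grid_x)           # prints 100 2
```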

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146942
Approved by: https://github.com/jansel
2025-05-06 10:06:39 +00:00
e32a16a9da Correct torch.xpu.is_bf16_supported return False if no XPU detected (#152317)
# Motivation
Fix https://github.com/pytorch/pytorch/issues/152301
When XPU is not available, calling `torch.xpu.is_bf16_supported()` still returns `True`, which is inconsistent with the expected behavior (should be False).

# Solution
Align with the other backends by adding an `including_emulation` argument to `torch.xpu.is_bf16_supported`, which now:
- returns `False` if XPU is not available
- returns `True` if `including_emulation` is `True`
- returns `torch.xpu.get_device_properties().has_bfloat16_conversions` if `including_emulation` is `False`, which indicates whether the device can generate SPIR-V code for bf16.
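
The three rules above amount to a short decision function. The sketch below takes the device state as plain arguments so it runs without an XPU. The function name, its parameters, and the default of `True` for `including_emulation` are assumptions made for illustration, not the actual signature in `torch.xpu`.

```python
def is_bf16_supported_sketch(
    xpu_available: bool,
    has_bfloat16_conversions: bool,
    including_emulation: bool = True,   # assumed default
) -> bool:
    """Hypothetical model of the decision order described above."""
    if not xpu_available:
        return False                    # no device: never report support
    if including_emulation:
        return True                     # emulated bf16 counts as supported
    return has_bfloat16_conversions     # otherwise defer to the device property

print(is_bf16_supported_sketch(False, True))          # prints False
print(is_bf16_supported_sketch(True, False))          # prints True
print(is_bf16_supported_sketch(True, False, False))   # prints False
```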

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152317
Approved by: https://github.com/EikanWang
2025-05-06 10:03:17 +00:00
8904ba6387 Forward fix D74196435 (#152926)
Summary: Forward fix a misplaced declaration from D74196435

Test Plan: Random check with a failed build `buck2 build --config fbcode.enable_gpu_sections=true --flagfile fbcode//mode/opt fbcode//accelerators/workloads/models/emu_flash/tests:test_compile_eager`

Reviewed By: wdvr

Differential Revision: D74225582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152926
Approved by: https://github.com/cyyever, https://github.com/wdvr
2025-05-06 07:33:38 +00:00
689e14ae00 [NFC] [inductor] [compile async] Warn exception if pickler failed (#152401)
An NFC change to help us find issues

See https://github.com/pytorch/pytorch/issues/151904

CC @aorenste

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152401
Approved by: https://github.com/Skylion007
2025-05-06 07:12:35 +00:00
1dd36ad2d4 Fix conditional git diff in _link_check.yml (#152919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152919
Approved by: https://github.com/huydhn
2025-05-06 07:01:45 +00:00
0e2b948256 Revert "cleanup, refactor and add missing self._dde_suppressed checks (#152657)"
This reverts commit 784c666cae00f85ecf675298ddb056bebaf32f55.

Reverted https://github.com/pytorch/pytorch/pull/152657 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause a test to fail in trunk ([comment](https://github.com/pytorch/pytorch/pull/152657#issuecomment-2853442594))
2025-05-06 06:45:07 +00:00
451d652873 Revert "Make device check error message more descriptive (#150750)"
This reverts commit 8253970a1f90a5b0b1fe0d4febd949470f6fa265.

Reverted https://github.com/pytorch/pytorch/pull/150750 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause a test to fail in trunk ([comment](https://github.com/pytorch/pytorch/pull/150750#issuecomment-2853438985))
2025-05-06 06:42:08 +00:00
460888f908 [dynamo] Recursively realize the stack_values (#152853)
Might also fix - https://github.com/pytorch/pytorch/issues/135696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152853
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/jansel
2025-05-06 06:30:31 +00:00
dd766e1dc5 [audio hash update] update the pinned audio hash (#152885)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152885
Approved by: https://github.com/pytorchbot
2025-05-06 05:29:25 +00:00
784c666cae cleanup, refactor and add missing self._dde_suppressed checks (#152657)
Besides cleanups and refactoring, this PR makes two changes:
1) Do not use propagate_real_tensors to resolve evaluation under guard_or_true/guard_or_false.
2) Do not guard on dimensions of type DimDynamic.OBLIVIOUS_SIZE under guard_or_true/guard_or_false.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152657
Approved by: https://github.com/pianpwk
2025-05-06 05:24:09 +00:00
e2eb845313 [ez] fix a bunch of typos in dynamo (#152886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152886
Approved by: https://github.com/williamwen42
2025-05-06 05:13:56 +00:00
37c71820f3 Fix nn.LazyModuleMixin examples (#150596)
Fixes #150404

## Test Result

![image](https://github.com/user-attachments/assets/e546339f-c1cb-47db-ab0e-276a42c167b8)

![image](https://github.com/user-attachments/assets/298db7ad-6512-4b17-9453-170ff843c4fd)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150596
Approved by: https://github.com/mikaylagawarecki
2025-05-06 05:11:22 +00:00
337895eaaf Run url and xref linters independently (#152899)
Also introduce `skip-xref-lint` label

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152899
Approved by: https://github.com/huydhn
2025-05-06 05:02:32 +00:00
ee0cd1d8b5 Only do shallow clone when checkout nccl (#152826)
Note: `--depth` implies `--single-branch` since git 2.7.6

```sh
git clone https://github.com/NVIDIA/nccl.git
Cloning into 'nccl'...
remote: Enumerating objects: 4205, done.
remote: Counting objects: 100% (238/238), done.
remote: Compressing objects: 100% (122/122), done.
remote: Total 4205 (delta 144), reused 126 (delta 116), pack-reused 3967 (from 3)
Receiving objects: 100% (4205/4205), 4.22 MiB | 7.01 MiB/s, done.
Resolving deltas: 100% (2858/2858), done.
```
```sh
git clone --depth 1 --branch v2.25.1-1 https://github.com/NVIDIA/nccl.git
Cloning into 'nccl'...
remote: Enumerating objects: 249, done.
remote: Counting objects: 100% (249/249), done.
remote: Compressing objects: 100% (227/227), done.
remote: Total 249 (delta 31), reused 111 (delta 15), pack-reused 0 (from 0)
Receiving objects: 100% (249/249), 657.44 KiB | 2.14 MiB/s, done.
Resolving deltas: 100% (31/31), done.
Note: switching to '80f6bda4378b99d99e82b4d76a633791cc45fef0'.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152826
Approved by: https://github.com/albanD
2025-05-06 04:56:19 +00:00
97dfd8dd53 [invoke_subgraph] Run missing graph passes recursively (#152675)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152675
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
ghstack dependencies: #152772, #152770
2025-05-06 02:55:34 +00:00
cc254eaa7c [inductor][refactor] Refactor the fetching of subgraph names (#152770)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152770
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #152772
2025-05-06 02:55:34 +00:00
b1d34acac5 [fx] Recursive DCE on subgraphs (#152772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152772
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
2025-05-06 02:55:34 +00:00
35c727e7ff Fix typo on test_multi_device_context_manager for XPU (#152812)
# Motivation
Align with https://github.com/pytorch/pytorch/pull/152474 and fix the typo in the XPU unit test introduced by https://github.com/pytorch/pytorch/issues/148864
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152812
Approved by: https://github.com/EikanWang, https://github.com/Skylion007
2025-05-06 02:51:19 +00:00
470cd3a995 [aotinductor] Don't alloc weights if they don't exist (#152692)
Fixes https://github.com/pytorch/pytorch/issues/152356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152692
Approved by: https://github.com/henrylhtsang
2025-05-06 02:50:21 +00:00
8253970a1f Make device check error message more descriptive (#150750)
Fixes #122757

## Test Result

```python
import torch

model_output = torch.randn(10, 5).cuda()
labels = torch.randint(0, 5, (10,)).cuda()
weights = torch.randn(5)

loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
loss = loss_fn(input=model_output, target=labels)
print(loss)

Traceback (most recent call last):
  File "/home/zong/code/pytorch/../loss2.py", line 17, in <module>
    loss = loss_fn(input=model_output, target=labels)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/loss.py", line 1297, in forward
    return F.cross_entropy(
           ^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/functional.py", line 3494, in cross_entropy
    return torch._C._nn.cross_entropy_loss(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got weight is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_nll_loss_forward)

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150750
Approved by: https://github.com/mikaylagawarecki
2025-05-06 02:33:20 +00:00
1d7728056b [nativert] Move TensorMeta to pytorch core (#152475)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

This diff moves `TensorMeta.cpp` and `TensorMeta.h` to PyTorch core under `torch/nativert/graph/`

The existing `torch::_export::TensorMeta` in `torch/csrc/utils/generated_serialization_types.h` is auto-generated from the export serde schema and therefore contains only the most basic serializable types. We need the newly added `TensorMeta.cpp` to deserialize the metadata into an in-memory class with c10 types so that it can be consumed by the runtime later.

Test Plan:

Added test under `test/cpp/nativert/test_tensor_meta.cpp`

Differential Revision: D73820548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152475
Approved by: https://github.com/albanD
2025-05-06 01:50:46 +00:00
1798b0db25 Use three-dot diffs in URL and xref lint workflows (#152895)
Only run on the files actually modified in a PR, not every file touched on main since the branch point
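
The difference can be modeled without git. In the sketch below, each snapshot is a hypothetical dict from path to content: a two-dot diff compares the tip of main with the PR head, while a three-dot diff compares the merge base with the PR head. The file names and contents are made up for illustration.

```python
def changed_files(old, new):
    """Paths whose content differs between two snapshots (dict: path -> content)."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))

merge_base = {"a.md": "1", "b.md": "1"}
main_tip   = {"a.md": "2", "b.md": "1"}   # main touched a.md after the branch point
pr_head    = {"a.md": "1", "b.md": "2"}   # the PR only touched b.md

two_dot = changed_files(main_tip, pr_head)      # models `git diff main HEAD`
three_dot = changed_files(merge_base, pr_head)  # models `git diff main...HEAD`
print(two_dot)     # prints ['a.md', 'b.md']
print(three_dot)   # prints ['b.md']
```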

Fixes #152884

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152895
Approved by: https://github.com/huydhn
2025-05-06 01:33:52 +00:00
f097e83369 [inductor][retry] Realize bucketize/searchsorted output (#152858)
**Context**:
bucketize is relatively expensive, computationally. So it's not always profitable to fuse it if it means doing extra computation. For example, this repro:

https://gist.github.com/davidberard98/7fd6af7e6291787c246c705945a25554

shows a slowdown from 56us (eager) to ~100us (torch.compile-d): instead of computing 2\*\*15 binary searches, the fused version does 2\*\*15 * 384 - one for each of the broadcasted outputs.

**Solution**:
Realize the output of bucketize (and searchsorted, which also uses inductor's ops.bucketize). If there's an opportunity to do non-broadcasted fusions, the scheduler can still apply such fusions later on.

After this PR, instead of a slowdown, we see an improvement from 56us (eager) to 33us (compiled).
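
The effect of realizing the output can be illustrated by counting binary searches in a small pure-Python model. This is only a sketch: `bisect.bisect_left` stands in for bucketize, the `run` helper and the toy sizes are assumptions, and it models the amount of work rather than Inductor's scheduling.

```python
import bisect

def run(values, boundaries, broadcast, realize):
    """Return (output, number of binary searches performed)."""
    searches = 0

    def bucketize(v):
        nonlocal searches
        searches += 1
        return bisect.bisect_left(boundaries, v)

    if realize:
        # Search once per input value, then reuse the result for every broadcast column.
        buckets = [bucketize(v) for v in values]
        out = [[b + j for j in range(broadcast)] for b in buckets]
    else:
        # Fused: the search is recomputed for every broadcast output element.
        out = [[bucketize(v) + j for j in range(broadcast)] for v in values]
    return out, searches

values = [0.1, 0.4, 0.6, 0.9]
boundaries = [0.25, 0.5, 0.75]
fused_out, fused_searches = run(values, boundaries, broadcast=3, realize=False)
real_out, real_searches = run(values, boundaries, broadcast=3, realize=True)
assert fused_out == real_out
print(fused_searches, real_searches)   # prints 12 4

# The counts for the repro described above:
print(2**15, 2**15 * 384)              # prints 32768 12582912
```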

**Retry**
Original PR (https://github.com/pytorch/pytorch/pull/152644) was reverted due to internal bisect blaming this change, but the bisect was a false positive (and is marked as such)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152858
Approved by: https://github.com/aakhundov
2025-05-06 01:32:26 +00:00
14f8066910 Ensure mxfp8 scaled_mm works w/ max-autotune (#152744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152744
Approved by: https://github.com/Skylion007
2025-05-06 01:16:57 +00:00
cyy
ac792a0dca [submodule] Bump ITTAPI to 3.25.5 (#150263)
It hasn't been updated for 3 years. And also to remove CMake 4 workaround.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150263
Approved by: https://github.com/sraikund16
2025-05-06 01:02:18 +00:00
721fdfa32d [ez] Fsspec Filesystem ls details should be false (#152693)
Summary: For the local filesystem, `ls` defaults to `detail=False`, but this isn't the case for all filesystems (e.g. Hugging Face). Set `detail=False` explicitly so that `ls` returns a list of strings rather than the list of dictionaries it would return with `detail=True`.

Test Plan: tested in notebook

Differential Revision: D74080572

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152693
Approved by: https://github.com/joecummings
2025-05-06 01:02:13 +00:00
4979ca5ffa Synchronize in foreach tests after profiling (#152857)
After the CI change from 12.4 -> 12.6 around mid-March, the foreach tests have been flaky and hard to repro due to nondeterminism. Per @davidberard98's suggestion, let's try to add a synchronize before checking profiler results to see whether this fixes the flake! The hope is that the 48 currently open foreach flaky issues will close from this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152857
Approved by: https://github.com/davidberard98
2025-05-06 00:56:48 +00:00
13dcf80a53 [dynamic shapes] use try-catch instead of guard_or_true for reshape_view_helper (#152638)
Test Plan: test_export

Differential Revision: D74033649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152638
Approved by: https://github.com/laithsakka
2025-05-06 00:54:24 +00:00
d197228d43 Revert "[CI] Use cmake from pip instead of conda in CI docker images (#152537)"
This reverts commit 3196a3aca0f16792820158cfd451cb977f99ac7e.

Reverted https://github.com/pytorch/pytorch/pull/152537 on behalf of https://github.com/huydhn due to We need signals from inductor, cmake version from pip is too old? ([comment](https://github.com/pytorch/pytorch/pull/152537#issuecomment-2852820175))
2025-05-06 00:22:23 +00:00
103fe856e1 Revert "Add infra to run CPython tests under Dynamo (#150787)"
This reverts commit 7c96dd8f0c9a7e17f598612405f002441c7f07ae.

Reverted https://github.com/pytorch/pytorch/pull/150787 on behalf of https://github.com/huydhn due to Sorry for reverting your change but a failed test is showing up in trunk ([comment](https://github.com/pytorch/pytorch/pull/150787#issuecomment-2852818113))
2025-05-06 00:20:02 +00:00
0e9874849f [BE]: Update torch core lazy helpers with micropts (#152778)
Some minor nits I noticed. Use reserve when possible
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152778
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-05-06 00:03:51 +00:00
fd57c16285 Avoid triggering ignored requires_grad warning in our code (#152686)
This one is ok to silence as we're just doing formatting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152686
Approved by: https://github.com/Skylion007
2025-05-05 23:56:40 +00:00
125a3eee5c [ez] Use pip instead of conda in run_tests.sh (#152860)
Part 1 of https://github.com/pytorch/pytorch/issues/148336.  The rest depends on https://github.com/pytorch/pytorch/issues/148335 to remove conda from Docker build process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152860
Approved by: https://github.com/atalman
2025-05-05 23:06:55 +00:00
e3064bf0e3 [inductor] Allow num_program specification for TMA workspace (#152844)
Summary:
Allow TMA workspace creation to take a `num_programs` argument, which defaults to `num_sms` when not specified.

A kernel needs a total of `num_programs * num_tma_descriptors` descriptors.
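
As a quick illustration of the count, here is a sketch with a hypothetical helper. The function name and the example numbers are assumptions, and the real workspace code also has to account for per-descriptor storage, which is not modeled here.

```python
from typing import Optional

def total_tma_descriptors(
    num_tma_descriptors: int,
    num_sms: int,
    num_programs: Optional[int] = None,
) -> int:
    """Hypothetical helper: descriptors needed across all programs of one kernel."""
    if num_programs is None:
        num_programs = num_sms          # default described above
    return num_programs * num_tma_descriptors

print(total_tma_descriptors(2, num_sms=100))                   # prints 200
print(total_tma_descriptors(2, num_sms=100, num_programs=64))  # prints 128
```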

Test Plan: CI.

Differential Revision: D74189599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152844
Approved by: https://github.com/drisspg
2025-05-05 23:02:55 +00:00
cc954848d4 Revert "[c10d] Fix extra CUDA context created by barrier (#149144)"
This reverts commit 457fa820ad538c7aeadb68f0ec418d63972ba1ee.

Reverted https://github.com/pytorch/pytorch/pull/149144 on behalf of https://github.com/huydhn due to Internal failure looks legit ([comment](https://github.com/pytorch/pytorch/pull/149144#issuecomment-2852564660))
2025-05-05 22:56:50 +00:00
2ce6d169fc [IR] Input Adapter refactor prototype (#152459) (#152575)
Summary:

1. Adding `input` field to `_adapt_flat_args` function
2. In `process_forward_inputs`, `reorder_kwargs` will now do nothing if no kwargs are provided (previously would error)
3. Pass `args` as input to `_adapt_flat_args`

These changes are made to update the InputAdapter

see more context in D73811508

Test Plan: see D73811508

Differential Revision: D73945419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152575
Approved by: https://github.com/angelayi
2025-05-05 22:51:58 +00:00
a2ccda3c60 [pytorch][PR][inductor] Fix one instance of launch_enter_hook (#152831)
Summary: One usage seems to have been missed in https://github.com/pytorch/pytorch/pull/152457

Test Plan: EMS local benchmark

Differential Revision: D74159749

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152831
Approved by: https://github.com/danzimm
2025-05-05 22:15:47 +00:00
2b4fe9fa14 [Autotune Cache] Fix the bug of using the wrong key for recording artifacts in CacheArtifactManager (#152678)
Summary: Replace the key (path) from `<hash>.best_config` to `<parent_dir>/<hash>.best_config` to ensure that Autotune artifacts in MegaCache are loaded to the correct location locally.
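
A small model of why the key matters when artifacts are restored under a cache root. The directory layout, file name, and helper names below are hypothetical and chosen only to show the two key shapes. This is not the `CacheArtifactManager` API.

```python
from pathlib import PurePosixPath

def artifact_key(path: str, include_parent: bool) -> str:
    """Key recorded for an artifact: file name alone, or parent directory plus file name."""
    p = PurePosixPath(path)
    return f"{p.parent.name}/{p.name}" if include_parent else p.name

def restore_location(cache_root: str, key: str) -> str:
    """Where an artifact with this key lands when written back under cache_root."""
    return str(PurePosixPath(cache_root) / key)

original = "/cache/ab/abcdef.best_config"   # hypothetical on-disk location
old_key = artifact_key(original, include_parent=False)
new_key = artifact_key(original, include_parent=True)
print(restore_location("/cache", old_key))  # prints /cache/abcdef.best_config
print(restore_location("/cache", new_key))  # prints /cache/ab/abcdef.best_config
```

In this model, only the key that keeps the parent directory restores the file to the location it was read from.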

Test Plan: NA

Differential Revision: D74052400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152678
Approved by: https://github.com/oulgen
2025-05-05 21:03:10 +00:00
d547c7e10d [fbgemm] Implement __obj_flatten__ for LinearPackedParamsBase (#152619)
Differential Revision: D73991241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152619
Approved by: https://github.com/jerryzh168, https://github.com/houseroad
2025-05-05 20:58:25 +00:00
22d1359bc6 Move warning from item to specific number conversions (#152709)
Follow up to https://github.com/pytorch/pytorch/pull/143261 to not warn when a plain .item() is done.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152709
Approved by: https://github.com/malfet, https://github.com/ngimel
2025-05-05 20:46:05 +00:00
3bc69cc08d Document that dampening is skipped in SGD momentum first step (#152833)
Pointed out by https://x.com/hi_tysam/status/1917318692276174977/photo/2.

It would be BC breaking to change this behavior 7 years after it has been decided, so we are documenting it first at the very least.
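
The behavior being documented can be sketched with scalars. This is a simplified model of the momentum buffer update only (no weight decay, Nesterov, or parameter update), and the function name is an assumption made for illustration.

```python
def momentum_buffers(grads, momentum, dampening):
    """Momentum buffer after each step; the first step copies the raw gradient."""
    buf = None
    history = []
    for g in grads:
        if buf is None:
            buf = g                                      # first step: no (1 - dampening) factor
        else:
            buf = momentum * buf + (1 - dampening) * g   # later steps: dampened gradient
        history.append(buf)
    return history

print(momentum_buffers([4.0, 4.0], momentum=0.5, dampening=0.75))   # prints [4.0, 3.0]
```

If dampening were applied on the first step as well, the first buffer in this example would be `1.0` instead of `4.0`.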

<img width="642" alt="image" src="https://github.com/user-attachments/assets/3febcb07-e0ed-44a1-bd3b-a8e685711cb4" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152833
Approved by: https://github.com/albanD
2025-05-05 20:07:23 +00:00
99dac7005f Revert "[Inductor] FX backend via Wrapper IR (#146942)"
This reverts commit a7691140a0fed33a838dda11e28ff7da393d9180.

Reverted https://github.com/pytorch/pytorch/pull/146942 on behalf of https://github.com/malfet due to Looks like it indeed breaks lint, see a7691140a0/1 ([comment](https://github.com/pytorch/pytorch/pull/146942#issuecomment-2852192778))
2025-05-05 20:01:29 +00:00
a7691140a0 [Inductor] FX backend via Wrapper IR (#146942)
# Sub-PRs

These PRs contain refactors from the main one. They should be reviewed and merged first.

- https://github.com/pytorch/pytorch/pull/150458
- https://github.com/pytorch/pytorch/pull/152391
- https://github.com/pytorch/pytorch/pull/152587

# Feature

The goals of this PR are twofold.

## Goal 1: Introduce Wrapper IR as an intermediate step in wrapper codegen.

In addition to Triton/C++/Halide kernels, Inductor also generates "wrapper" code which allocates memory and calls the kernels. Originally, this wrapper code was fairly standard Python which resembled a user-written PyTorch program. Over time, various wrapper code generators have been added to accommodate things like AOTInductor, which prefers C++ code for static compilation. This complexity has bled into other parts of the codebase, as we now need if/else statements to choose between Python and C++ macros. (See an example [here](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py#L5515-L5522).) Since most of these code generation steps are conceptually identical across target languages, it seems reasonable to refactor them into some kind of intermediate representation which can be shared between the various backends. This might also make it easier to develop out-of-tree backends which cannot put their own macros in core Inductor components.

This PR takes some initial steps to formalize Inductor's wrapper codegen by generalizing the existing Memory Planning IR into a fully fledged Wrapper IR. This is pretty much identical to the existing Memory Planning IR, but it supports a richer set of ops for things like kernel definitions and calls. This refactor could help encapsulate wrapper codegen. Ideally, we don't need to worry about direct Python/C++ codegen in the main compiler files such as `ir.py`, and can instead defer these to classes like `PythonWrapperCodegen` and `CppWrapperCpu`, which operate on the Wrapper IR.
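
The idea of a shared wrapper-level IR can be sketched in a few lines. The classes and emitters below are hypothetical and deliberately tiny. They are not Inductor's actual Wrapper IR lines or code generators, and the emitted strings are illustrative rather than real wrapper code.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Allocate:            # hypothetical IR line: allocate a buffer
    name: str
    shape: Tuple[int, ...]

@dataclass
class CallKernel:          # hypothetical IR line: launch a kernel
    kernel: str
    args: Tuple[str, ...]

@dataclass
class Free:                # hypothetical IR line: release a buffer
    name: str

def emit_python(lines):
    """One backend: turn IR lines into Python-flavored source strings."""
    out = []
    for line in lines:
        if isinstance(line, Allocate):
            out.append(f"{line.name} = empty({line.shape})")
        elif isinstance(line, CallKernel):
            out.append(f"{line.kernel}({', '.join(line.args)})")
        elif isinstance(line, Free):
            out.append(f"del {line.name}")
    return out

def emit_cpp(lines):
    """Another backend: same IR, C++-flavored source strings."""
    out = []
    for line in lines:
        if isinstance(line, Allocate):
            dims = ", ".join(str(d) for d in line.shape)
            out.append(f"auto {line.name} = empty({{{dims}}});")
        elif isinstance(line, CallKernel):
            out.append(f"{line.kernel}({', '.join(line.args)});")
        elif isinstance(line, Free):
            out.append(f"{line.name}.reset();")
    return out

program = [Allocate("buf0", (8,)), CallKernel("kernel0", ("arg0", "buf0")), Free("buf0")]
print(emit_python(program))
print(emit_cpp(program))
```

The program is built once and each backend decides how to render it, so the code that constructs the lines needs no if/else on the target language.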

## Goal 2: Convert Wrapper IR into FX IR.

One of the main benefits of Wrapper IR is to enable more diverse Inductor backends. This PR introduces a converter from Wrapper IR into [FX IR](https://pytorch.org/docs/stable/fx.html), which is the intermediate representation most commonly used in PyTorch graph compilers. The purpose of this is to enable out-of-tree backends to consume Inductor's output in FX IR, which would hopefully make Inductor easier to leverage in novel compilers, hardware accelerators, etc.

It's not trivial to generate Python or C++ code which Inductor can compile and run, and doing so may require changes to other core Inductor files, for the reasons outlined in the previous section. The goal of supporting FX output is to enable something like `torch.compile`'s [custom backend](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html) system, in which an out-of-tree backend can receive an optimized FX graph from Inductor, and compile and run it however it likes.

The typical users of this feature would likely not be part of PyTorch, and may or may not support running a kernel in eager mode. However, they can understand what `torch.empty_strided` means, compile and run Triton kernels, etc. So we just need to present them with an FX graph saying what code Inductor wants to run, which should be easier to analyze and transform in a third party system than Python or C++ source.

Since FX IR is fairly stable, this mechanism should hopefully isolate third-party backends, hardware accelerators, etc. from the implementation details of Inductor, and vice versa.

# Current status

Things that seem to work:

- Converted a lot of the most common Python codegen lines to Wrapper IR lines.
     - Handled the following cases, in addition to what was already in the Memory Planning IR:
         - Comments
         - Triton kernels
         - Extern/fallback kernels
         - Freeing tensors (`del buf0`)
         - MultiOutput
         - Graph outputs
         - ReinterpretView / StorageBox, for both call args and outputs.
     - FX conversion asserts that the program only contains Wrapper IR lines, and not strings of Python/C++ code.
- Prototype FX converter which can handle some of the most common use cases.
   - Defining Triton kernels, and putting them in a side table using TorchDynamo's existing [utilities](https://dev-discuss.pytorch.org/t/higher-order-operators-2023-10/1565).
   - Calling wrapped Triton kernels.
   - Calling extern kernels and certain types of fallback kernels.
       - Support both `extern_kernels.*` and `aten.*`.
       - Support multi-output kernels like `torch.topk`.
   - Graphs with multiple inputs/outputs.
   - Training, i.e. calling `Tensor.backward()` in a compiled function.
   - Graph breaks (training).
- Run the `torch.fx.GraphModule` on GPU using the standard `__call__` method. This makes it easy to test the correctness of FX codegen.

Things that don't work:
- Both Wrapper IR and Wrapper -> FX coverage are currently best effort. There are still features which aren't captured as Wrapper IR lines, and fall back to plain strings. This representation is functionally correct but probably not rich enough to achieve the goals outlined in the previous sections.
         - Fallback kernels seem like the most difficult thing to fully cover, since they each define their own Python/C++ macros that would need to be converted to FX.
         - Size/alignment asserts are currently disabled via the config file. It's possible to generate FX IR for these, but it seems reasonable to defer these sanity checks to a later PR.
         - CommBuffers and distributed communication are not yet supported. An earlier version of this PR attempted to implement this by calling `empty_strided_p2p`. However, building and testing distributed support seems non-trivial, so it's probably better to defer this.

# Out-of-tree compilers

With this PR, out-of-tree backends will be able to do further compilation on the FX graphs by subclassing `WrapperFxCodegen` and overriding the `compile_graph` function. This follows the same API as torch.compile's [custom backends](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html), where the user simply returns a callable running the graph. The callable need not be a method of `GraphModule` or any other PyTorch class. See an example below.

```
from torch._inductor.codegen.wrapper_fxir import WrapperFxCodegen

class MyCustomBackend(WrapperFxCodegen):
     def compile_graph(self, gm):
         # Add 1 to the graph's outputs
         def compiled_fn(*args):
             return [x + 1 for x in gm.forward(*args)]
         return compiled_fn
```

# Example FX graphs

This section contains some example FX graphs generated by Inductor. The correctness of these graphs was verified against eager mode by calling the corresponding `GraphModule`.

Here's an FX graph calling a basic Triton kernel. Notice how outputs are allocated with `torch.empty_strided`, and the Triton kernel is called by reference to Dynamo's triton side table.
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((8,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, in_ptr1: %arg0_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
    return (buf0,)
```

Here's a more complicated graph that calls a `torch.addmm` extern kernel.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=2] = placeholder[target=arg1_1]
    %buf0 : [num_users=3] = call_function[target=torch.empty_strided](args = ((), ()), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(1,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, out_ptr0: %buf0, xnumel: 1, r0_numel: 129, XBLOCK: 1}})
    %buf2 : [num_users=2] = call_function[target=torch.empty_strided](args = ((129, 1), (1, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
    %addmm : [num_users=0] = call_function[target=torch.addmm](args = (%buf0, %arg0_1, %arg1_1), kwargs = {alpha: 1, beta: 1, out: %buf2})
    %delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
    return (buf2,)
```

Here's a graph which indexes into a tuple using `operator.getitem`. This is necessary to use the output of the `torch.topk` operation.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %buf0 : [num_users=3] = call_function[target=torch.ops.aten.topk.default](args = (%arg0_1, 2), kwargs = {})
    %buf1 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 0), kwargs = {})
    %buf2 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 1), kwargs = {})
    %delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf1, xnumel: 2, XBLOCK: 2}})
    %triton_kernel_wrapper_mutation_1 : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 1, constant_args_idx: 1, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf2, xnumel: 2, XBLOCK: 2}})
    return (buf1, buf2)
```

Here's a graph that reinterprets an output tensor using `torch.as_strided`. This is one way to handle Inductor's `ReinterpretView` op.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((2, 4), (4, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg0_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
    %buf0_view_buf0_0 : [num_users=1] = call_function[target=torch.as_strided](args = (%buf0, (8,), (1,), 0), kwargs = {})
    return (buf0_view_buf0_0,)
```
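
To make the reinterpretation concrete, here is a minimal pure-Python sketch of strided indexing, assuming a flat list stands in for the tensor's storage. The helper `strided_read` is hypothetical and only models how a (size, stride, offset) triple maps indices onto storage; it is not how `torch.as_strided` is implemented.

```python
from itertools import product


def strided_read(storage, size, stride, offset=0):
    """Return the elements a (size, stride, offset) view sees, in row-major order."""
    out = []
    for index in product(*(range(n) for n in size)):
        # Each index is mapped to a flat storage position via the strides.
        flat = offset + sum(i * s for i, s in zip(index, stride))
        out.append(storage[flat])
    return out


storage = list(range(8))  # stand-in for the 8 floats backing buf0

# buf0 as allocated in the graph: size (2, 4), stride (4, 1)
as_2x4 = strided_read(storage, (2, 4), (4, 1))

# the reinterpreted output: size (8,), stride (1,), offset 0
as_flat = strided_read(storage, (8,), (1,), 0)
```

Both views walk the same storage in the same order, which is why the flat view can be returned without copying any data.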

Here's a graph with dynamic shapes. This one is a little bit funky. Inductor provides a graph input for each shape symbol, which we map to a placeholder, in this example `s6`. Then, shape expressions in the generated code can refer to the symbol `s6`. The size hint for `s6` is stored in `node.meta["val"]`, where `node` is the placeholder defining it. This works out in the generated Python code because the placeholder defines a Python variable with the name `s6`.
```
graph():
    %s6 : [num_users=0] = placeholder[target=s6]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((s6,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((-s6)//8)), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s6, XBLOCK: 8}})
    return buf0
```

Here's another graph, this time with dynamic shapes and strides. The grid expression is more complex since the numel is a product of dimensions.
```
graph():
    %s10 : [num_users=0] = placeholder[target=s10]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ([s10, s10], [s10, 1]), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((s10**2)//(-64))), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s10**2, XBLOCK: 64}})
    return buf0
```
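
The grid expressions in the two dynamic-shape graphs above are both ceiling divisions written with Python's floor division. The sketch below checks that, assuming plain Python integers in place of the symbolic sizes; the function names are hypothetical and exist only for this illustration.

```python
def grid_neg_numerator(numel, block):
    # Form seen in the first graph: -((-s6) // 8)
    return -((-numel) // block)


def grid_neg_denominator(numel, block):
    # Form seen in the second graph: -((s10**2) // (-64))
    return -(numel // (-block))


def ceil_div(numel, block):
    # The conventional way to write "number of blocks needed to cover numel"
    return (numel + block - 1) // block


# Example: 9 elements with XBLOCK = 8 need 2 program instances.
blocks_for_9 = grid_neg_numerator(9, 8)
```

Because floor division rounds toward negative infinity, negating either operand and then negating the result rounds the quotient up instead of down.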

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146942
Approved by: https://github.com/jansel
2025-05-05 19:34:49 +00:00
fdadda21b6 Revert "[float16]: Fast path for torch.dot with float16/bfloat16 (#152799)"
This reverts commit d57bf53225004a684952222722a4f7322a21a596.

Reverted https://github.com/pytorch/pytorch/pull/152799 on behalf of https://github.com/malfet due to This broke C10_MOBILE builds, not sure why it was not surfaced on pull, see a766c1d117/1 ([comment](https://github.com/pytorch/pytorch/pull/152799#issuecomment-2852084433))
2025-05-05 19:17:59 +00:00
a766c1d117 [nativert] move intrusive list to c10/util (#152754)
Summary:
nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed.

This diff moves intrusive list to c10/util

Test Plan: CI

Differential Revision: D74104595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152754
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-05 18:49:56 +00:00
51e77f3b30 [dynamo] replace unimplemented with unimplemented_v2 in variables/torch_functions.py (#151278)
This addresses part of #147913.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151278
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
ghstack dependencies: #151277
2025-05-05 18:45:40 +00:00
9e24f9b523 [dynamo] replace unimplemented with unimplemented_v2 in variables/functions.py (#151277)
This addresses part of #147913.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151277
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
2025-05-05 18:45:40 +00:00
d57bf53225 [float16]: Fast path for torch.dot with float16/bfloat16 (#152799)
Fixes #152798

Add the fast path for dot with contiguous tensors for float16/bfloat16 types.

Performance with patch (see issue for benchmark and current performance):

![Improved dot performance](https://github.com/user-attachments/assets/57f64e90-8191-4710-adb0-f430644827de)

**We see up to 10x+ improvement in performance.**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152799
Approved by: https://github.com/malfet
2025-05-05 18:29:39 +00:00
172a7c942e Revert "Log aot and idx waitcounters. (#152444)"
This reverts commit ea9ea029595a5f628fdd368a6e1dd76e95707161.

Reverted https://github.com/pytorch/pytorch/pull/152444 on behalf of https://github.com/jovianjaison due to needs a fix ([comment](https://github.com/pytorch/pytorch/pull/152444#issuecomment-2851905261))
2025-05-05 18:11:37 +00:00
136ee4c81b Make assertion about pass callable print the bad pass (#152654)
If you passed an invalid string now you can easily see what it is

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152654
Approved by: https://github.com/eellison
2025-05-05 18:07:43 +00:00
fd6d4a6a24 [dynamo] Guard serialization for DICT_KEYS_MATCH (#152723)
DICT_KEYS_MATCH

Differential Revision: [D74091886](https://our.internmc.facebook.com/intern/diff/D74091886/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152723
Approved by: https://github.com/jansel
ghstack dependencies: #152615, #152616, #152687, #152716, #152721
2025-05-05 18:05:56 +00:00
2da9ab4b1c [dynamo] Guard serialization for MAPPING_KEYS_CHECK (#152721)
MappingProxyType

Differential Revision: [D74091363](https://our.internmc.facebook.com/intern/diff/D74091363/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152721
Approved by: https://github.com/jansel
ghstack dependencies: #152615, #152616, #152687, #152716
2025-05-05 18:05:56 +00:00
24e1666b3a [dynamo] Guard serialization for WEAKREF_ALIVE (#152716)
Punt on WEAKREF_ALIVE, as weakrefs won't survive across processes and users might need to drop them upfront.

Differential Revision: [D74088735](https://our.internmc.facebook.com/intern/diff/D74088735/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152716
Approved by: https://github.com/jansel
ghstack dependencies: #152615, #152616, #152687
2025-05-05 18:05:56 +00:00
2cb16df6e2 [dynamo] Guard serialization for DUPLICATE_INPUT. (#152687)
This guard does not seem to be very active. Adding a test to at least exercise the error handling.

Differential Revision: [D74074837](https://our.internmc.facebook.com/intern/diff/D74074837/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152687
Approved by: https://github.com/jansel
ghstack dependencies: #152615, #152616
2025-05-05 18:05:56 +00:00
ffd58293f7 [dynamo] Guard serialization for FUNCTORCH_STACK_MATCH (#152616)
Make Functorch interpreters serializable most of the time, so that we can save the guards on functorch states.

## Test Cases:

0. torch.compile() without functorch layers present. Guard should fail with any layer being pushed.
1. torch.compile() nested in vmap.
2. torch.compile() nested in grad.
3. torch.compile() nested in jvp + vmap
4. torch.compile() nested functionalize
5. torch.compile() nested in vmap + grad

Differential Revision: [D74008787](https://our.internmc.facebook.com/intern/diff/D74008787/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152616
Approved by: https://github.com/zou3519
ghstack dependencies: #152615
2025-05-05 18:05:56 +00:00
1d1cbcd8a3 [dynamo] Guard serialization for DUAL LEVEL. (#152615)
It seems the dual level counter should be stored in OutputGraph so that its value is preserved through round-tripping.

Differential Revision: [D74008786](https://our.internmc.facebook.com/intern/diff/D74008786/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152615
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-05-05 18:05:56 +00:00
0145f9e29e [CI] docker images use tags instead of image name (#152209)
Change CI docker images to be `ci-image:<image name>-<folder sha>` instead of `<image name>:<folder sha>` so we never have to make a new ecr repo ever again

Pros:
never have to make a new ecr repo ever again
Cons:
if it aint broken, dont fix it?

Don't need to change linux-test images since they use the "full name" of the image with the docker registry and the tag

In order to prevent others needing to rebase past this PR, also push the image to the "old name".  This can be removed after this PR has been in main for a while
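
A minimal sketch of the two naming schemes described above. The registry host, image name, and sha used here are placeholders for illustration and are not taken from the CI configuration.

```python
def old_image_ref(registry, image_name, folder_sha):
    # Old scheme: one repository per image, tagged with the folder sha.
    return f"{registry}/{image_name}:{folder_sha}"


def new_image_ref(registry, image_name, folder_sha):
    # New scheme: a single "ci-image" repository; the image name moves into the tag.
    return f"{registry}/ci-image:{image_name}-{folder_sha}"


# Placeholder values, purely illustrative.
registry = "registry.example.com/pytorch"
old_ref = old_image_ref(registry, "my-linux-image", "abc123")
new_ref = new_image_ref(registry, "my-linux-image", "abc123")
```

With the new scheme, adding an image only introduces a new tag, so no new repository has to be created.
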
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152209
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-05-05 18:02:29 +00:00
cyy
45efa1aaa8 [3/N] Use internal linkage in C++ files (#151297)
Follows #151070.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151297
Approved by: https://github.com/Skylion007
2025-05-05 17:48:39 +00:00
99287b170b Generate test reports for pytest when option is given (#152170)
The argument needs to be appended when test reports should be generated. IS_CI is not necessarily set, so rather check TEST_SAVE_XML instead as in other places where test reports are conditionally enabled.

See also https://github.com/pytorch/pytorch/issues/126523
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152170
Approved by: https://github.com/Skylion007
2025-05-05 17:46:40 +00:00
kyo
a21090a38c Fix incorrect citation of authors in documentation (#145209)
This PR corrects the citation of Adafactor authors "Noam Shazeer" and "Mitchell Stern" in the documentation.
The current text incorrectly lists them as "Shazeer, Noam, and Mitchell Stern," which seems to be a result of a data parsing issue of some reference manager(s) [as you can find many papers with the same issue](https://www.google.com/search?q=%22Shazeer%2C+Noam%2C+and+Mitchell+Stern%22).
The updated citation follows standard conventions for author names.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145209
Approved by: https://github.com/janeyx99
2025-05-05 17:45:05 +00:00
ea9ea02959 Log aot and idx waitcounters. (#152444)
Summary:
Added for create_aot_dispatcher_function and compile_fx_inner.

Note:
Log wait counters flag is already set for:
1. async_compile.precompile
2. remote_fx_graph_cache_get
3. remote_fx_graph_cache_put

Test Plan: contbuild

Differential Revision: D73866124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152444
Approved by: https://github.com/ppanchalia, https://github.com/masnesral
2025-05-05 17:35:29 +00:00
35475a3e07 Disable SLEEF implementation of vec::maximum in vec128_float_neon.h | Accelerate aten::hardtanh_ by 21x (#152538)
The `has_inf_nan` implementation in `vec::maximum` is scalar, and it slows down certain activations like `hardtanh` by almost 20 times. Additionally, the `vec::minimum` function simply uses NEON intrinsics and not SLEEF. This PR makes the two functions similar in implementation.

Besides, the SLEEF function `Sleef_fmaxf4` ultimately invokes the `vmaxq_f32` NEON intrinsic through [vmax_vf_vf_vf](d28232a309/src/arch/helperadvsimd.h (L253)).

From a single threaded profile of mobilenet on an Arm Neoverse-V2 machine (code below), the `aten::hardtanh_` takes **5.653ms** per function call while using the current PyTorch 2.7 wheel, whereas it takes **266.096us** per function call while simply using `vmaxq_f32` - a 21x speedup, and overall inference is 1.8x faster.
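
The quoted ratios can be checked with a few lines of arithmetic. The inputs are the per-call `aten::hardtanh_` times quoted above and the "Self CPU time total" values from the two profiler tables in this message; the variable names are only for this illustration.

```python
# Per-call time of aten::hardtanh_, converted to microseconds.
hardtanh_before_us = 5.653 * 1000  # 5.653 ms with the current wheel
hardtanh_after_us = 266.096        # 266.096 us with vmaxq_f32

hardtanh_speedup = hardtanh_before_us / hardtanh_after_us

# Total profiled time for 10 iterations, in seconds.
total_before_s = 4.254
total_after_s = 2.359

overall_speedup = total_before_s / total_after_s

print(f"hardtanh_ speedup: {hardtanh_speedup:.1f}x")
print(f"overall speedup:   {overall_speedup:.2f}x")
```
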
___

Run the below script: `OMP_NUM_THREADS=1 python profile_mobilenet.py --iterations 10`
<details >
<summary>profile_mobilenet.py</summary>

```
import torch
import torchvision.models as models
from torch.profiler import profile, record_function, ProfilerActivity
import argparse

torch.manual_seed(42)

def load_mobilenet():
    model = models.mobilenet_v2(pretrained=True)
    model.eval()
    return model

def generate_sample_input(batch_size=8):
    return torch.randn(batch_size, 3, 224, 224)

def warmup(model, sample_input, num_warmup=10):
    with torch.inference_mode():
        for _ in range(num_warmup):
            _ = model(sample_input)

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--batch_size', type=int, default=8)
    parser.add_argument('--iterations', type=int, default=100)
    return parser.parse_args()

def main():
    args = parse_args()
    model = load_mobilenet()

    sample_input = generate_sample_input(args.batch_size)
    print("Warming up...")
    warmup(model, sample_input)
    print("Warmup complete.")
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        with torch.inference_mode():
            for i in range(args.iterations):
                with record_function("model_inference"):
                    outputs = model(sample_input)

    print(prof.key_averages().table(sort_by="cpu_time_total"))
    print(f"Throughput: {(args.iterations * args.batch_size / (prof.profiler.self_cpu_time_total / 1e6)):.3f} images/s")

if __name__ == "__main__":
    main()
```

</details>

<details>
<summary>Profiler output using the current PyTorch 2.7 wheel </summary>

```
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                 model_inference         2.39%     101.839ms       100.00%        4.254s     425.437ms            10
                 aten::hardtanh_         0.02%     905.454us        46.50%        1.978s       5.653ms           350
                  aten::hardtanh         0.03%       1.239ms        46.48%        1.977s       5.650ms           350
                     aten::clamp        46.45%        1.976s        46.45%        1.976s       5.646ms           350
                    aten::conv2d         0.06%       2.468ms        43.89%        1.867s       3.591ms           520
               aten::convolution         0.06%       2.491ms        43.83%        1.865s       3.586ms           520
              aten::_convolution         0.13%       5.546ms        43.77%        1.862s       3.581ms           520
               aten::thnn_conv2d         0.04%       1.658ms        24.13%        1.027s       3.019ms           340
      aten::_slow_conv2d_forward        23.99%        1.021s        24.09%        1.025s       3.014ms           340
        aten::mkldnn_convolution        14.42%     613.285ms        19.51%     829.885ms       4.610ms           180
                aten::batch_norm         0.06%       2.368ms         6.89%     292.928ms     563.323us           520
    aten::_batch_norm_impl_index         0.11%       4.600ms         6.83%     290.560ms     558.769us           520
         aten::native_batch_norm         6.60%     280.762ms         6.69%     284.567ms     547.244us           520
                aten::contiguous         0.01%     623.099us         5.01%     213.152ms       1.184ms           180
                     aten::clone         0.02%     988.729us         5.00%     212.529ms       1.181ms           180
                     aten::copy_         4.94%     210.315ms         4.94%     210.315ms       1.052ms           200
                    aten::linear         0.00%      58.347us         0.18%       7.659ms     765.905us            10
                     aten::addmm         0.17%       7.373ms         0.18%       7.483ms     748.309us            10
                     aten::empty         0.17%       7.161ms         0.17%       7.161ms       1.790us          4000
                       aten::add         0.11%       4.742ms         0.11%       4.742ms      47.419us           100
                aten::empty_like         0.03%       1.315ms         0.09%       3.890ms       5.557us           700
                      aten::view         0.05%       1.933ms         0.05%       1.933ms       2.801us           690
               aten::as_strided_         0.04%       1.599ms         0.04%       1.599ms       8.885us           180
                   aten::resize_         0.04%       1.493ms         0.04%       1.493ms       2.871us           520
       aten::adaptive_avg_pool2d         0.00%      55.360us         0.04%       1.491ms     149.051us            10
                      aten::mean         0.00%     116.997us         0.03%       1.435ms     143.515us            10
                       aten::sum         0.02%     935.980us         0.02%     992.121us      99.212us            10
                    aten::detach         0.02%     707.217us         0.02%     707.217us       2.080us           340
                      aten::div_         0.00%     161.473us         0.01%     326.035us      32.604us            10
                        aten::to         0.00%     178.193us         0.01%     321.253us       0.892us           360
         aten::_nnpack_available         0.01%     302.835us         0.01%     302.835us       0.891us           340
                  aten::_to_copy         0.00%      63.170us         0.00%     143.060us      14.306us            10
                         aten::t         0.00%      49.759us         0.00%     117.621us      11.762us            10
                 aten::transpose         0.00%      40.637us         0.00%      67.862us       6.786us            10
                   aten::flatten         0.00%      42.634us         0.00%      58.867us       5.887us            10
                     aten::fill_         0.00%      56.141us         0.00%      56.141us       5.614us            10
                    aten::expand         0.00%      42.687us         0.00%      48.930us       4.893us            10
             aten::empty_strided         0.00%      40.589us         0.00%      40.589us       4.059us            10
                aten::as_strided         0.00%      33.468us         0.00%      33.468us       1.673us            20
              aten::resolve_conj         0.00%       9.066us         0.00%       9.066us       0.453us            20
                   aten::dropout         0.00%       5.782us         0.00%       5.782us       0.578us            10
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 4.254s

Throughput: 18.804 images/s
```

</details>

<details>
<summary>Profiler output after this PR's changes </summary>

```
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                 model_inference         4.43%     104.484ms       100.00%        2.359s     235.883ms            10
                    aten::conv2d         0.10%       2.313ms        79.19%        1.868s       3.592ms           520
               aten::convolution         0.10%       2.293ms        79.09%        1.866s       3.588ms           520
              aten::_convolution         0.23%       5.436ms        78.99%        1.863s       3.583ms           520
               aten::thnn_conv2d         0.08%       1.799ms        44.29%        1.045s       3.072ms           340
      aten::_slow_conv2d_forward        44.03%        1.039s        44.21%        1.043s       3.067ms           340
        aten::mkldnn_convolution        24.91%     587.584ms        34.47%     812.992ms       4.517ms           180
                aten::batch_norm         0.10%       2.350ms        11.83%     279.113ms     536.757us           520
    aten::_batch_norm_impl_index         0.20%       4.788ms        11.73%     276.764ms     532.238us           520
         aten::native_batch_norm        11.30%     266.660ms        11.46%     270.420ms     520.038us           520
                aten::contiguous         0.02%     575.723us         9.41%     222.080ms       1.234ms           180
                     aten::clone         0.04%       1.061ms         9.39%     221.504ms       1.231ms           180
                     aten::copy_         9.29%     219.131ms         9.29%     219.131ms       1.096ms           200
                 aten::hardtanh_         0.04%     917.669us         3.95%      93.133ms     266.096us           350
                  aten::hardtanh         0.05%       1.130ms         3.91%      92.216ms     263.474us           350
                     aten::clamp         3.85%      90.894ms         3.86%      91.086ms     260.246us           350
                    aten::linear         0.00%      68.681us         0.33%       7.899ms     789.945us            10
                     aten::addmm         0.32%       7.598ms         0.33%       7.707ms     770.673us            10
                     aten::empty         0.30%       7.176ms         0.30%       7.176ms       1.794us          4000
                       aten::add         0.20%       4.627ms         0.20%       4.627ms      46.268us           100
                aten::empty_like         0.06%       1.316ms         0.17%       3.973ms       5.676us           700
                      aten::view         0.08%       2.001ms         0.08%       2.001ms       2.899us           690
       aten::adaptive_avg_pool2d         0.00%      53.745us         0.07%       1.548ms     154.791us            10
                   aten::resize_         0.06%       1.533ms         0.06%       1.533ms       2.948us           520
               aten::as_strided_         0.06%       1.521ms         0.06%       1.521ms       8.450us           180
                      aten::mean         0.00%     117.637us         0.06%       1.494ms     149.417us            10
                       aten::sum         0.04%     973.291us         0.04%       1.013ms     101.342us            10
                    aten::detach         0.03%     652.224us         0.03%     652.224us       1.918us           340
                      aten::div_         0.01%     195.077us         0.02%     363.103us      36.310us            10
                        aten::to         0.01%     212.758us         0.02%     359.655us       0.999us           360
         aten::_nnpack_available         0.01%     295.235us         0.01%     295.235us       0.868us           340
                  aten::_to_copy         0.00%      68.726us         0.01%     146.897us      14.690us            10
                         aten::t         0.00%      53.873us         0.01%     124.033us      12.403us            10
                 aten::transpose         0.00%      42.512us         0.00%      70.160us       7.016us            10
                   aten::flatten         0.00%      44.040us         0.00%      66.631us       6.663us            10
                    aten::expand         0.00%      44.632us         0.00%      51.177us       5.118us            10
                     aten::fill_         0.00%      40.134us         0.00%      40.134us       4.013us            10
             aten::empty_strided         0.00%      35.291us         0.00%      35.291us       3.529us            10
                aten::as_strided         0.00%      34.193us         0.00%      34.193us       1.710us            20
              aten::resolve_conj         0.00%       8.594us         0.00%       8.594us       0.430us            20
                   aten::dropout         0.00%       6.758us         0.00%       6.758us       0.676us            10
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 2.359s

Throughput: 33.915 images/s
```

</details>

___

Using torchbench, the models `mobilenet_v2` and `mobilenet_v3_large` showed improvements as expected too.

Before -> After (latency in ms)
```
"mobilenet_v3_large-eval_latency": 1207.212 -> 844.902
"mobilenet_v2-eval_latency": 1029.834 -> 662.476
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152538
Approved by: https://github.com/Skylion007
2025-05-05 17:21:11 +00:00
131da0a982 Add a test for AsyncCollectiveTensor handling for maybe-view ops (#152688)
We never added a proper test for the fix from https://github.com/pytorch/pytorch/pull/134661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152688
Approved by: https://github.com/kwen2501
ghstack dependencies: #152195
2025-05-05 17:21:00 +00:00
5abe74857a SAC: fix recompute tag propagation for ops with list[tensor] inputs (#152195)
There's an "are we compiling" check in SAC, which we rely on to know when to propagate recompute tags during tracing.

This check was a bit brittle, and missed cases where input ops accept list of tensors - I updated it to check if a `FunctionalTensorMode` is active, which should be a 100% reliable way to know if AOTDispatcher is in the middle of running.

There is a long-standing followup here around unifying `torch.compiler.is_compiling()` to work in all cases. We should probably just update it to always check if FakeMode/FunctionalMode are active and use it there. This has a bit of BC risk though so I opted for the more local fix to SAC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152195
Approved by: https://github.com/soulitzer
2025-05-05 17:21:00 +00:00
7c96dd8f0c Add infra to run CPython tests under Dynamo (#150787)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150787
Approved by: https://github.com/zou3519
2025-05-05 17:20:14 +00:00
50fe1b2349 Implement async manifold cache write (#152452)
Summary: This diff implements an AsyncManifoldCache class that performs cache write and update-TTL operations asynchronously. Essentially, we are OK with a fire-and-forget approach where we don't guarantee that we can observe our own writes; in exchange, we get better runtime latency.
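
The fire-and-forget idea can be sketched with the standard library. This is not the AsyncManifoldCache implementation: `FireAndForgetCache` and `DictBackend` are hypothetical names, and an in-memory dict stands in for the remote store.

```python
from concurrent.futures import ThreadPoolExecutor


class DictBackend:
    """Hypothetical stand-in for a remote cache."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


class FireAndForgetCache:
    """Hypothetical wrapper: writes are queued to a background thread and not awaited."""

    def __init__(self, backend):
        self._backend = backend
        self._pool = ThreadPoolExecutor(max_workers=1)

    def get(self, key):
        return self._backend.get(key)

    def put(self, key, value):
        # Returns immediately; the caller never learns whether the write succeeded.
        self._pool.submit(self._backend.put, key, value)

    def close(self):
        # Drain pending writes, e.g. at shutdown or in a unit test.
        self._pool.shutdown(wait=True)


cache = FireAndForgetCache(DictBackend())
cache.put("key", b"value")
cache.close()  # after draining, the write is observable
result = cache.get("key")
```

A `get` issued right after `put` may miss the value, which is the trade-off the summary describes: lower latency on the write path in exchange for not observing one's own writes.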

Test Plan: added new unit test

Reviewed By: jamesjwu

Differential Revision: D73867797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152452
Approved by: https://github.com/jamesjwu
2025-05-05 16:45:48 +00:00
3196a3aca0 [CI] Use cmake from pip instead of conda in CI docker images (#152537)
As in title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152537
Approved by: https://github.com/cyyever, https://github.com/atalman
2025-05-05 16:32:40 +00:00
d119481717 [cutlass backend] Minor lru_cache to slightly speed up filtering ops (#152577)
For the default level, filtering went from 0.11332 seconds to 0.10064 seconds.

You can't really apply lru_cache too aggressively. For example, hashing a cutlass op takes a long time.
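
A minimal sketch of why the cache key matters, assuming a hypothetical `HeavyOp` class whose hash is expensive; it does not use the cutlass library. `functools.lru_cache` hashes its arguments on every call, so the sketch only caches a helper that takes a cheap string.

```python
from functools import lru_cache


class HeavyOp:
    """Hypothetical stand-in for an object that is expensive to hash."""

    hash_calls = 0

    def __init__(self, name, fields):
        self.name = name
        self.fields = tuple(fields)

    def __hash__(self):
        HeavyOp.hash_calls += 1
        return hash((self.name, self.fields))

    def __eq__(self, other):
        return (
            isinstance(other, HeavyOp)
            and (self.name, self.fields) == (other.name, other.fields)
        )


@lru_cache(maxsize=None)
def is_supported_layout(layout):
    # Cheap, hashable argument: a good fit for lru_cache.
    return layout in ("row", "column")


def keep_op(op, layout):
    # The heavy object is never passed to a cached function, so it is never hashed here.
    return is_supported_layout(layout) and op.name.startswith("gemm")


op = HeavyOp("gemm_example", range(1000))
first = keep_op(op, "row")
second = keep_op(op, "row")  # the layout check is served from the cache
```
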

Removing a log statement brings it down further, to 0.07202 seconds.

Differential Revision: [D73971021](https://our.internmc.facebook.com/intern/diff/D73971021/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152577
Approved by: https://github.com/chenyang78
2025-05-05 16:27:16 +00:00
9a9cc48c65 Update SGD documentation to match implementation (#149884)
Fixes #149476

This PR updates the pseudocode description of the SGD optimizer to better match the implementation.

Updated pseudocode:

![image](https://github.com/user-attachments/assets/2d7bc618-0408-4909-b835-af6465736918)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149884
Approved by: https://github.com/janeyx99
2025-05-05 16:06:17 +00:00
7a2df6a00b [PGNCCL] Add FP8 support (#152706)
NCCL added support for `Float8e4m3` and `Float8e5m2` in 2.24.

NVIDIA GPUs do not seem to support the following "no negative zero" versions: `Float8_e4m3fnuz` and `Float8_e5m2fnuz`, see https://onnx.ai/onnx/technical/float8.html. So we continue to error out for these two upon a reduction op.

Test plan:
- test_allreduce_float8
- test_reduce_scatter_float8

Resolves #148344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152706
Approved by: https://github.com/d4l3k, https://github.com/eqy, https://github.com/fduwjj, https://github.com/cyyever
2025-05-05 16:02:27 +00:00
a1516d9e6e Add "#pragma once" to CachingHostAllocator.h (#152800)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152800
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-05 15:21:14 +00:00
fe36d7dc44 [MPSInductor] Fix truncdiv implementation (#152788)
For integral dtypes it should be just an alias for division

Fixes `GPUTests.test_div7_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152788
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #152663, #152515, #152737, #152743, #152758
2025-05-05 13:31:51 +00:00
87f2bd2439 Remove conda usage in windows binary builds (#151035)
This is related to : https://github.com/pytorch/pytorch/issues/146048
Removing conda from windows binary builds. At this point we are only removing conda and replacing it with python builds. Not rewriting all batch files as python or bash.

Additionally cleanup unused files:
```
.ci/pytorch/windows/internal/static_lib_test.bat
.ci/pytorch/windows/internal/env_fix.bat
.ci/pytorch/windows/internal/vs_install.bat
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151035
Approved by: https://github.com/cyyever, https://github.com/clee2000, https://github.com/malfet
2025-05-05 13:09:05 +00:00
0a470dc7c1 [inductor] fix lowering for cummin, cummax for one element tensors (#151931)
Fixes https://github.com/pytorch/pytorch/issues/151738
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151931
Approved by: https://github.com/eellison
2025-05-05 13:05:59 +00:00
2825a28bf1 Exempt overriding methods from docstring_linter (fix #151692) (#151906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151906
Approved by: https://github.com/Skylion007
2025-05-05 12:39:42 +00:00
9210a98b92 [xla hash update] update the pinned xla hash (#152809)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152809
Approved by: https://github.com/pytorchbot
2025-05-05 11:21:11 +00:00
ac9fcd6346 [Inductor][CPU] bug fix for int8 GEMM compensation epilogue (#152408)
Fixes #152398

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152408
Approved by: https://github.com/leslie-fang-intel
2025-05-05 08:26:47 +00:00
7e637de9cb [Flight Recorder] Added logging after FR dump completed (#152648)
Summary: TSIA

Test Plan: eyes

Differential Revision: D74041147

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152648
Approved by: https://github.com/fduwjj, https://github.com/wdvr
2025-05-05 06:17:47 +00:00
0ffd31dc8a [MPS] Migrate div rounding modes (#152758)
By implementing `div_floor` and `div_trunc`. Do not mark `div_trunc` as OPMATH, to align the following output with CPU (if division is performed in fp32, then the result will be truncated to 25):
```
import torch
print(torch.tensor([[-7.4688, -3.1289]], dtype=torch.float16,device="cpu").div(torch.tensor([-0.2988, -0.8789], dtype=torch.bfloat16,device="cpu"), rounding_mode="trunc"))
tensor([[24.,  3.]])
```
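
For readers unfamiliar with the two rounding modes, here is a minimal sketch of how they differ. It uses Python doubles and the hypothetical helpers `div_trunc` and `div_floor`; it does not reproduce the Metal kernels or the half-precision behavior shown above.

```python
import math


def div_trunc(a: float, b: float) -> float:
    # Round the quotient toward zero.
    return float(math.trunc(a / b))


def div_floor(a: float, b: float) -> float:
    # Round the quotient toward negative infinity.
    return float(math.floor(a / b))


q_trunc = div_trunc(-7.0, 2.0)  # exact quotient is -3.5
q_floor = div_floor(-7.0, 2.0)
```

The two modes agree whenever the exact quotient is non-negative and can differ by one when it is negative and fractional.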

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152758
Approved by: https://github.com/dcci
ghstack dependencies: #152663, #152515, #152737, #152743
2025-05-05 03:02:29 +00:00
93d8f6ee32 [reland] Detailed triton kernel logging (#152694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152694
Approved by: https://github.com/Skylion007
2025-05-05 02:46:57 +00:00
a78eec88b8 Implement util function compute_global_tensor_shape for 1D device mesh (#152751)
### Summary

Recreating #151990 to mitigate easyCLA failure

compute_global_tensor_shape util function takes in local tensor shape, device mesh
and placements. We all gather the shapes from the shards and according to the placement
type we construct the global shape.

Note: currently only implemented for placement types Shard and Replicate; StridedShard is a TODO.
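
The shape reconstruction described above can be sketched without any distributed runtime. In this illustration the helper `global_shape_1d` is hypothetical, `local_shapes` stands in for the result of all-gathering each rank's local shape, and `shard_dim=None` is an assumed way of saying "Replicate"; the real utility takes a device mesh and placement objects instead.

```python
from typing import List, Optional, Sequence


def global_shape_1d(
    local_shapes: Sequence[Sequence[int]], shard_dim: Optional[int]
) -> List[int]:
    """local_shapes stands in for the all-gathered per-rank shapes.

    shard_dim is the sharded dimension, or None for a replicated tensor.
    """
    first = list(local_shapes[0])
    if shard_dim is None:
        # Replicate: every rank holds the full tensor.
        return first
    # Shard: sizes along shard_dim add up; other dims match the local shape.
    first[shard_dim] = sum(shape[shard_dim] for shape in local_shapes)
    return first


sharded = global_shape_1d([(3, 4), (3, 4), (2, 4)], shard_dim=0)
replicated = global_shape_1d([(8, 4), (8, 4)], shard_dim=None)
```

Gathering the shapes rather than multiplying the local size by the mesh size is what lets uneven shards (3, 3, 2 in the example) produce the right global size.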

### Test

`pytest test/distributed/tensor/test_utils.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152751
Approved by: https://github.com/XilunWu
2025-05-05 02:44:31 +00:00
30453d60dd Add methods for checking Triton availability to the device interface (#152529)
Adds the `is_triton_capable` and `raise_if_triton_unavailable` class methods to the device interface, to allow device types to run their own checks for Triton _capability_ (which means a device can actually support Triton in the first place) and _availability_ (if the correct backend of Triton is installed and is functional for the device).

Using the device interface allows us to do these checks in a device-agnostic way, allow external backends to attest their Triton support by simply implementing those methods. The intention is for this to back things like the `has_triton` utility method.
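
A minimal sketch of the capability/availability split, assuming argument-free class methods. Only the two method names come from this message; the class names, the signatures, and `has_triton_sketch` are illustrative and are not the real torch device interface.

```python
class DeviceInterfaceSketch:
    """Hypothetical base class; not the real torch device interface."""

    @classmethod
    def is_triton_capable(cls) -> bool:
        # Capability: can this device type support Triton at all?
        return False

    @classmethod
    def raise_if_triton_unavailable(cls) -> None:
        # Availability: is a working Triton backend installed for it?
        raise RuntimeError(f"Triton is not available for {cls.__name__}")


class MyBackendInterface(DeviceInterfaceSketch):
    """Hypothetical external backend attesting its Triton support."""

    @classmethod
    def is_triton_capable(cls) -> bool:
        return True

    @classmethod
    def raise_if_triton_unavailable(cls) -> None:
        # A real backend would probe its Triton installation here.
        return None


def has_triton_sketch(interface) -> bool:
    # Device-agnostic check built only on the two class methods.
    if not interface.is_triton_capable():
        return False
    try:
        interface.raise_if_triton_unavailable()
    except RuntimeError:
        return False
    return True
```

A caller such as `has_triton_sketch` never needs to know which device it is dealing with, which is the device-agnostic property the message describes.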

This has been split from #139171.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152529
Approved by: https://github.com/jansel
2025-05-05 00:55:53 +00:00
8dbe1ff34b Revert "Avoid triggering ignored requires_grad warning in our code (#152686)"
This reverts commit f51bee137518cde82e88ec655988e7eb1b94a3f3.

Reverted https://github.com/pytorch/pytorch/pull/152686 on behalf of https://github.com/wdvr due to failing internal test, discussed with author ([comment](https://github.com/pytorch/pytorch/pull/152686#issuecomment-2849497208))
2025-05-04 23:34:34 +00:00
49b9efdf1f [BE]: Cleanup traceutils with fmtlib (#152265)
Simplify code and make it faster.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152265
Approved by: https://github.com/albanD, https://github.com/cyyever
2025-05-04 22:27:19 +00:00
82cb202de7 [Inductor][NCU] Add kernel name filtering, and allow custom metrics (#150872)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150872
Approved by: https://github.com/FindHao

Co-authored-by: Yueming Hao <yhao@meta.com>
2025-05-04 20:49:19 +00:00
b117a6c47b Fix two error messages involving Tensor.dense() (#152631)
Two error messages in the codebase instruct the user to use `Tensor.dense()`. This method doesn't exist, but `Tensor.to_dense()` does, and this is what the user should be using instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152631
Approved by: https://github.com/jansel
2025-05-04 20:44:08 +00:00
220870ce9e [caffe2] Support building for armv8.1 (#152766)
Summary:
- Remove explicit `-march=` compiler flags, as they're already implied by
   the toolchain:
https://www.internalfb.com/code/fbsource/[7f85b0565073]/fbcode/tools/build/buck/wrappers/defs.bzl?lines=819
- Gate non-8.1 compliant opcodes with `__ARM_FEATURE_*`.

Test Plan: CI

Reviewed By: rahulg

Differential Revision: D74023601

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152766
Approved by: https://github.com/Skylion007
2025-05-04 19:09:21 +00:00
a69da90a9f Add pad limit of avg_poolnd and AvgPoolnd (#152680)
Fixes #152156

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152680
Approved by: https://github.com/mikaylagawarecki
2025-05-04 17:25:22 +00:00
cyy
370e23388d Set CMake 3.5 as minimum version in pytorch_android (#152769)
I saw pytorch_android failure in docker image builds. This fix attempts to bypass CMake 4 limitations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152769
Approved by: https://github.com/Skylion007
2025-05-04 16:57:22 +00:00
8f54e56e62 Add optional device index to AOTIModelPackageLoader (#152093)
This is my suggestion for resolving #152087

This PR extends the constructor of `AOTIModelPackageLoader` with an (optional) device index. The device type is still determined by `metadata_["AOTI_DEVICE_KEY"]`, but the `device_index` argument can be used to move an AOTI model package to different devices like `cuda:0`, `cuda:1`, ... in a convenient way. AFAIK, this is not possible so far using `AOTIModelPackageLoader` alone. The default case (no device index specified) with `metadata_["AOTI_DEVICE_KEY"] == "cuda"` would lead to the current behavior, i.e., the model is loaded to device `cuda`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152093
Approved by: https://github.com/desertfire
2025-05-04 11:40:12 +00:00
fd8fd01d25 [OpenReg] Add _lazy_init and rng_state support for OpenReg (#151914)
As the title stated.

**Changes**:
- Add get_rng_state & set_rng_state support for OpenReg
- Add _lazy_init support for OpenReg
- Remove redundant code for cuda/Module.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151914
Approved by: https://github.com/albanD
2025-05-04 09:42:08 +00:00
c8bac51ec1 Remove the unnecessary cuda/Tensor.cpp (#152522)
As the title stated.

**Question:**

I have carefully looked through all the .h files in Tensor.cpp and from my perspective this file does not make sense. Does anyone know what the background is for doing this?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152522
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/eqy
ghstack dependencies: #152512, #152513, #152521
2025-05-04 07:15:11 +00:00
8562457cba Make torch/csrc/utils.h to be device-agnostic (#152521)
`torch/csrc/utils.h` should be device-independent. Currently, it contains CUDA-related implementations, which indirectly causes the [failure of ROCm testing](https://github.com/pytorch/pytorch/pull/151914#issuecomment-2839691038) (the reason is that the ROCm test environment shouldn't expose HIP-related header files, which causes the JIT compilation to fail during testing).

Therefore, move CUDA-related implementations to `torch/csrc/cuda/utils.h`.

**Question:**
This change may introduce a BC break.
I searched for this function globally on GitHub and I think the impact is very small.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152521
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #152512, #152513
2025-05-04 07:15:11 +00:00
e889937850 [MPS] Migrate div to Metal (#152743)
TODOs:
 - Verify accuracy of  `metal::dot` vs `x.x*x.x + y.y*y.y`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152743
Approved by: https://github.com/dcci, https://github.com/Skylion007
ghstack dependencies: #152663, #152515, #152737
2025-05-04 00:56:19 +00:00
8faa225695 Revert "[inductor] Realize bucketize/searchsorted output (#152644)"
This reverts commit 9ae4906b21cbd186a493a9564e22a42da2184e3a.

Reverted https://github.com/pytorch/pytorch/pull/152644 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/152644#issuecomment-2848743442))
2025-05-03 18:16:39 +00:00
6ae690f8f0 add support for 0 size shardedTensor and recalculate metadata from all_gather (#152583)
Summary:
change set
1. A ShardedTensor could initially have size 0. The current check does not pass if the size is 0, so this case is added here.
2. When we call ShardedTensor._init_from_local_shards, it assumes all the metadata is correct and uses all_gather to double-check. In the new case, the metadata could all have size 0 while the tensor has an actual size, so we need the ability to recalculate the local/global metadata from the local tensor by all-gathering the information.

Test Plan: I don't see an associated UT; I have tested this with the diff stack D73274786.

Differential Revision: D73903933

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152583
Approved by: https://github.com/q10, https://github.com/fduwjj
2025-05-03 17:26:29 +00:00
762844355e Make DispatchKeySet serializable; add __eq__ (#152732)
These seem like reasonable things to add. Also fixes a bug in vLLM for
me.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152732
Approved by: https://github.com/bdhirsh
2025-05-03 14:40:06 +00:00
792736f9ac [BE][MPS] Pass alpha by reference (#152737)
As it's always a scalar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152737
Approved by: https://github.com/dcci
ghstack dependencies: #152663, #152515
2025-05-03 08:31:45 +00:00
cc28b43950 Revert "[ROCm] Upgrade ROCm CI to ROCm6.4 (#151368)"
This reverts commit 844842dfbf937c43b41c528e461d3f3931bca6e9.

Reverted https://github.com/pytorch/pytorch/pull/151368 on behalf of https://github.com/malfet due to This broke inductor cpp wrapper ([comment](https://github.com/pytorch/pytorch/pull/151368#issuecomment-2848519706))
2025-05-03 08:31:31 +00:00
457fa820ad [c10d] Fix extra CUDA context created by barrier (#149144)
Fixes #149119.

In ProcessGroup.hpp, we create a dummy tensor for dispatching. This
requires a correct device index. This PR uses `device_id` given by user
when calling `init_process_group`.

This PR also uses `torch._C._get_accelerator()` to determine the device
type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149144
Approved by: https://github.com/XilunWu, https://github.com/fduwjj, https://github.com/cyyever
2025-05-03 03:13:34 +00:00
34e9f0b5c6 [MPS] Migrate mul to TensorIterator (#152515)
What was initially supposed to be a very straightforward change resulted in a small refactor of binary op tensor generators when invoked for mixed dtypes, which surfaced via the `test_output_grad_match_sinc_mps_float16` test failure.

If operands are of different dtype (in particular float16 tensor and float32 scalar), one must perform an operation with `opmath_t` (or `TensorIterator::common_dtype()`) precision, rather than casting both operands to output dtype and performing it then, which can be demonstrated via the following example:
```
>>> torch.tensor([-1.8633, 6.2031, -2.2500, -3.3926,  8.5938,  5.9766], dtype=torch.half).mul(torch.pi)
tensor([ -5.8555,  19.4844,  -7.0703, -10.6562,  27.0000,  18.7812],
       dtype=torch.float16)
>>> torch.tensor([-1.8633, 6.2031, -2.2500, -3.3926,  8.5938,  5.9766], dtype=torch.half).mul(torch.tensor(torch.pi, dtype=torch.float16))
tensor([ -5.8516,  19.4844,  -7.0664, -10.6562,  26.9844,  18.7656],
       dtype=torch.float16)
```
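
The same effect can be reproduced without torch for a single element. This sketch assumes Python's `struct` format `"e"` (IEEE binary16) as a stand-in for `torch.half` and a Python double as the wider compute type in place of `opmath_t`; the helper `to_half` is hypothetical.

```python
import math
import struct


def to_half(x: float) -> float:
    # Round a Python float to the nearest IEEE binary16 value.
    return struct.unpack("e", struct.pack("e", x))[0]


x = to_half(8.5938)

# Cast the scalar to the output dtype first, then multiply.
low_precision = to_half(x * to_half(math.pi))

# Multiply at higher precision, round once at the end.
opmath = to_half(x * math.pi)
```

Here `low_precision` is 26.984375 and `opmath` is 27.0, matching the fifth element of the two tensors printed above: rounding pi to half precision before the multiply loses enough bits to change the final rounded result.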

Solve this problem for now by introducing `REGISTER_OPMATH_BINARY_OP`, which indicates that operands must be cast to `opmath_t` before performing the computation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152515
Approved by: https://github.com/Skylion007, https://github.com/kulinseth, https://github.com/dcci
ghstack dependencies: #152663
2025-05-03 02:35:03 +00:00
1cd68c59dd Remove incorrect assertion (#152653)
It's only aspirational that the 'improvement' value is positive. In fact
the pass could make a collective more exposed and we shouldn't assert
here in that case

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152653
Approved by: https://github.com/eellison
ghstack dependencies: #152565
2025-05-03 02:33:58 +00:00
84aa0985fb [Inductor] Add decomposeK as an autotuning choice for mm (#150654)
As a result of adding subgraph as a choice to inductor https://github.com/pytorch/pytorch/pull/149761 and enabling FP32 output from PyTorch GEMMs from FP16/BF16 inputs: https://github.com/pytorch/pytorch/pull/150812, this PR enables decompose_k as an autotuning choice for Inductor in generating the fastest matmuls with Triton. DecomposeK is currently only enabled for `torch.compile`.

Followups:
* decompose_k does not currently support epilogue fusion, which will take some work to enable
* Enable autotuning the bmm with Triton Templates as well without requiring tons of more compile time, async compilation. Anecdotal evidence shows that Triton BMM performs better usually than aten BMM
* Add for addmm
* Enable for Inference and AOTI

Below are the results of running TritonBench for Split-K shapes, comparing the aten performance versus pt2_triton, which now autotunes on decompose_k, seeing >10% speedup compared to aten on average, and for some shapes over 3x the performance of the best Triton mm previously:

<img width="929" alt="Screenshot 2025-04-28 at 9 15 39 PM" src="https://github.com/user-attachments/assets/27d85bbc-4f3a-43a6-a8fa-d4a5bbb8c999" />

TorchInductor Benchmark Dashboard:
<img width="1727" alt="Screenshot 2025-04-30 at 2 02 53 PM" src="https://github.com/user-attachments/assets/4acd7ffc-407f-4cfd-98bb-2e3d8b1f00b3" />

We see speedups across all runs for training. Compile time increased as expected, with more `mm` options to tune over.

Differential Revision: [D73820115](https://our.internmc.facebook.com/intern/diff/D73820115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150654
Approved by: https://github.com/eellison
2025-05-03 02:23:54 +00:00
5e9682719f [Inductor UT] Generalize device-bias code in test_flex_attention.py (#151937)
@EikanWang @etaf @guangyey please take a look

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151937
Approved by: https://github.com/drisspg
2025-05-03 01:12:49 +00:00
73b6b1ded4 [inductor][invoke_subgraph] Free the buffers before the subgraph call (#152494)
Before
![image](https://github.com/user-attachments/assets/62b24c14-69e6-40fb-94e3-223930132ef6)

After
![image](https://github.com/user-attachments/assets/9f340d4e-80a9-45aa-9400-626fff5b5ecd)

tlparse - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmph5dwWt/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152494
Approved by: https://github.com/Skylion007, https://github.com/eellison
2025-05-03 00:38:08 +00:00
36140e01fd Rename "startup-tracing-compile" to "compile-time" in label_to_label.yml (#152711)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152711
Approved by: https://github.com/oulgen
2025-05-03 00:35:05 +00:00
3d777bae10 Inductor respects exact strides on custom ops by default (#150511)
If a tag is not specified on a custom operator, then inductor will
assume that it needs exact strides.

Test Plan:
- tests + CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150511
Approved by: https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #148104
2025-05-03 00:02:24 +00:00
2b37a726e0 Refactor layout constraint selection logic (#148104)
This PR:

- cleans up some existing comments that don't make sense anymore
- hooks up the "custom_op_default_layout_constraint" back (that seems to
have broken)
- cleans up the "lazy registration path" which seems to never get hit
anymore
- adds dislike_padding to nodes that require exact strides

Test Plan:
- tests + CI

disable padding

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148104
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-05-03 00:02:24 +00:00
0e59b594ee [SymmMem] Use cub's BlockScan instead of in-house impl for offset calculation (#151993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151993
Approved by: https://github.com/ngimel
ghstack dependencies: #151261, #151498, #151819
2025-05-02 23:40:47 +00:00
2107d87dc9 [BE] remove outdated warning about TORCH_CUDA_ARCH_LIST (#152715)
I saw this warning when compiling a 3rd party lib and did not agree with it. I'm not sure the original reason why we would want to force people to pass in TORCH_CUDA_ARCH_LIST to cmake vs set it as an env var. As a developer, it's much easier to set it as an env var or have it be autodetected. I also realized this warning was from before 2018!!! 7 years ago! And there are no plans to actually enforce this (nor should there be), so let's remove this misleading warning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152715
Approved by: https://github.com/malfet, https://github.com/zou3519
2025-05-02 23:00:51 +00:00
a6ea63a841 [FlexAttention] explicitly create grad_q w/ strides (#152641)
Fixes: #147463

Inductor's lowering for empty_like does not match the behavior of eager: the strides do not match those produced by preserve format.

https://github.com/pytorch/pytorch/issues/144699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152641
Approved by: https://github.com/xmfan
2025-05-02 22:57:26 +00:00
54f29b04d6 Improve error wording in _link_check.yml (#152726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152726
Approved by: https://github.com/huydhn
2025-05-02 22:43:05 +00:00
730a077d48 [ROCm] Unskipped test_rnn_dropout_state for ROCm (#152339)
Unskipping the test, should work fine now.

Related PR: https://github.com/pytorch/pytorch/pull/144572

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152339
Approved by: https://github.com/jeffdaily
2025-05-02 22:02:30 +00:00
ea12a38668 [associative_scan] Refactoring of input checking and dynamo invocation (#148657)
This PR is the counterpart of https://github.com/pytorch/pytorch/pull/142125 for the associative_scan operation. It changes the way the input checks are performed: the combine_fn is no longer invoked in the frontend to check the output trees; dynamo is used for that instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148657
Approved by: https://github.com/ydwu4
2025-05-02 21:39:28 +00:00
8afe40bc5e [Inductor] Fix kernel argument ordering when using dynamic shapes with workspace (#152660)
Summary:
This PR fixes a bug in the Triton kernel invocation path where the `workspace_tensor` was inserted before the unpacked `extra_args` list in the final kernel argument list. This broke the expected ordering of arguments when dynamic shape size hints are emitted.

When dynamic shapes are used, `extra_args` contains both size hint arguments and grid arguments. The kernel expects the argument list to follow the order: **size hints → workspace tensor → grid args**. But previously, the `workspace_tensor` was inserted before unpacking `extra_args`, resulting in: **workspace tensor → size hints → grid args**, which is incorrect.

This fix constructs the workspace tensor earlier, allowing it to be slotted in after the size hints and before the grid arguments, restoring the expected argument layout.
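
The intended ordering can be sketched as a small list-building helper. Everything here (the function `build_kernel_args`, its parameters, and the placeholder argument names) is hypothetical and only illustrates the ordering, not Inductor's wrapper code.

```python
def build_kernel_args(call_args, size_hints, grid_args, workspace=None):
    # Hypothetical helper: all parameter names are illustrative.
    args = list(call_args) + list(size_hints)
    if workspace is not None:
        # Workspace goes after the size hints and before the grid args.
        args.append(workspace)
    args.extend(grid_args)
    return args


args = build_kernel_args(
    call_args=["in_ptr0", "out_ptr0"],
    size_hints=["s0", "s1"],
    grid_args=["grid_x"],
    workspace="workspace_0",
)
```

Keeping size hints and grid args as separate inputs, instead of one pre-merged `extra_args` list, is what makes it possible to slot the workspace in between them.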

Test Plan:
contbuild and OSS CI

Reviewers: paulzhan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152660
Approved by: https://github.com/PaulZhang12, https://github.com/drisspg
2025-05-02 21:32:07 +00:00
add4702ebc [Inductor] Introduce Wrapper IR line for symbolic call args (#152587)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/146942.

This PR introduces a new wrapper IR line to represent symbolic call args. This deletes a little bit of duplicated code between the Python and C++ backends. In the main PR, having a Wrapper IR line for this also tells the FX backend what this part of the wrapper code is doing. Before this PR, symbolic call args generated raw Python lines, which confuse the FX converter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152587
Approved by: https://github.com/jansel
2025-05-02 20:37:00 +00:00
9ae4906b21 [inductor] Realize bucketize/searchsorted output (#152644)
**Context**:
bucketize is relatively expensive, computationally. So it's not always profitable to fuse it if it means doing extra computation. For example, this repro:

https://gist.github.com/davidberard98/7fd6af7e6291787c246c705945a25554

shows a slowdown from 56us (eager) to ~100us (torch.compile-d): instead of computing 2\*\*15 binary searches, the fused version does 2\*\*15 * 384 - one for each of the broadcasted outputs.

**Solution**:
Realize the output of bucketize (and searchsorted, which also uses inductor's ops.bucketize). If there's an opportunity to do non-broadcasted fusions, the scheduler can still apply such fusions later on.

After this PR, instead of a slowdown, we see an improvement from 56us (eager) to 33us (compiled).
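
A toy version of the trade-off, assuming `bisect_left` as a stand-in for the binary search and a broadcast factor of 4 in place of the 384 from the repro. All names and values are illustrative.

```python
from bisect import bisect_left

boundaries = [0.0, 1.0, 2.0, 3.0]
values = [0.5, 2.5, 1.5]
broadcast = 4  # stand-in for the 384 broadcasted outputs in the repro

# Fused: the search is recomputed for every broadcasted output element.
fused_searches = len(values) * broadcast

# Realized: search once per value, then reuse the stored bucket indices.
buckets = [bisect_left(boundaries, v) for v in values]
realized_searches = len(values)
expanded = [[b] * broadcast for b in buckets]
```

Realizing the output costs one extra buffer write and read, but the number of binary searches no longer scales with the broadcast factor.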

Differential Revision: [D74036850](https://our.internmc.facebook.com/intern/diff/D74036850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152644
Approved by: https://github.com/benjaminglass1, https://github.com/eellison
2025-05-02 20:31:17 +00:00
74b496e54c Cleanup DeviceInterface in triton test (#152409)
- Remove inherited functions
- Return valid device_count (1 device: idx=0)
- Remove unused function `triton_supported`

Followup to #144399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152409
Approved by: https://github.com/jansel
2025-05-02 20:25:32 +00:00
44f29a3669 Add parameters for monitor (#152541)
Add log interval and log-data-collect interval to all test yml

Add upload step for all test yml files

next step:
enable perf test with utilization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152541
Approved by: https://github.com/huydhn
2025-05-02 20:24:11 +00:00
ec68d082a1 [CUDA][TF32] Account for TF32 in test_conv2d_same_padding (#152618)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152618
Approved by: https://github.com/msaroufim, https://github.com/Skylion007
2025-05-02 20:19:00 +00:00
39c0b01970 [ez] Disable failing test in periodic no gpu no avx (#152698)
Failing on periodic after it was added in #152542
Ex
inductor/test_cpu_repro.py::CPUReproTests::test_tanh_atan2_use_decompose_tanh [GH job link](https://github.com/pytorch/pytorch/actions/runs/14775755628/job/41485185829) [HUD commit link](6f6acb4128)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152698
Approved by: https://github.com/huydhn, https://github.com/hl475
2025-05-02 20:02:48 +00:00
a6dd1c2208 [DCP] Add 30min timeout for IPC communications in async checkpointing (#152629)
Summary:
### Diff Context
- Sometimes the background process can get stuck processing an async checkpoint request, and trainer shutdown can occur before the background process completes.
- Fix: time out the thread while it reads the IPC queue for a response from the background process.
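
A minimal sketch of a bounded wait on a response queue, assuming a standard-library `queue.Queue` in place of the actual IPC queue. The helper `wait_for_response` and the message string are hypothetical.

```python
import queue


def wait_for_response(responses: "queue.Queue", timeout_s: float):
    # Block for at most timeout_s seconds instead of waiting forever.
    try:
        return responses.get(timeout=timeout_s)
    except queue.Empty:
        raise TimeoutError(
            f"no response from background process within {timeout_s} s"
        ) from None


responses = queue.Queue()
responses.put("checkpoint_done")
result = wait_for_response(responses, timeout_s=1.0)
```

With the 30-minute limit from the title, a caller would pass `timeout_s=30 * 60`; a stuck background process then surfaces as an error instead of blocking shutdown.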

Differential Revision: D74017700

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152629
Approved by: https://github.com/saumishr
2025-05-02 19:36:22 +00:00
5d860c1e54 [ROCm][CI] Enabled fp8 distributed tests in test_micro_pipeline_tp.py for MI300 (#151977)
This PR enabled fp8 distributed tests on MI300.
For testing the added feature, ran distributed.tensor.parallel.test_micro_pipeline_tp test and all the tests passed successfully, and no tests were skipped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151977
Approved by: https://github.com/jeffdaily
2025-05-02 19:22:18 +00:00
d457b4492d Optimize Sequential methods description (#147304)
Fixes #146892

Add methods description and examples for [`Sequential` document](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html)

## Test Result

### Before

![image](https://github.com/user-attachments/assets/3121a06f-02ed-4362-ad0a-f055bb43d469)

### After

![image](https://github.com/user-attachments/assets/66f6bb55-5298-4062-8f7f-7a7f4c1e16d9)
![image](https://github.com/user-attachments/assets/a5275a4c-4214-4518-b7a2-dff21954f368)
![image](https://github.com/user-attachments/assets/9c40d1fb-114a-4d14-a3c4-1143a131660e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147304
Approved by: https://github.com/mikaylagawarecki
2025-05-02 19:18:58 +00:00
eqy
216d81da81 [CUDA][complex] skip test_reference_numerics_large_jiterator_unary_cuda_complex64 on CUDA (#148024)
already skipped on ROCM for a similar reason, recent numpy versions changed convention from `nan+infj` to `-inf+infj`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148024
Approved by: https://github.com/nWEIdia, https://github.com/atalman, https://github.com/malfet
2025-05-02 19:11:11 +00:00
16153a0f27 [AOTAutogradCache][Easy] Move "einops.einops.rearrange" to SAFE_NON_TORCH_FUNCTIONS (#152640)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152640
Approved by: https://github.com/oulgen, https://github.com/zou3519, https://github.com/bdhirsh
2025-05-02 19:09:30 +00:00
0488883d6e [cuDNN][SDPA] Fix head-dim 256 condition for SM 10.0 (#152076)
turns out the backward is not supported yet, whoops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152076
Approved by: https://github.com/drisspg
2025-05-02 18:43:33 +00:00
07290bdcdc Skip search for MKL on ARM cpus (#145850)
MKL will not be found anyway, and skipping the search makes the CMake log a bit easier to parse on non-x86 systems.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145850
Approved by: https://github.com/atalman
2025-05-02 18:39:49 +00:00
1ea2731e26 [ROCm] Add support for SymmetricMemory (#150580)
This is an attempt to re-land the initial PR https://github.com/pytorch/pytorch/pull/134817 with recent design changes from upstream.

**NOTE:**
ROCm currently does NOT have multicast/multimem hardware support, so those features are disabled in symmetric memory for ROCm. This also means that we currently do not have a way of lowering add + all_reduce + wait_tensor into the one_shot_all_reduce op in inductor, as it depends on multicast buffer support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150580
Approved by: https://github.com/jeffdaily, https://github.com/kwen2501, https://github.com/yoyoyocmu

Co-authored-by: Xiaodong Wang <xdwang@fb.com>
2025-05-02 18:35:14 +00:00
376529c78b consolidate guard_or_x and definitely_x (#152463)
definitely_true is almost the same as guard_or_false; the potential differences are not meaningful enough to justify the
existence of both. The same goes for definitely_false, which can be expressed with guard_or_true and guard_or_false.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152463
Approved by: https://github.com/bobrenjc93
2025-05-02 18:08:11 +00:00
72337bdcf2 [ATen][CUDA] Optimize 128 bit vectorization (#148320)
Fixes #147376.
As per request: https://github.com/pytorch/pytorch/pull/145746#pullrequestreview-2642118301
This PR excludes sm80 and older from using vec8 kernels due to long compilation time and large binary size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148320
Approved by: https://github.com/eqy, https://github.com/malfet, https://github.com/atalman
2025-05-02 17:35:44 +00:00
3baa85cfad [StaticCudaLauncher] Ensure cuda context exists before launching kernels (#152667)
Triton does this already due to  https://github.com/triton-lang/triton/pull/3731/files, in order to fix https://github.com/pytorch/pytorch/issues/124565. We need to do the same thing as triton here, so that in cases with no compilation we still have a cuda context in the backward autograd thread.

Fixes https://github.com/pytorch/pytorch/issues/152639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152667
Approved by: https://github.com/oulgen
2025-05-02 17:29:57 +00:00
f51bee1375 Avoid triggering ignored requires_grad warning in our code (#152686)
This one is ok to silence as we're just doing formatting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152686
Approved by: https://github.com/Skylion007
2025-05-02 17:27:47 +00:00
844842dfbf [ROCm] Upgrade ROCm CI to ROCm6.4 (#151368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151368
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-02 17:21:18 +00:00
f65fb0a23d Make PGO code state not sensitive to file path by hashing file content when the file is available. (#152628)
In some internal frameworks, on second attempts the code is copied to a different path than on previous attempts,
but it is still the same code. PGO does not work in those cases because, before this PR, state entries were identified by (filepath, function name, line number).

After this PR they are identified by (hash(filepath), function name, line number). This way PGO works for those jobs on future attempts, and re-compilation of static versions is avoided.

Sometimes we do not have access to the source code (the file does not exist).
This seems to happen mostly when we re-trace a compiled function, but it can happen in general.
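
A minimal sketch of that keying scheme, assuming the hash is taken over the file content (as the title says) with a fallback to the raw path when the file cannot be read. `code_state_key` is a hypothetical helper, and the fallback is an assumption rather than the PR's confirmed behavior:

```python
import hashlib


def code_state_key(filepath: str, func_name: str, lineno: int) -> tuple:
    """Build a key that survives the same source being copied to a new path."""
    try:
        with open(filepath, "rb") as f:
            # Identify the file by its content rather than its location.
            file_id = hashlib.sha256(f.read()).hexdigest()
    except OSError:
        # Assumed fallback: no source available, so use the path itself.
        file_id = filepath
    return (file_id, func_name, lineno)
```

Two copies of the same file in different directories then map to the same key, while a missing file still gets a stable key.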

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152628
Approved by: https://github.com/oulgen
2025-05-02 17:11:21 +00:00
ea4b7e0e1d [invoke_subgraph] Simplify output code for subgraph output node (#152490)
Before - [manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmppQg3F8/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmppQg3F8/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000)
![image](https://github.com/user-attachments/assets/8fecdc23-eb78-4e15-9d03-c4bae4b49434)

After fix - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp9a5EM0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
![image](https://github.com/user-attachments/assets/8e98120c-d82e-42dc-bc50-a6bfd4f9923c)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152490
Approved by: https://github.com/eellison
ghstack dependencies: #152383
2025-05-02 16:31:25 +00:00
5c0f474dac Do not check out nccl when not building it (#152533)
Add additional conditions to `build_pytorch_libs.py` to avoid fetching NCCL when `USE_CUDA` or `USE_NCCL` are disabled. While at it, adjust the existing condition for `USE_SYSTEM_NCCL` to use the utility function.
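
A rough sketch of the resulting decision, assuming the usual truthy/falsy spellings for build flags. The helper names here are hypothetical and are not the utility functions used in `build_pytorch_libs.py`:

```python
import os

_FALSY = {"0", "OFF", "NO", "FALSE", "N"}
_TRUTHY = {"1", "ON", "YES", "TRUE", "Y"}


def flag_disabled(name: str, env=os.environ) -> bool:
    """True when the flag is explicitly set to a falsy value."""
    return env.get(name, "").upper() in _FALSY


def flag_enabled(name: str, env=os.environ) -> bool:
    """True when the flag is explicitly set to a truthy value."""
    return env.get(name, "").upper() in _TRUTHY


def should_checkout_nccl(env=os.environ) -> bool:
    # Skip the fetch when CUDA or NCCL is turned off.
    if flag_disabled("USE_CUDA", env) or flag_disabled("USE_NCCL", env):
        return False
    # Skip the fetch when a system-provided NCCL is requested.
    if flag_enabled("USE_SYSTEM_NCCL", env):
        return False
    return True
```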

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152533
Approved by: https://github.com/albanD
2025-05-02 16:31:03 +00:00
f6761f2968 [inductor][subgraph] Simplify the resulting output code for subgraph (#152383)
Check out output code

Before this PR -  - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp3iXDVs/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
![image](https://github.com/user-attachments/assets/ef86eb8f-e8b9-47dd-8609-f90481f018b8)

After this PR - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpRgUJvq/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

![image](https://github.com/user-attachments/assets/10e22c60-7fb9-4519-9d54-019beff5333b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152383
Approved by: https://github.com/eellison
2025-05-02 15:52:34 +00:00
cb0cf7e5c7 [MPS][BE] Do not dispatch empty kernels (#152663)
If `iter.numel()` is zero, there is no need to dispatch a kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152663
Approved by: https://github.com/kulinseth
2025-05-02 14:34:53 +00:00
50d4698ac8 Revert "[cutlass backend] Minor lru_cache to slightly speed up filtering ops (#152577)"
This reverts commit 1fef3cdabc3f79fd0cbf9273052057ef6122710f.

Reverted https://github.com/pytorch/pytorch/pull/152577 on behalf of https://github.com/wdvr due to failing test_unary_ufuncs.py::TestUnaryUfuncsCUDA::test_reference_numerics_large_jiterator_unary_cuda_complex64 [GH job link](https://github.com/pytorch/pytorch/actions/runs/14787347116/job/41519095088) [HUD commit link](1fef3cdabc) ([comment](https://github.com/pytorch/pytorch/pull/152577#issuecomment-2846544603))
2025-05-02 07:25:25 +00:00
cyy
e9e1aacef8 Enable -Wunused on torch targets (#150077)
For GCC, ``-Wunused`` contains:
```
-Wunused-function
Warn whenever a static function is declared but not defined or a non-inline static function is unused.

-Wunused-label
Warn whenever a label is declared but not used.
To suppress this warning use the unused attribute.

-Wunused-parameter
Warn whenever a function parameter is unused aside from its declaration.
To suppress this warning use the unused attribute.

-Wunused-variable
Warn whenever a local variable or non-constant static variable is unused aside from its declaration
To suppress this warning use the unused attribute.
```
For Clang, some of the diagnostics controlled by ``-Wunused`` are enabled by default:
```
Controls [-Wunused-argument](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-argument),
[-Wunused-but-set-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-but-set-variable),
[-Wunused-function](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-function),
[-Wunused-label](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-label), [-Wunused-lambda-capture](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-lambda-capture),
[-Wunused-local-typedef](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-local-typedef),
[-Wunused-private-field](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-private-field),
[-Wunused-property-ivar](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-property-ivar),
[-Wunused-value](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-value), [-Wunused-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-variable).
```
These checks are all useful. This PR aims to enable ``-Wunused`` without breaking code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150077
Approved by: https://github.com/zou3519, https://github.com/wdvr
2025-05-02 07:14:19 +00:00
38a9a8b7f7 Fix: Consider input defined unbacked during inductor codegen for runtime asserts (#152231)
When we use mark_unbacked, the graph has an unbacked input SymInt. Right now,
deferred runtime assertions that use such symbols are never generated.

This PR changes that: in the forward graph we now consider those symbols and generate the corresponding
runtime assertions. We still ignore them for backward, which is not ideal.

We generate runtime assertions by emitting each one once all the defined unbacked symbols used
in it have been seen.
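
A simplified sketch of that emission rule, assuming each graph node is a `(name, defined_symbols)` pair and each deferred assertion is a `(text, needed_symbols)` pair. The data layout and the name `emit_ready_asserts` are illustrative only and do not reflect the inductor codegen:

```python
def emit_ready_asserts(nodes, pending):
    """Walk nodes in order and emit an assertion once its symbols are all seen."""
    seen = set()
    out = []
    remaining = list(pending)
    for name, defined in nodes:
        out.append(name)
        seen |= set(defined)
        still_waiting = []
        for text, needed in remaining:
            if set(needed) <= seen:
                # Every symbol the assertion uses is now defined.
                out.append(f"assert {text}")
            else:
                still_waiting.append((text, needed))
        remaining = still_waiting
    return out, remaining


nodes = [("placeholder x", {"u0"}), ("item", {"u1"}), ("add", set())]
pending = [("u0 >= 0", {"u0"}), ("u0 + u1 < 10", {"u0", "u1"})]
emitted, leftover = emit_ready_asserts(nodes, pending)
```

With placeholders included in the walk, the assertion on `u0` is emitted right after the input that defines it. If placeholders were skipped, that assertion would never become ready.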

We previously skipped placeholders because for backward we have a wacky approach where we
ignore input-defined unbacked symbols, assume that assertions that use them were already emitted
in forward, and try to emit all other runtime assertions again; see Note [Backwards runtime asserts].

Doing that, we end up emitting only the runtime assertions that depend on things defined solely in backward, but we could miss checks that span symbols defined in both forward and backward (i.e. one symbol defined in forward and passed as an input to backward, and another defined in backward). This is not ideal; a better approach could be something like https://github.com/pytorch/pytorch/pull/151919, but it requires more work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152231
Approved by: https://github.com/aorenste
2025-05-02 07:01:48 +00:00
829752ba37 [SymmMem] Add all_to_all_vdev (#151819)
Merge in/out splits into one tensor

Multi-block

Use sync instead of barrier

Use nvshmemx_collective_launch

Rotate blocks among peer

write back input splits

Parallel scan works

Use scan for output offsets

Use at most 16 blocks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151819
Approved by: https://github.com/ngimel, https://github.com/fduwjj
ghstack dependencies: #151261, #151498
2025-05-02 06:59:21 +00:00
6dadfc4457 Revert "Enable -Wunused on torch targets (#150077)"
This reverts commit 688adc9941f855e78dd4d595682eea16317b7f54.

Reverted https://github.com/pytorch/pytorch/pull/150077 on behalf of https://github.com/wdvr due to failing internally with use of undeclared identifier ([comment](https://github.com/pytorch/pytorch/pull/150077#issuecomment-2846499828))
2025-05-02 06:53:20 +00:00
3731b70b40 [inductor][invoke_subgraph] Remove assertion checks for outputs of invoke_subgraph (#152384)
For invoke_subgraph, input assertions are good. We don't need output assertions. This is the tlparse

Before
![image](https://github.com/user-attachments/assets/4ae14530-3314-4dfa-9297-58f9e3ee4b9c)

After
![image](https://github.com/user-attachments/assets/c1457687-2396-49a7-986b-ef6145fcbf46)

https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmppQg3F8/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152384
Approved by: https://github.com/eellison, https://github.com/zou3519
ghstack dependencies: #152547, #152581
2025-05-02 06:46:05 +00:00
9e3fc41060 [invoke_subgraph] rename identifiers to prevent python mangling (#152581)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152581
Approved by: https://github.com/BoyuanFeng, https://github.com/zou3519
ghstack dependencies: #152547
2025-05-02 06:46:05 +00:00
4f9f1abd6d Revert "Use swap_tensors path in nn.Module.to for all subclasses that override __torch_dispatch__ (#152539)"
This reverts commit 037343657edceb345001e4c0ff226a34ca4c6063.

Reverted https://github.com/pytorch/pytorch/pull/152539 on behalf of https://github.com/wdvr due to failing internal tests - discussed with author ([comment](https://github.com/pytorch/pytorch/pull/152539#issuecomment-2846484924))
2025-05-02 06:43:35 +00:00
d7961a1086 [SymmMem] Add all-to-all (#151498)
Add an all-to-all impl based on NVSHMEM's on-stream API `nvshmemx_alltoallmem_on_stream`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151498
Approved by: https://github.com/fegin, https://github.com/fduwjj
ghstack dependencies: #151261
2025-05-02 06:40:43 +00:00
7c3e679ddd Revert "[Inductor] Add decomposeK as an autotuning choice for mm (#150654)"
This reverts commit fdcfc6a61a2146c7c961073e029ead633113eb9a.

Reverted https://github.com/pytorch/pytorch/pull/150654 on behalf of https://github.com/wdvr due to Failing ROCM tests: inductor/test_subgraph_choice.py::TestSubgraphChoice::test_subgraph_decompose_k [GH job link](https://github.com/pytorch/pytorch/actions/runs/14786111108/job/41515742446) [HUD commit link](3c54e0c216) ([comment](https://github.com/pytorch/pytorch/pull/150654#issuecomment-2846470409))
2025-05-02 06:31:38 +00:00
4649fd17b0 [invoke_subgraph] Unpacked operands (#152547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152547
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2025-05-02 05:44:46 +00:00
e6989ceea9 Revert "[BE] Update numba versions (#152557)"
This reverts commit b5995cb67f8543f148b9216e140980e6844aadff.

Reverted https://github.com/pytorch/pytorch/pull/152557 on behalf of https://github.com/clee2000 due to test_unary_funcs failure seems real? [GH job link](https://github.com/pytorch/pytorch/actions/runs/14787082066/job/41518415014) [HUD commit link](b5995cb67f) ([comment](https://github.com/pytorch/pytorch/pull/152557#issuecomment-2846336004))
2025-05-02 05:22:17 +00:00
ac5de6d55a Remove unnecessary __STDC_FORMAT_MACROS macro (#152513)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152513
Approved by: https://github.com/cyyever, https://github.com/albanD
ghstack dependencies: #152512
2025-05-02 05:06:44 +00:00
d969e2ec33 [CUDAGraph Trees] support memory allocation on side stream (#152472)
I tried `beginAllocateToPool` instead of `_cuda_beginAllocateCurrentStreamToPool` and the error in #151199 does not happen any more.

However, this approach is unsafe for multithreading. When multiple run_eager calls happen concurrently, we expect their memory allocations to go to different mem_pools. Since beginAllocateToPool does not check the stream, these memory allocations may happen on the same mem_pool.

So, I use `_cuda_beginAllocateCurrentThreadToPool` to direct all memory allocation on the same thread to a given mem_pool. In particular, `_cuda_beginAllocateCurrentThreadToPool` records the launching thread id, and during runtime checks if the current thread id matches the launching thread id.
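
The thread check can be pictured with a small pure-Python analogue. `make_thread_filter` is a hypothetical name, and the real logic lives in the CUDA caching allocator rather than in Python:

```python
import threading


def make_thread_filter():
    """Record the launching thread and return a predicate that matches only it."""
    launching_thread = threading.get_ident()

    def should_route_to_pool() -> bool:
        # Only allocations made on the launching thread go to the pool.
        return threading.get_ident() == launching_thread

    return should_route_to_pool
```

Calling the returned predicate from the launching thread yields `True`, while any other thread that is alive at the same time gets `False`, so concurrent callers do not share a pool by accident.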

Fixes #151199

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152472
Approved by: https://github.com/eellison, https://github.com/ngimel
2025-05-02 04:26:35 +00:00
1f898657e6 [ez] fix grammar mistakes in StatefulSymbolicContext comment (#152598)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152598
Approved by: https://github.com/malfet
ghstack dependencies: #151407
2025-05-02 04:21:16 +00:00
36e5ff6bc4 [CP] Fix the offsets to KV in backward (#152625)
This is more semantically correct, even though we currently assume KV have the same lengths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152625
Approved by: https://github.com/XilunWu
2025-05-02 03:30:11 +00:00
1fef3cdabc [cutlass backend] Minor lru_cache to slightly speed up filtering ops (#152577)
At the default level, filtering went from 0.11332 seconds to 0.10064 seconds.

You can't really apply lru_cache too aggressively. For example, hashing a cutlass op takes a long time.

Removing a log further brings it down to 0.07202 seconds.
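
To illustrate the trade-off, here is a minimal sketch that caches on a cheap hashable key (a string) instead of on an object that is expensive to hash. `is_supported_layout` and the layout names are made up for the example and are not part of the cutlass backend:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def is_supported_layout(layout_name: str) -> bool:
    """Cheap-to-hash argument, so a cache lookup costs less than recomputing."""
    return layout_name in {"RowMajor", "ColumnMajor"}
```

`lru_cache` hashes its arguments on every call, so it only pays off when hashing the argument is cheaper than the work being skipped.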

Differential Revision: [D73971021](https://our.internmc.facebook.com/intern/diff/D73971021/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152577
Approved by: https://github.com/chenyang78
2025-05-02 02:17:50 +00:00
5b5938929f [refactor] refactor dense implementation of auto_functionalized_v2 for better clarity (#152248)
Abstracts away two helper functions (get_mutable_args_from_schema and _generate_new_op_kwargs_from_bases) to make the code better organized and more re-usable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152248
Approved by: https://github.com/zou3519
ghstack dependencies: #152072, #152073, #152244, #152245, #152246, #152247
2025-05-02 02:08:06 +00:00
380327c663 [hop] make materialize_as_graph's include and exclude dispatch key set optional (#152247)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152247
Approved by: https://github.com/zou3519
ghstack dependencies: #152072, #152073, #152244, #152245, #152246
2025-05-02 02:08:06 +00:00
a776a566db [hop][schema] allow adding kw_only info to schema argument (#152246)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152246
Approved by: https://github.com/zou3519
ghstack dependencies: #152072, #152073, #152244, #152245
2025-05-02 02:08:06 +00:00
7e7b9ca18f [hop][be] make check_input_alias_and_mutation_return_ouputs create new fake mode (#152245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152245
Approved by: https://github.com/zou3519
ghstack dependencies: #152072, #152073, #152244
2025-05-02 02:08:06 +00:00
b5995cb67f [BE] Update numba versions (#152557)
Let's see if PyTorch is compatible with latest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152557
Approved by: https://github.com/Skylion007
2025-05-02 01:51:30 +00:00
cyy
ce94b212c7 [Environment Variable][Rebase] Use thread-safe getenv functions (#140200)
Use our thread-safe getenv wrappers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140200
Approved by: https://github.com/kwen2501, https://github.com/eqy
2025-05-02 00:41:49 +00:00
a5dd7011a0 [ONNX] Delete JitTraceConvertStrategy (#152556)
Fixes #151703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152556
Approved by: https://github.com/justinchuby
2025-05-02 00:26:43 +00:00
3c54e0c216 [inductor] if unbacked symint in old-size or new-size skip mark_reuse check (#152379)
We can probably make the `mark_reuse` check work with unbacked sizes under certain conditions,
e.g. `x.repeat(u0, 2).repeat(2, u0)`.

But I think cases like those are rare so skipping the check for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152379
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/jingsh
2025-05-02 00:24:58 +00:00
fdcfc6a61a [Inductor] Add decomposeK as an autotuning choice for mm (#150654)
As a result of adding subgraph as a choice to inductor https://github.com/pytorch/pytorch/pull/149761 and enabling FP32 output from PyTorch GEMMs from FP16/BF16 inputs: https://github.com/pytorch/pytorch/pull/150812, this PR enables decompose_k as an autotuning choice for Inductor in generating the fastest matmuls with Triton. DecomposeK is currently only enabled for `torch.compile`.

Followups:
* decompose_k does not currently support epilogue fusion, which will take some work to enable
* Enable autotuning the bmm with Triton Templates as well without requiring tons of more compile time, async compilation. Anecdotal evidence shows that Triton BMM performs better usually than aten BMM
* Add for addmm
* Enable for Inference and AOTI

Below are the results of running TritonBench for Split-K shapes, comparing the aten performance versus pt2_triton, which now autotunes on decompose_k, seeing >10% speedup compared to aten on average, and for some shapes over 3x the performance of the best Triton mm previously:

<img width="929" alt="Screenshot 2025-04-28 at 9 15 39 PM" src="https://github.com/user-attachments/assets/27d85bbc-4f3a-43a6-a8fa-d4a5bbb8c999" />

TorchInductor Benchmark Dashboard:
<img width="1727" alt="Screenshot 2025-04-30 at 2 02 53 PM" src="https://github.com/user-attachments/assets/4acd7ffc-407f-4cfd-98bb-2e3d8b1f00b3" />

We see speedups across all runs for training. Compile time increased as expected, with more `mm` options to tune over.

Differential Revision: [D73820115](https://our.internmc.facebook.com/intern/diff/D73820115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150654
Approved by: https://github.com/eellison
2025-05-01 23:01:30 +00:00
64957db6c9 Fix some inductor periodic benchmarks (#152605)
Some were reporting "pass" consistently on https://hud.pytorch.org/
Those are fine to flip.

I filed a separate issue for the now-regressions for AOTI:
https://github.com/pytorch/pytorch/issues/152606. These should be looked
at.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152605
Approved by: https://github.com/eellison, https://github.com/huydhn
2025-05-01 22:18:30 +00:00
7aebb127bf [dynamo][ca] support dynamic annotations on tensors in ListVariables/TupleVariables (#152119)
Together with https://github.com/pytorch/pytorch/pull/151962, FIXES https://github.com/pytorch/pytorch/issues/133575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152119
Approved by: https://github.com/jansel
ghstack dependencies: #149707, #151860, #151731, #151962
2025-05-01 21:59:55 +00:00
4555ed8c83 [ca] hide unused scalar int sizes from dynamo (#151962)
together with https://github.com/pytorch/pytorch/pull/151731, FIXES https://github.com/pytorch/pytorch/issues/113129 https://github.com/pytorch/pytorch/issues/146168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151962
Approved by: https://github.com/jansel
ghstack dependencies: #149707, #151860, #151731
2025-05-01 21:59:55 +00:00
18229a5300 [ca] mark scalar int sizes as dynamic via tensor wrapping (#151731)
This is the only way to support dynamic shapes on scalars right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151731
Approved by: https://github.com/jansel
ghstack dependencies: #149707, #151860
2025-05-01 21:59:49 +00:00
613bd46272 [aot][ca] save bw_module in AOTAutogradCache (#151860)
Compiled Autograd retraces AOT's bw_module at backward runtime into a larger graph, and today this runs into an issue on warm cache runs because the bw_module is not restored. This PR adds it to the cache, by first stripping it bare from unserializable metadata. I also intentionally differentiate the cached and non-cached versions to avoid accidental attempts of AOT compilation with a restored bw_module (would probably crash).

Note that since the cache entry may be used by runs that use compiled autograd and runs that do not, we need to cache both the lowered backward and the bw_module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151860
Approved by: https://github.com/jamesjwu
ghstack dependencies: #149707
2025-05-01 21:59:43 +00:00
c461ba6522 [aot] mark dynamic activations as maybe dynamic (#149707)
Today, we mark graph outputs as maybe dynamic, which lets a compilation communicate to future compilations whether certain graph inputs are dynamic. Similarly, we can do this to saved activations, which may be used in future compilations as well. This is especially prevalent in compiled autograd, where tensor activations will always become graph inputs.

Changes to the tests were mainly cosmetic, with the exception of tests that relied on duck shaping. By annotating tensor dims, we prevent them from reusing pre-existing symbols, so this change will make graphs use duck shapes less than before, which affects some of the caching tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149707
Approved by: https://github.com/bdhirsh
2025-05-01 21:59:36 +00:00
b6c5886d09 BE: Swap functorch --> torch._higher_order_ops (#152620)
Summary: Discovered when attempting to resolve arvr builds, should resolve issues around utilizing functorch through export.

Test Plan:
```
buck2 test arvr/mode/linux/opt //arvr/libraries/xrrp/ml/python/test:convert_to_etvk_test
```

Differential Revision: D74013898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152620
Approved by: https://github.com/zou3519
2025-05-01 21:53:23 +00:00
1c04ea4e59 Revert "[torchgen] Refactor torchgen.utils.FileManager to accept pathlib.Path (#150726)"
This reverts commit 4b5b1adb21f5d7d66945d78a1f89d2f9d86f15bb.

Reverted https://github.com/pytorch/pytorch/pull/150726 on behalf of https://github.com/malfet due to This breaks Windows builds, see a765e2ddda/1 ([comment](https://github.com/pytorch/pytorch/pull/150726#issuecomment-2845858846))
2025-05-01 21:52:35 +00:00
a765e2ddda [nativert] port enumerate from folly to c10::utill (#152481)
Summary:
nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed.

This diff ports an enumeration util from folly into c10.

Test Plan: CI

Differential Revision: D73881042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152481
Approved by: https://github.com/Skylion007, https://github.com/zhxchen17, https://github.com/cyyever
2025-05-01 21:41:05 +00:00
24b315676d [MPS][BE] Migrate lerp.Scalar.out to tensor iterator (#152514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152514
Approved by: https://github.com/kulinseth, https://github.com/Skylion007, https://github.com/dcci
2025-05-01 20:11:55 +00:00
f1d636f85b [BE] detect CXX pytree requirement with TorchVersion (#151102)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151102
Approved by: https://github.com/zou3519
2025-05-01 18:55:57 +00:00
8cb6957e01 [export] Ignore None buffers (#152571)
Fixes https://github.com/pytorch/pytorch/issues/152467
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152571
Approved by: https://github.com/yiming0416, https://github.com/yushangdi
2025-05-01 18:18:16 +00:00
037343657e Use swap_tensors path in nn.Module.to for all subclasses that override __torch_dispatch__ (#152539)
Fixes https://github.com/pytorch/pytorch/issues/148977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152539
Approved by: https://github.com/albanD
2025-05-01 18:04:33 +00:00
4b5b1adb21 [torchgen] Refactor torchgen.utils.FileManager to accept pathlib.Path (#150726)
This PR allows `FileManager` to accept `pathlib.Path` as arguments while keeping the original `str` path support.

This allows us to simplify the code such as:

1. `os.path.join(..., ...)` with `Path.__truediv__(...)` (the `/` operator).

95a5958db4/torchgen/utils.py (L155)

95a5958db4/torchgen/utils.py (L176)

2. `os.path.basename(...)` with `Path(...).name`.
 95a5958db4/torchgen/utils.py (L161)

3. Manual file extension split with `Path(...).with_stem(new_stem)`

95a5958db4/torchgen/utils.py (L241-L256)
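
The three substitutions can be sketched with standard-library calls only. The file names below are placeholders and are not taken from `torchgen`; `Path.with_stem` requires Python 3.9 or newer:

```python
import os
from pathlib import Path

# 1. Joining: the `/` operator on Path replaces os.path.join.
joined = Path("build") / "aten"
joined_with_os = os.path.join("build", "aten")

# 2. Base name: Path(...).name replaces os.path.basename.
base = Path("build/aten/Functions.cpp").name
base_with_os = os.path.basename("build/aten/Functions.cpp")

# 3. Changing the stem while keeping the extension.
renamed = Path("Operators.cpp").with_stem("Operators_0")
```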

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150726
Approved by: https://github.com/zou3519
2025-05-01 17:43:16 +00:00
83acb688bb Fix constant folding cloning constants (#152273)
Summary:
Bug fix for #135060
Simple review:
https://github.com/pytorch/pytorch/pull/135060/files#diff-f23386709ff7e1235b15e18f835a48e5124e0ddd596aeb33c201daad1abbedd7R357
We mistakenly typed get_attr as getattr.

As a result, constants never get untagged, and all constants are
cloned twice, which greatly increases memory consumption.
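
A toy reproduction of why the typo silently matches nothing. FX nodes that fetch attributes use the op string `"get_attr"`; the dict-based nodes and `select_constant_nodes` below are stand-ins for illustration, not the constant-folding code:

```python
def select_constant_nodes(nodes, op_name):
    """Return the names of nodes whose op string equals op_name."""
    return [n["name"] for n in nodes if n["op"] == op_name]


nodes = [
    {"op": "get_attr", "name": "weight"},
    {"op": "call_function", "name": "add"},
]

with_typo = select_constant_nodes(nodes, "getattr")  # matches no node
fixed = select_constant_nodes(nodes, "get_attr")     # matches the constant
```

Because the comparison is on plain strings, the misspelled op name raises no error; the filter just comes back empty.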

Test Plan:
python test/inductor/test_aot_inductor.py -k test_empty_constant_folding


Pull Request resolved: https://github.com/pytorch/pytorch/pull/152273
Approved by: https://github.com/trieuat, https://github.com/zhxchen17
2025-05-01 17:34:39 +00:00
563a91b144 [cutlass backend] Move cutlass compiled cache to cache_dir (#151825)
Moved "compiled_cache.db" to cache folder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151825
Approved by: https://github.com/mlazos
2025-05-01 17:26:01 +00:00
1845df05c6 [inductor][BE] Add more debug logs for why fx graph cache doesn't happen (#152487)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152487
Approved by: https://github.com/Skylion007, https://github.com/eellison
2025-05-01 17:25:28 +00:00
f0c9b3385d Support more dtypes for input, indices in gather (#151822)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151822
Approved by: https://github.com/ngimel
2025-05-01 16:35:23 +00:00
4c8dee7986 Revert "[inductor][invoke_subgraph] Remove assertion checks for outputs of invoke_subgraph (#152384)"
This reverts commit c87c823de43b7815c523160778b682973e151794.

Reverted https://github.com/pytorch/pytorch/pull/152384 on behalf of https://github.com/malfet due to Broke CI, see 52cbcac640/1 ([comment](https://github.com/pytorch/pytorch/pull/152384#issuecomment-2845099985))
2025-05-01 15:46:08 +00:00
f7b60456cc Revert "[inductor][subgraph] Simplify the resulting output code for subgraph (#152383)"
This reverts commit 98eb7c8cb1abafaff4e28b07ed91cababc2ce54a.

Reverted https://github.com/pytorch/pytorch/pull/152383 on behalf of https://github.com/malfet due to Broke CI, see 52cbcac640/1 ([comment](https://github.com/pytorch/pytorch/pull/152384#issuecomment-2845099985))
2025-05-01 15:46:08 +00:00
2f1800bc3d Revert "[invoke_subgraph] Simplify output code for subgraph output node (#152490)"
This reverts commit 5fe335810af0df48f473387b6f9efcd5dbff4d4a.

Reverted https://github.com/pytorch/pytorch/pull/152490 on behalf of https://github.com/malfet due to Broke CI, see 52cbcac640/1 ([comment](https://github.com/pytorch/pytorch/pull/152384#issuecomment-2845099985))
2025-05-01 15:46:07 +00:00
2fa39e60ed Revert "[inductor][invoke_subgraph] Free the buffers before the subgraph call (#152494)"
This reverts commit 5236a8506c4f2fcce6d8a7f945808d84e6c46784.

Reverted https://github.com/pytorch/pytorch/pull/152494 on behalf of https://github.com/malfet due to Broke CI, see 52cbcac640/1 ([comment](https://github.com/pytorch/pytorch/pull/152384#issuecomment-2845099985))
2025-05-01 15:46:07 +00:00
52cbcac640 [BE] Migrate all add/sub ops to Metal kernels (#152510)
As the typecasting harness should take care of all permutations
Fix bug in `exec_binary_kernel` where it was not properly downcasting CPU double/complexDouble scalars to floats

Fixes https://github.com/pytorch/pytorch/issues/152582
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152510
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/cyyever
ghstack dependencies: #152443, #152466, #152479, #152504, #152485
2025-05-01 15:35:57 +00:00
e82dc0769c Respect checkpointed boundaries when using knapsack formulation in the partitioner (#141684)
When multiple checkpoint regions are back-to-back with no operations in-between, we enforce the operation at the boundary to be force-saved, see 7ea0da2d57/torch/_functorch/partitioners.py (L772-L807)

When using the `memory_budget` formulation on a graph which already has AC inside, we should respect the boundaries of the AC decision (which is set to `MUST_SAVE`), and thus ban those nodes from possible recomputation.

Adding tests would be nice, but not sure what's the best way to test this right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141684
Approved by: https://github.com/bdhirsh
2025-05-01 15:28:41 +00:00
41de0f2eaf removing short-perf-test-cpu.sh and short-perf-test-gpu.sh (#152551)
When working on #148342 I realised that there is no reference from those files. So seems they are stale and can be safely removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152551
Approved by: https://github.com/atalman, https://github.com/xuzhao9
2025-05-01 15:09:55 +00:00
6f6acb4128 [AOTI][CPU] Introduce config.cpp.use_decompose_tanh (#152542)
Summary: Previously, D70489427 changed the tanh implementation to `.tanh()`, which is causing a perf regression in some Meta-internal workloads. This diff introduces a config so we can set it as needed.

Differential Revision: D73909371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152542
Approved by: https://github.com/desertfire
2025-05-01 10:25:31 +00:00
7c63ddd817 [Inductor] Wrapper code refactors to prepare for FX codegen (#152391)
This PR contains some refactors from https://github.com/pytorch/pytorch/pull/146942, which help to enable Wrapper FX codegen:
1. Remove `OutputLine`, which is unused.
2. Add an attribute to the backend classes specifying whether they support caching.
3. Before compiling a graph, query the registered backends and check whether caching is supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152391
Approved by: https://github.com/jansel
2025-05-01 09:14:55 +00:00
701c0848b8 [dynamic shapes] aten.constant_pad_nd meta impl (#152129)
We know the output shape, and we know this always produces a clone. Avoids data-dependent errors from the decomposition.

along with https://github.com/pytorch/pytorch/pull/150483, should fix https://github.com/pytorch/pytorch/issues/123855
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152129
Approved by: https://github.com/laithsakka
2025-05-01 08:32:10 +00:00
53bf174626 Fix assertion in reorder_communication_preserving_peak_memory (#152565)
>=0 is practically correct because we do model the runtime of some ops as 0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152565
Approved by: https://github.com/eellison
2025-05-01 06:40:04 +00:00
47972f9092 [export] warn when Dim.AUTO 0/1 specializes (#151827)
Fixes #151582

example warning for Dim.AUTO:
```
torch/_export/non_strict_utils.py:499] dimension inputs['x'].shape[1] 0/1 specialized; Dim.AUTO was specified along with a sample input with hint = 1.
```

example error when Dim.DYNAMIC specializes:
```
- Received user-specified dim hint Dim.DYNAMIC(min=None, max=None), but export 0/1 specialized due to hint of 0 for dimension inputs['x'].shape[0].
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151827
Approved by: https://github.com/angelayi
2025-05-01 06:00:51 +00:00
a7f1ddc184 [SymmMem] Experimental NVSHMEM integration (#151261)
Adding NVSHMEM as a backend for `SymmetricMemory`, implementation of which is in `NVSHMEMSymmetricMemory.cu`.

Moving some helper functions in `CUDASymmetricMemory.cu` to `CUDASymmetricMemoryUtils.cpp`, so that they can be shared by `NVSHMEMSymmetricMemory`. These functions are mostly side-band exchange helpers (`store_all_gather`, `IpcChannel`, etc).

Adding `TORCH_SYMMEM` to control which implementation to use for CUDA tensors; currently supported values: `CUDA` (in-house impl), `NVSHMEM`.

The NVSHMEM feature is gated by a build-time flag: `USE_NVSHMEM=1`. The `NVSHMEM_HOME` setting is also required (TODO).

Ported most code from #146593.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151261
Approved by: https://github.com/fegin, https://github.com/fduwjj
2025-05-01 05:24:50 +00:00
13add553b2 [HOP][be] make supports_input_mutation and aliasing a class field (#152244)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152244
Approved by: https://github.com/zou3519
ghstack dependencies: #152072, #152073
2025-05-01 05:22:02 +00:00
447f8241f5 [export][function schema] support exporting hop with function schema argument (#152073)
We need to make function schema proxyable to trace the auto_functionalized hop that takes function schema as inputs. The implementation basically follows how we support torchbind objects:

1. upon seeing an untracked function schema arg, we create a constant get_attr node
2. we track the function schema argument in export to support lift/unlift.
3. we need to support serde for function schema. We'll add support for this in follow-up PRs.

However, compared with torchbind object:
1. we don't need a dynamo implementation, because the function schema is added when we auto_functionalize a hop to the argument of auto_functionalized. One potential use case is users re-tracing an exported program with strict mode. Since non-strict is the default now, we don't see a use case yet.
2. we don't need an inductor implementation, because the function schema will go away after auto_functionalized re-inplacing pass.

edit: we greatly simplified (and generalized) the implementation following @zou3519's suggestion of using pytree.register_constant

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152073
Approved by: https://github.com/zou3519
ghstack dependencies: #152072
2025-05-01 05:22:02 +00:00
500bf50129 [export][be] better type annotation for lift_constants_pass (#152072)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152072
Approved by: https://github.com/zou3519
2025-05-01 05:22:02 +00:00
d96193f622 [Inductor] Fix int check again (#152576)
Made an oss change to a diff train diff

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152576
Approved by: https://github.com/wdvr
2025-05-01 05:19:40 +00:00
18588fe2fc Fix GuardOnDataDependentSymNode in the normalize operator (#152039)
Test Plan:
Dumped the local net torch.package to local

Ran
```
buck2 run scripts/shengqin:test_model_export -- /tmp/mtia_local_torch_package {\"local\":null}
```
succeeded

Reviewed By: hongyang-zhao

Differential Revision: D73405271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152039
Approved by: https://github.com/houseroad
2025-05-01 04:34:49 +00:00
cyy
688adc9941 Enable -Wunused on torch targets (#150077)
For GCC, ``-Wunused`` contains:
```
-Wunused-function
Warn whenever a static function is declared but not defined or a non\-inline static function is unused.

-Wunused-label
Warn whenever a label is declared but not used.
To suppress this warning use the unused attribute.

-Wunused-parameter
Warn whenever a function parameter is unused aside from its declaration.
To suppress this warning use the unused attribute.

-Wunused-variable
Warn whenever a local variable or non-constant static variable is unused aside from its declaration
To suppress this warning use the unused attribute.
```
For Clang, some of the diagnostics controlled by ``-Wunused`` are enabled by default:
```
Controls [-Wunused-argument](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-argument),
[-Wunused-but-set-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-but-set-variable),
[-Wunused-function](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-function),
[-Wunused-label](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-label), [-Wunused-lambda-capture](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-lambda-capture),
[-Wunused-local-typedef](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-local-typedef),
[-Wunused-private-field](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-private-field),
[-Wunused-property-ivar](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-property-ivar),
[-Wunused-value](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-value), [-Wunused-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-variable).
```
These checks are all useful. This PR aims to enable ``-Wunused`` without breaking code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150077
Approved by: https://github.com/zou3519
2025-05-01 04:09:06 +00:00
15a3f58f91 Return ConstantVariable(None) from WithExitFunctionVariable.exit to prevent NoneType crash inside autocast exception path (#152503)
Copy of #152013 with PR time benchmarks updated (regressions seem unrelated)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152503
Approved by: https://github.com/anijain2305, https://github.com/Skylion007

Co-authored-by: Witold Dziurdz <wdziurdz@habana.ai>
2025-05-01 04:01:24 +00:00
632b89af43 [dynamic shapes] support SymInt inputs for kthvalue (#152151)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152151
Approved by: https://github.com/tugsbayasgalan, https://github.com/malfet
2025-05-01 03:47:23 +00:00
56d6d4dafe [PT2] Port replace_lce_with_matmul / replace_first_lce_with_fused_matmul_lce to PT2 pre_grad passes (#152450) (#152536)
Summary:

Same with D71358949, but removing newly added log to avoid test failures.

Port over replace_lce_with_matmul and replace_first_lce_with_fused_matmul_lce to PT2 pre_grad pass.
Original dper pass diffs: D67884534, D68123479, D68384238

Test Plan:
Test 1. Covers replace_lce_with_matmul and case 1 of replace_first_lce_with_fused_matmul_lce
```
CUDA_VISIBLE_DEVICES=6 TORCH_LOGS=+inductor,aot TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt-split-dwarf   mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/669809193/0/gpu_lowering/input.predictor.disagg.gpu.merge  --lower-backend="AOT_INDUCTOR" --add_passes="use_matmul_fuse_lce_replace_first_LCE,use_contiguous_linear_reduction_replace_linear_reduction" --batch-size=3072 --gpu-trace --disable_acc_tracer=true 2>&1 | tee ~/logs/disable_acc_tracer/aoti_cmf_ctr_triton_669809193_0_diable_acc.log
```
Log: P1798246938

Test 2. Covers replace_lce_with_matmul and case 2 of replace_first_lce_with_fused_matmul_lce
```
CUDA_VISIBLE_DEVICES=7 TORCH_LOGS=+inductor,aot TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt-split-dwarf   mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/677734158/9/gpu_lowering/input.predictor.disagg.gpu.merge  --lower-backend="AOT_INDUCTOR" --add_passes="use_matmul_fuse_lce_replace_first_LCE,use_matmul_lce_replace_normal_LCE" --batch-size=3072 --gpu-trace --disable_acc_tracer=true 2>&1 | tee ~/logs/disable_acc_tracer/aoti_cmf_ctr_triton_677734158_9_diable_acc.log
```
Log: P1798246675

Seeing logs like
`[Pre grad(predispatch IR)] Apply use_matmul_fuse_lce_replace_first_LCE pass, save before/after graph to /tmp/tmp8lyzoh79, graph before/after are the same = False`

Differential Revision: D73934142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152536
Approved by: https://github.com/wdvr
2025-05-01 03:14:04 +00:00
5236a8506c [inductor][invoke_subgraph] Free the buffers before the subgraph call (#152494)
Before
![image](https://github.com/user-attachments/assets/62b24c14-69e6-40fb-94e3-223930132ef6)

After
![image](https://github.com/user-attachments/assets/9f340d4e-80a9-45aa-9400-626fff5b5ecd)

tlparse - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmph5dwWt/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152494
Approved by: https://github.com/Skylion007, https://github.com/eellison
ghstack dependencies: #152357, #152384, #152383, #152490
2025-05-01 02:04:10 +00:00
5fe335810a [invoke_subgraph] Simplify output code for subgraph output node (#152490)
Before - [manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmppQg3F8/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmppQg3F8/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000)
![image](https://github.com/user-attachments/assets/8fecdc23-eb78-4e15-9d03-c4bae4b49434)

After fix - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp9a5EM0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
![image](https://github.com/user-attachments/assets/8e98120c-d82e-42dc-bc50-a6bfd4f9923c)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152490
Approved by: https://github.com/eellison
ghstack dependencies: #152357, #152384, #152383
2025-05-01 02:04:10 +00:00
98eb7c8cb1 [inductor][subgraph] Simplify the resulting output code for subgraph (#152383)
Check out output code

Before this PR - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp3iXDVs/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
![image](https://github.com/user-attachments/assets/ef86eb8f-e8b9-47dd-8609-f90481f018b8)

After this PR - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpRgUJvq/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

![image](https://github.com/user-attachments/assets/10e22c60-7fb9-4519-9d54-019beff5333b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152383
Approved by: https://github.com/eellison
ghstack dependencies: #152357, #152384
2025-05-01 02:04:10 +00:00
c87c823de4 [inductor][invoke_subgraph] Remove assertion checks for outputs of invoke_subgraph (#152384)
For invoke_subgraph, input assertions are good. We don't need output assertions. This is the tlparse

Before
![image](https://github.com/user-attachments/assets/4ae14530-3314-4dfa-9297-58f9e3ee4b9c)

After
![image](https://github.com/user-attachments/assets/c1457687-2396-49a7-986b-ef6145fcbf46)

https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmppQg3F8/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152384
Approved by: https://github.com/eellison, https://github.com/zou3519
ghstack dependencies: #152357
2025-05-01 02:04:10 +00:00
3849fd13de 🐛 Add ciflow/pull🦋 (#152567)
To make it easier to work around GitHub reliability issues, when it sometimes fails to schedule `on: pull_request` workflows

See https://github.com/pytorch/pytorch/issues/151322

But alas, it does not fix the problem at hand...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152567
Approved by: https://github.com/clee2000, https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/Camyll, https://github.com/atalman
2025-05-01 02:00:51 +00:00
0b8822e70b [export] set is_exporting() for strict (#151833)
Helpful for upcoming work in figuring when to use stack trace in prettifying dynamic shapes errors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151833
Approved by: https://github.com/angelayi
2025-05-01 02:00:19 +00:00
f2cc07d202 [cutlass backend] Add addmm dynamic support (#152498)
Differential Revision: [D73893133](https://our.internmc.facebook.com/intern/diff/D73893133/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152498
Approved by: https://github.com/ColinPeppler
2025-05-01 01:40:08 +00:00
fe1deeb701 [BE] Replace func_name with __func__ (#152553)
Summary: Not sure why one needs to preserve the name by hand

Test Plan: CI

Differential Revision: D73941209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152553
Approved by: https://github.com/wdvr
2025-05-01 01:26:49 +00:00
0d2746092b [ez][export] suggest torch._checks only for booleans (#152499)
We were doing this when the error was coming from int/float casts, suggesting fixes like `torch._check(zuf0), torch._check(~zuf0)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152499
Approved by: https://github.com/angelayi
2025-05-01 01:24:46 +00:00
be1adcae32 add split sizes info dump for uneven all2all bw calculation (#151438)
Add split sizes info to dumped execution trace and kineto trace for bw calculation of uneven all2all.

Take the input data from the case below as an example: although we know the input size of Rank-0 is 50 elements, the actual data size that Rank-0 sends out is (12+13+14)=39 elements. Rank-0 doesn't send the 1st chunk of 11 elements to peers. But we don't know this information now, because the "in split size" field is empty.
![image](https://github.com/user-attachments/assets/7240f334-2081-409b-bbe0-a8396ffa2d30)
![image](https://github.com/user-attachments/assets/679fc49f-e34f-4a74-bad0-fb6fa9d18239)
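
The arithmetic in the example above can be sketched as follows. The helper name `elements_sent_to_peers` is hypothetical, and the split sizes `[11, 12, 13, 14]` are taken from the numbers quoted in the description; this is not the trace-processing code itself.

```python
def elements_sent_to_peers(in_split_sizes, rank):
    """Elements a rank actually sends over the wire in an uneven all2all.

    The chunk addressed to the rank itself stays local, so it is excluded.
    """
    if not 0 <= rank < len(in_split_sizes):
        raise ValueError("rank must index into in_split_sizes")
    return sum(in_split_sizes) - in_split_sizes[rank]


in_split_sizes = [11, 12, 13, 14]      # Rank-0 input: 50 elements in total
total_elements = sum(in_split_sizes)
sent_by_rank0 = elements_sent_to_peers(in_split_sizes, rank=0)
```

Without the split sizes in the trace, only `total_elements` is visible, which overstates the traffic used in a bandwidth calculation.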

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151438
Approved by: https://github.com/shengfukevin, https://github.com/kwen2501
2025-05-01 01:19:20 +00:00
eqy
7abca8ceba Decorate test_host_memory_stats with @serialTest (#152454)
Seems to need it as it is expecting only its allocation behavior to be visible, to address #152422
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152454
Approved by: https://github.com/Skylion007
2025-05-01 00:53:20 +00:00
5521e6b671 [export] support SymInt minlength for torch.bincount() (#152497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152497
Approved by: https://github.com/angelayi
2025-05-01 00:45:58 +00:00
ad9e209ea3 Change test/inductor/test_standalone_compile to test/inductor/test_compile (#152103)
These are the tests for torch._inductor.compile, so I renamed the file
test_compile. This is to avoid confusion with
torch._inductor.standalone_compile, which is now a lot more standalone
than torch._inductor.compile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152103
Approved by: https://github.com/oulgen
2025-05-01 00:44:02 +00:00
8136e0d3b7 Expose NCCL communicator from ProcessGroupNCCL via an unsafe API (#152496)
Differential Revision: D73892691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152496
Approved by: https://github.com/ngimel
2025-04-30 23:51:34 +00:00
f2a89b802d [invoke_subgraph] Cache on tangent metadata and retrace if needed (#152357)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152357
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2025-04-30 23:49:17 +00:00
b6f8209f54 Remove redundant line in partitioner (#152517)
Summary: This is a cleanup from https://github.com/pytorch/pytorch/pull/152264, which contained a line which was a vestige from a previous implementation.

Test Plan: Let CI run

Differential Revision: D73904636

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152517
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
2025-04-30 23:17:30 +00:00
56039b5778 Revert "[CUDAGraph Trees] support memory allocation on side stream (#152472)"
This reverts commit c620763ec2be83e37f9b31ad6663c6e82d6c0ab0.

Reverted https://github.com/pytorch/pytorch/pull/152472 on behalf of https://github.com/BoyuanFeng due to should use tid instead pid ([comment](https://github.com/pytorch/pytorch/pull/152472#issuecomment-2843491656))
2025-04-30 22:18:10 +00:00
361bf056a7 [nativert] Add moodycamel/concurrentqueue as third-party dependency (#152033)
nativert RFC:  https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

moodycamel/concurrentqueue is a high-performance, single-header MPMC queue implementation. We want to add this to third_party to be used with the upcoming Torch Native Runtime.

The source code is imported from commit hash 2f09da73d22a47dc8a89cdd4fc4c3bfae07f4284 from https://github.com/cameron314/concurrentqueue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152033
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-04-30 21:37:20 +00:00
49a72011cc Revert "[inductor][BE] Add more debug logs for why fx graph cache doesn't happen (#152487)"
This reverts commit 76331657d21e4bebd8f3c00ceed5369ae8b64112.

Reverted https://github.com/pytorch/pytorch/pull/152487 on behalf of https://github.com/malfet due to And it broke those tests, not sure why signal was ignored ([comment](https://github.com/pytorch/pytorch/pull/152487#issuecomment-2843333471))
2025-04-30 21:35:17 +00:00
3f10091d3c Clean up conda usage in benchmark scripts (#152552)
Fixes https://github.com/pytorch/pytorch/issues/152123.

* Switch `benchmarks/dynamo/Makefile` to use uv.  Note that these scripts are only used locally, so it's kind of ok to keep conda here IMO.  But switching to uv is probably nicer to most folks.
* Delete some files that are outdated and not used anymore

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152552
Approved by: https://github.com/atalman, https://github.com/albanD
2025-04-30 21:27:29 +00:00
5a66c1d921 [nativert] Add utility function to convert strings into numbers. (#151467)
Summary:

nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed.

This diff adds a small library to convert strings into numbers which will later be used for parsing graph IR.

Differential Revision: D73133034

## Test Plan

c10 unittests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151467
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-04-30 21:20:52 +00:00
22ecaeb145 [standalone_compile] fix dynamic shapes with config_patches (#152462)
compile_fx with config_patches goes down another path where we need to
propagate the kwarg...

Test Plan:
- updated test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152462
Approved by: https://github.com/oulgen
2025-04-30 21:02:14 +00:00
eqy
ce317cd5a8 [CUDA][SDPA] bump fudge factor in test_sdpa in test_nestedtensor (#152235)
Small mismatches on e.g., 4090, A6000/A40

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152235
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/jbschlosser
2025-04-30 20:24:49 +00:00
55c539428f [inductor][BE] cleanup and improve precompilation loggings (#152483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152483
Approved by: https://github.com/chenyang78, https://github.com/jingsh
2025-04-30 20:21:55 +00:00
76331657d2 [inductor][BE] Add more debug logs for why fx graph cache doesn't happen (#152487)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152487
Approved by: https://github.com/Skylion007, https://github.com/eellison
2025-04-30 20:05:21 +00:00
adebb8b112 set thread_work_size to 4 for unrolled kernel (#152396)
Previous PRs enabling 8-vectorization inadvertently regressed unrolled kernel perf.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152396
Approved by: https://github.com/BoyuanFeng, https://github.com/msaroufim, https://github.com/malfet, https://github.com/Aidyn-A, https://github.com/atalman
2025-04-30 19:53:58 +00:00
c4a0b31c1d Update CODEOWNERS (torch/utils/data/) (#152482)
Updating codeowners for dataloading

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152482
Approved by: https://github.com/ramanishsingh, https://github.com/janeyx99
2025-04-30 19:24:56 +00:00
eqy
1bb13a16bb [CUDA][SDPA] Bump python fused_attention_vs_math_ref_grads fudge_factor for sm120 (#152491)
🍦

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152491
Approved by: https://github.com/Skylion007
2025-04-30 19:22:21 +00:00
7a3cae4b20 Configurable logging for cpp_extensions.py (#152260)
Today `cpp_extensions` makes heavy use of printing to stderr. This makes our life harder in KernelBot, where we typically rely on stderr to surface only real errors; instead, today cpp_extensions uses stderr for updates that could be qualified as INFO, WARNING, or ERROR.

Now instead we'll recommend users of our cpp extension system to do something like

```python
import logging
cpp_ext_logger = logging.getLogger("torch.utils.cpp_extension")
cpp_ext_logger.setLevel(logging.WARNING)
```

While this dramatically reduces log spew, it can be viewed as a BC breaking change if people were relying on certain strings being present in stdout or stderr

Considering different teams might want to silence errors differently, this PR proposes replacing all `print()` statements with `logging` statements with the same heuristics that the python logging module recommends
1. DEBUG: For things like detailed compilation steps or reading filepaths - by default gets logged on stdout
2. INFO: Build progress - by default gets logged on stdout
3. WARNING: Surfacing issues that might cause bad performance or slow compilation times - by default gets logged on stdout
4. ERROR: Problems that prevent proper functioning - by default gets logged on stdout

Note that warnings.warn is a different library and is not hooked up to the python logging module by default

So the goal of this PR is to make it possible for teams to set the logging that is most appropriate to them. One annoying thing is that ruff flags logger calls that use f-strings or .format, so we have to use old-school %s formatting
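
A minimal sketch of what this looks like from the user's side, using only the standard `logging` module. The logger name comes from the snippet earlier in this description; the handler, the format string, and the sample messages are assumptions for illustration and are not taken from `cpp_extension.py`.

```python
import io
import logging

# Same logger name as in the recommendation above; everything else is illustrative.
logger = logging.getLogger("torch.utils.cpp_extension")
logger.setLevel(logging.WARNING)
logger.propagate = False  # keep the demo output out of the root logger

buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
logger.addHandler(handler)

name = "grayscale"
# Lazy %s formatting: arguments are only interpolated if the record is emitted.
logger.info("Building extension module %s...", name)            # dropped at WARNING
logger.warning("TORCH_CUDA_ARCH_LIST is not set for %s", name)  # kept

captured = buffer.getvalue()
```

Passing the arguments separately instead of pre-formatting the string also skips the interpolation for records that are filtered out.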

An unrelated improvement I'd be happy to push to a separate PR is adding support for "native" in `TORCH_CUDA_ARCH_LIST` which would just pick the ARCH for the current device

An example of what's in stderr today

```
Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu124/grayscale/build.ninja...
/usr/local/lib/python3.11/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Building extension module grayscale...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module grayscale...
/usr/local/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:679: UserWarning: Graph break due to unsupported builtin grayscale.PyCapsule.grayscale. This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind). If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround. If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use torch.compiler.allow_in_graph.
  torch._dynamo.utils.warn_once(msg)
```

Whereas after this PR users can do

`python benchmark_load_inline.py > >(tee stdout.txt) 2> >(tee stderr.txt >&2)`

```python
import os
import sys
from pathlib import Path
import shutil
import tempfile

import torch
from torch.utils.cpp_extension import load_inline

import logging
cpp_ext_logger = logging.getLogger("torch.utils.cpp_extension")
cpp_ext_logger.setLevel(logging.WARNING)

os.environ["TORCH_CUDA_ARCH_LIST"] = "native"

cpp_code = """
torch::Tensor to_gray(torch::Tensor input);
"""

cuda_kernel_code = """
torch::Tensor to_gray(torch::Tensor input) {
  auto output = torch::epty({input.size(0), input.size(1)}, input.options());
  return output ;
}
"""

# Avoid caching results
with tempfile.TemporaryDirectory() as build_dir:
    cuda_module = load_inline(
        name="to_gray_cuda",
        cpp_sources=cpp_code,
        cuda_sources=cuda_kernel_code,
        functions=["to_gray"],
        with_cuda=True,
        verbose=True,
        extra_cflags=["-std=c++17"], # "-ftime-report", "-H"],
        extra_cuda_cflags=["-arch=sm_89"],
        build_directory=build_dir,
    )

```

## New logs

### On failure

Which gives a much more reasonable stdout

```
[1/3] /usr/local/cuda-12.8/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda.cuda.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -arch=sm_89 -std=c++17 -c /tmp/tmpbg_xzv0r/cuda.cu -o cuda.cuda.o
FAILED: cuda.cuda.o
/usr/local/cuda-12.8/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda.cuda.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -arch=sm_89 -std=c++17 -c /tmp/tmpbg_xzv0r/cuda.cu -o cuda.cuda.o
/tmp/tmpbg_xzv0r/cuda.cu(6): error: namespace "torch" has no member "epty"
    auto output = torch::epty({input.size(0), input.size(1)}, input.options());
                         ^

1 error detected in the compilation of "/tmp/tmpbg_xzv0r/cuda.cu".
[2/3] c++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -fPIC -std=c++17 -std=c++17 -c /tmp/tmpbg_xzv0r/main.cpp -o main.o
ninja: build stopped: subcommand failed.

```

And stderr

```
Traceback (most recent call last):
  File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2874, in _run_ninja_build
    subprocess.run(
  File "/home/marksaroufim/.conda/envs/nv/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/marksaroufim/load_inline_slow/benchmark_load_inline.py", line 30, in <module>
    cuda_module = load_inline(
  File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2261, in load_inline
    return _jit_compile(
  File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2367, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2528, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/marksaroufim/pytorch/torch/utils/cpp_extension.py", line 2892, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'to_gray_cuda'

```

### On success

stdout

```
[1/3] c++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -fPIC -std=c++17 -std=c++17 -c /tmp/tmpxv_ovlrf/main.cpp -o main.o
[2/3] /usr/local/cuda-12.8/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda.cuda.o.d -DTORCH_EXTENSION_NAME=to_gray_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /home/marksaroufim/pytorch/torch/include -isystem /home/marksaroufim/pytorch/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/marksaroufim/.conda/envs/nv/include/python3.10 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -arch=sm_89 -std=c++17 -c /tmp/tmpxv_ovlrf/cuda.cu -o cuda.cuda.o
[3/3] c++ main.o cuda.cuda.o -shared -L/home/marksaroufim/pytorch/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda-12.8/lib64 -lcudart -o to_gray_cuda.so

```

And an empty stderr as expected
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152260
Approved by: https://github.com/albanD
2025-04-30 18:30:28 +00:00
05933e08ca [ATen][CUDA][SDPA] Enable SDPA on sm_121 (#152314)
This PR adds support for `sm_121` of the DGX Spark. The `sm_121` is binary compatible with `sm_120` (just like `sm_89` and `sm_86`), therefore a compilation targeting `sm_121` is not required.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152314
Approved by: https://github.com/eqy
2025-04-30 18:04:50 +00:00
b027cb8f9e [Docs] Add Description of validate_args for torch.distributions (#152173)
Fixes #152165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152173
Approved by: https://github.com/soulitzer
2025-04-30 18:01:20 +00:00
cyy
256c96332c [1/N] Use std::filesystem (#152288)
Maybe it is time to use std::filesystem because CXX11 ABI is now the default. The changes are for jit and distributed code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152288
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-30 17:54:16 +00:00
62ab6a5bb1 [ROCm] Use almalinux docker files for building Magma (#152488)
Fixes #151707 for ROCm Magma builds.  See also #152358.  Depends on #152492.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152488
Approved by: https://github.com/atalman
2025-04-30 17:53:30 +00:00
c620763ec2 [CUDAGraph Trees] support memory allocation on side stream (#152472)
I tried `beginAllocateToPool` instead of `_cuda_beginAllocateCurrentStreamToPool` and the error in #151199 does not happen any more.

However, this approach is unsafe for multithreading. When multiple `run_eager` calls happen concurrently, we expect each of them to allocate memory to a different mem_pool. Since `beginAllocateToPool` does not check the stream, these allocations may end up in the same mem_pool.

So, I use `_cuda_beginAllocateCurrentThreadToPool` to direct all memory allocation on the same thread to a given mem_pool. In particular, `_cuda_beginAllocateCurrentThreadToPool` records the launching thread id, and during runtime checks if the current thread id matches the launching thread id.
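
The thread check described above can be sketched in plain Python. This is only an illustrative model: `ThreadBoundPool` and its methods are hypothetical names, not the PyTorch allocator API, and the real check lives in the CUDA caching allocator.

```python
import threading


class ThreadBoundPool:
    """Toy model of a memory pool that only accepts allocations from one thread."""

    def __init__(self, name):
        self.name = name
        # Record the id of the thread that started routing allocations here.
        self.owner_thread_id = threading.get_ident()
        self.allocations = []

    def accepts_current_thread(self):
        # Runtime check: does the allocating thread match the recorded one?
        return threading.get_ident() == self.owner_thread_id

    def allocate(self, nbytes):
        if not self.accepts_current_thread():
            return False  # in this model the caller would use another pool
        self.allocations.append(nbytes)
        return True


pool = ThreadBoundPool("graph_pool")
same_thread_ok = pool.allocate(1024)

# An allocation attempted from a second thread is rejected by the pool.
other_thread_result = []
worker = threading.Thread(
    target=lambda: other_thread_result.append(pool.allocate(2048))
)
worker.start()
worker.join()
```

Keying on the thread rather than the stream is what keeps concurrent `run_eager` calls from sharing a pool in this model.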

Fixes #151199

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152472
Approved by: https://github.com/eellison
2025-04-30 17:45:07 +00:00
0904a182c2 [dynamo] Relax guard introduced when tracing __call__ on user defined object (#152395)
This relaxes the guard introduced in #100444, which aggressively guards
on the object id even though Dynamo is just tracing its `__call__` method.

This allows users to bypass the high compilation time issue in #150706
by compiling transformer blocks only. Without this patch, we'd get lots
of unnecessary recompilation, as the blocks have different attention
processor instances.

Compiling blocks only _significantly_ speeds up compilation process
(from ~310s to ~32s), and even speeds up e2e performance for some reason
(7.83s to 7.67s).
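
A rough way to see why an id-based guard causes recompilation is the toy cache below. It assumes the relaxed guard keys on the shared `__call__` function instead of the instance; the guard Dynamo actually installs may differ, and `AttnProcessor`, `id_guard_key`, `call_guard_key`, and `count_compilations` are hypothetical names.

```python
class AttnProcessor:
    """Stand-in for a user-defined callable object."""

    def __call__(self, x):
        return x * 2


def id_guard_key(obj):
    # Strict guard: every new instance produces a new key.
    return id(obj)


def call_guard_key(obj):
    # Relaxed guard (assumption): instances sharing __call__ share a key.
    return type(obj).__call__


def count_compilations(objs, key_fn):
    """Count cache misses, each of which stands for one (re)compilation."""
    cache = set()
    compiles = 0
    for obj in objs:
        key = key_fn(obj)
        if key not in cache:
            cache.add(key)
            compiles += 1
    return compiles


processors = [AttnProcessor() for _ in range(4)]
strict = count_compilations(processors, id_guard_key)
relaxed = count_compilations(processors, call_guard_key)
```

With four live instances, the strict key misses on every instance while the relaxed key misses once.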

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152395
Approved by: https://github.com/anijain2305
ghstack dependencies: #152369
2025-04-30 17:34:21 +00:00
e4994e2f73 [AOTAutogradCache] Allow torch.Tensor and a non-torch op from einops (#152369)
This addresses part of #150706.

Specifically, it reduces the warm start `torch.compile` overhead by
40~50% for GGUF models on
1. HuggingFace diffusers: [tlparse before, 224s](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpqgbdva/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000) v.s. [tlparse after, 126s](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp950PFy/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000)
2. ComfyUI: [tlparse before, 93s](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp7SeJb4/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000) v.s. [tlparse after, 51s](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpRwGNqA/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000)

The improvements should generalize to all other GGUF models on these
platforms, because the cache miss was induced by framework code, which
will be hit by every GGUF model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152369
Approved by: https://github.com/jamesjwu
2025-04-30 17:34:21 +00:00
ce2cf31623 Remove dead binary_ios_build, test, upload scripts (#152461)
Can't find any mentions of them in the codebase, presumably no longer used?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152461
Approved by: https://github.com/seemethere, https://github.com/janeyx99, https://github.com/malfet
2025-04-30 17:10:27 +00:00
702264dad4 Revert "Change test/inductor/test_standalone_compile to test/inductor/test_compile (#152103)"
This reverts commit ff1099562d261315ac7bbf43f3795872099a1c31.

Reverted https://github.com/pytorch/pytorch/pull/152103 on behalf of https://github.com/clee2000 due to failure is real but log classifier is pointing at an unrelated line, actual failure is just that the old name is mentioned somewhere and needs to be changed, see the bottom of the test step of the job https://github.com/pytorch/pytorch/actions/runs/14740884246/job/41379127184#step:22:705 [GH job link](https://github.com/pytorch/pytorch/actions/runs/14758321324/job/41434697413) [HUD commit link](ff1099562d) ([comment](https://github.com/pytorch/pytorch/pull/152103#issuecomment-2842638551))
2025-04-30 16:57:58 +00:00
8aa65780f4 [CUDA] Fix test_multi_device_context_manager on CUDA (#152474)
Seems there was a typo where `set_device` was called when the intent was to use `current_device`

As-is the test will fail on multigpu systems with

`TypeError: set_device() missing 1 required positional argument: 'device'`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152474
Approved by: https://github.com/Skylion007
2025-04-30 16:53:10 +00:00
1e4bcd3ba3 Remove unnecessary condition compilation macro (#152512)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152512
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-04-30 16:48:25 +00:00
3b105ccc04 [AOTI] Fix a memory leak in model_package_loader (#152334)
Summary: There was a char array allocated but never freed. It was found by valgrind and verified fixed with this PR, although it's not easy to write a unit test for it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152334
Approved by: https://github.com/angelayi, https://github.com/Skylion007
2025-04-30 16:21:50 +00:00
c7484805ca Add two missing JIT tests to CMake (#152440)
Looks like I forgot to add these.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152440
Approved by: https://github.com/Skylion007
2025-04-30 16:18:55 +00:00
ff1099562d Change test/inductor/test_standalone_compile to test/inductor/test_compile (#152103)
These are the tests for torch._inductor.compile, so I renamed the file
test_compile. This is to avoid confusion with
torch._inductor.standalone_compile, which is now a lot more standalone
than torch._inductor.compile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152103
Approved by: https://github.com/oulgen
2025-04-30 15:27:44 +00:00
3c2bf24786 [ROCm] add almalinux images (#152492)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152492
Approved by: https://github.com/atalman
2025-04-30 15:14:01 +00:00
d88e0ceb64 Cast to unsigned char to avoid UB (#152360)
The standard requires that the argument to functions like `isdigit`, `isalpha`, and similar must be either `EOF` or an `unsigned char`; otherwise, the behavior is undefined (UB).
To avoid out-of-bounds reads, modern implementations of some libraries (such as glibc) deliberately pad their internal tables to guarantee valid memory access even for negative values. However, this is implementation-specific, and other libraries may not do this.

Properly casting the argument to `unsigned char` is good practice to avoid potential issues on some platforms.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152360
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-04-30 15:09:13 +00:00
4408701fed [CI][CD] Unify install_cuda and install_cuda_aarch64 scripts (#152140)
Generalize install_cuda so it can also handle aarch64
Remove install_cuda_aarch64 since install_cuda can now handle it
Make install_cuda and install_cudnn functions in the install_cuda script because most of the code is the same

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152140
Approved by: https://github.com/huydhn, https://github.com/atalman
2025-04-30 15:09:06 +00:00
371999782a Revert "Fix flaky test in test_custom_ops (#152484)"
This reverts commit 5a52e050248c71dd6e84f51d25cbd17a88555800.

Reverted https://github.com/pytorch/pytorch/pull/152484 on behalf of https://github.com/malfet due to It broke test_save to file with TypeError: get_sample_op_profile() missing 1 required argument ([comment](https://github.com/pytorch/pytorch/pull/152484#issuecomment-2842254907))
2025-04-30 14:53:15 +00:00
d620fefb2c [invoke_subgraph] Use backward identifier for min-cut partitioning (#152207)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152207
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2025-04-30 14:34:56 +00:00
cf894b3f1f [MPS][BE] Remove exec_binary_alpha_kernel (#152485)
Which was almost a complete copy-n-paste from exec_binary_kernel anyway
Just add `Scalar` as an optional argument and figure out kernel name during the invocation rather than in executor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152485
Approved by: https://github.com/Skylion007
ghstack dependencies: #152443, #152466, #152479, #152504
2025-04-30 14:09:14 +00:00
c90e23eb73 [inductor] Fix usage of launch_enter_hook/launch_exit_hook (#152457)
In https://github.com/triton-lang/triton/pull/6467 I moved where `launch_enter_hook`/`launch_exit_hook` are specified (from the kernel class to a config). This PR updates the usages to use the config module if it exists to support tip of main triton.

In https://github.com/triton-lang/triton/pull/6641 I renamed `triton.config` to `triton.knobs`, hence the second commit in this PR.
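
The "use the config module if it exists" fallback can be sketched with stand-in objects. The `runtime` sub-namespace and the helper name `install_launch_hooks` are assumptions made for illustration, and `FakeKernel` is a hypothetical placeholder for the kernel class; no real Triton import is used here.

```python
from types import SimpleNamespace


def install_launch_hooks(triton_mod, kernel_cls, enter_hook, exit_hook):
    """Set the hooks on whichever location this (stand-in) Triton exposes."""
    # Prefer the newest location, then the intermediate one.
    for module_name in ("knobs", "config"):
        module = getattr(triton_mod, module_name, None)
        runtime = getattr(module, "runtime", None)
        if runtime is not None:
            runtime.launch_enter_hook = enter_hook
            runtime.launch_exit_hook = exit_hook
            return f"{module_name}.runtime"
    # Older layout: hooks are attributes of the kernel class itself.
    kernel_cls.launch_enter_hook = enter_hook
    kernel_cls.launch_exit_hook = exit_hook
    return "kernel class"


class FakeKernel:
    """Hypothetical stand-in for the compiled kernel class."""


old_triton = SimpleNamespace()
new_triton = SimpleNamespace(knobs=SimpleNamespace(runtime=SimpleNamespace()))

where_old = install_launch_hooks(old_triton, FakeKernel, print, print)
where_new = install_launch_hooks(new_triton, FakeKernel, print, print)
```

Probing with `getattr` keeps one code path working against both older and tip-of-main layouts.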

Test Plan: Setup OSS PT with tip of main triton (namely including https://github.com/triton-lang/triton/pull/6641) and run `python test/inductor/test_pad_mm.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152457
Approved by: https://github.com/jamesjwu
2025-04-30 13:22:16 +00:00
36acaaae3f [CUDA] Add new architectures (#152414)
CUDA 12.9 will introduce a couple of new architectures, `sm_103` and `sm_121`. We do not need to build for them, because they are going to be compatible with `sm_100` and `sm_120` respectively (similar to `sm_86` and `sm_89`), but PyTorch must be "aware" of them.
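
As a sketch of what being "aware" could involve, the helper below picks a compatible prebuilt architecture for a device. It assumes the usual CUDA rule that a binary built for compute capability X.y runs on X.z when z >= y; `parse_sm` and `find_compatible_build` are hypothetical helpers, and architecture-specific targets such as `sm_90a` are not handled.

```python
def parse_sm(arch):
    """Split 'sm_121' into (major, minor) = (12, 1)."""
    digits = arch.split("_")[1]
    return int(digits[:-1]), int(digits[-1])


def find_compatible_build(device_arch, built_archs):
    """Return the newest built arch with the same major and minor <= device's."""
    dev_major, dev_minor = parse_sm(device_arch)
    candidates = [
        arch
        for arch in built_archs
        if parse_sm(arch)[0] == dev_major and parse_sm(arch)[1] <= dev_minor
    ]
    # default=None covers devices with no compatible prebuilt binary.
    return max(candidates, key=parse_sm, default=None)


built = ["sm_80", "sm_86", "sm_100", "sm_120"]
for_sm_121 = find_compatible_build("sm_121", built)
for_sm_103 = find_compatible_build("sm_103", built)
for_sm_89 = find_compatible_build("sm_89", built)
for_sm_75 = find_compatible_build("sm_75", built)
```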

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152414
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/malfet
2025-04-30 09:55:27 +00:00
ece1658418 [ROCm][TunableOp] Fix ScaledGEMM rowwise (#152403)
Fixes TunableOp ScaledGEMM regression for rowwise scaling caused by this https://github.com/pytorch/pytorch/pull/147548

Credit goes to @mawong-amd for fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152403
Approved by: https://github.com/jeffdaily
2025-04-30 08:33:03 +00:00
7a9d0d2451 Revert "[PT2] Port replace_lce_with_matmul / replace_first_lce_with_fused_matmul_lce to PT2 pre_grad passes (#152450)"
This reverts commit c8f48eb18531e4e348fcfa718b2e52d3c2497197.

Reverted https://github.com/pytorch/pytorch/pull/152450 on behalf of https://github.com/wdvr due to still failing after https://github.com/pytorch/pytorch/pull/152493 - needs further investigation ([comment](https://github.com/pytorch/pytorch/pull/152450#issuecomment-2841212970))
2025-04-30 08:30:57 +00:00
424e21ae82 Revert "fix tests broken after #152450 (#152493)"
This reverts commit d8fe6fa280c3e5bd21b3e84b3e25d9204ccdedf7.

Reverted https://github.com/pytorch/pytorch/pull/152493 on behalf of https://github.com/wdvr due to still failing ([comment](https://github.com/pytorch/pytorch/pull/152493#issuecomment-2841207942))
2025-04-30 08:27:58 +00:00
fa6f9eb2be [CUDA][TF32] Account for TF32 in compile_kernel_advanced (#152468)
Also cleanup some uses of `assert_close` in favor of `self.assertEqual`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152468
Approved by: https://github.com/msaroufim
2025-04-30 07:54:38 +00:00
d8fe6fa280 fix tests broken after #152450 (#152493)
Updating test expected value after #152450

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152493
Approved by: https://github.com/huydhn, https://github.com/malfet

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-04-30 07:16:10 +00:00
5a52e05024 Fix flaky test in test_custom_ops (#152484)
Hopefully fixes https://github.com/pytorch/pytorch/issues/151301, https://github.com/pytorch/pytorch/issues/151281 by making the ops have different names
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152484
Approved by: https://github.com/zou3519
2025-04-30 07:07:27 +00:00
cc7346bf19 Revert "fix tests broken after #152450 (#152493)"
This reverts commit 4df97a883949564aa4ed20b6912c3eb664d2624c.

Reverted https://github.com/pytorch/pytorch/pull/152493 on behalf of https://github.com/huydhn due to Another tweak is needed https://github.com/pytorch/pytorch/actions/runs/14748144909/job/41399954902, seem easier to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/152493#issuecomment-2841010528))
2025-04-30 07:05:58 +00:00
59a8aa1489 Fix instantiate_device_type_tests() for 3rd-party devices (#152177)
For 3rd-party devices, calling ``instantiate_device_type_tests()`` with a ``str`` object (rather than a `List[str]`/`Tuple[str]`) passed explicitly to ``only_for`` or ``except_for`` currently causes unexpected results.

For example, calling ``instantiate_device_type_tests(TestXXX, globals(), only_for="cpu")`` goes into [filter_desired_device_types()](f38dae76ee/torch/testing/_internal/common_device_type.py (L729)) and results in ``only_for=['c', 'p', 'u']``, because the ``only_for`` we passed is the string "cpu".

This PR fixes the above unexpected behavior for the ``str`` case.
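
The failure mode comes from Python iterating a string character by character. A minimal sketch of one possible normalization step follows; `normalize_device_filter` is a hypothetical helper, not the function changed in the PR.

```python
def normalize_device_filter(value):
    """Accept None, a single device string, or a list/tuple of device strings."""
    if value is None:
        return None
    if isinstance(value, str):
        # Wrap the string so it is not iterated character by character.
        return [value]
    return list(value)


# What happens without the str check:
split_by_accident = list("cpu")

# What the normalization returns instead:
single = normalize_device_filter("cpu")
several = normalize_device_filter(("cpu", "cuda"))
```

Here `split_by_accident` is `['c', 'p', 'u']`, while `single` is `['cpu']`.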

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152177
Approved by: https://github.com/albanD
2025-04-30 06:25:59 +00:00
a2c553cac6 [Metal] Extend typecasted op support to complex dtypes (#152504)
First of all, extend `c10::metal::cast_to` to work correctly with complex dtypes by introducing two more specializations: one that casts complex to scalar, and another that casts scalar to complex (as the default Metal typecast will turn `float x` into `float2(x, x)`).

Add ComplexHalf and ComplexFloat enum values to `c10::metal::ScalarTypes` and handle them in `val_at_offs(ptr, offs, type)`.
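
To illustrate why a dedicated scalar-to-complex cast is needed, here is a small Python model that stores a complex value as a `(real, imag)` pair. All three function names are hypothetical, and the assumption that a complex-to-scalar cast keeps only the real part is the sketch's, not something stated above.

```python
def broadcast_cast(x):
    # Mimics the default vector typecast described above: float x -> float2(x, x).
    return (x, x)


def scalar_to_complex(x):
    # Behaviour wanted for complex dtypes: real part x, imaginary part zero.
    return (float(x), 0.0)


def complex_to_scalar(z):
    # Assumption: casting complex to scalar keeps the real part only.
    return z[0]


wrong = broadcast_cast(3.0)
right = scalar_to_complex(3.0)
round_trip = complex_to_scalar(right)
```

The broadcast version would turn the real number 3 into 3 + 3i, which is why a plain vector cast cannot be reused for complex outputs.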
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152504
Approved by: https://github.com/dcci
ghstack dependencies: #152443, #152466, #152479
2025-04-30 05:32:07 +00:00
4df97a8839 fix tests broken after #152450 (#152493)
Updating test expected value after #152450

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152493
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-04-30 04:55:55 +00:00
fcfa6e36c9 [MPS] Fix lerp for complex numbers (#152479)
As well as `.add`/`.sub` with complex alpha

Before this change `python3 -c "import torch;print(torch.rand(10, device='mps', dtype=torch.complex64).add(torch.rand(10, device='mps', dtype=torch.complex64), alpha=.5j))"` used to fail with
```
RuntimeError: value cannot be converted to type double without overflow
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152479
Approved by: https://github.com/dcci
ghstack dependencies: #152443, #152466
2025-04-30 04:46:19 +00:00
9bfdf57572 [MPS][BE] Introduce c10::metal::mul (#152466)
Which multiplies two arguments for either scalar or complex data types

This allows one to get rid of a bunch of complex specializations in BinaryOps
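
The idea of a single multiply that covers both cases can be modelled in Python, with complex values stored as `(real, imag)` pairs. This is a sketch of the arithmetic only; the Metal helper is a template and its exact overloads are not shown here.

```python
def mul(a, b):
    """Multiply two scalars, or two complex values stored as (real, imag) pairs."""
    if isinstance(a, tuple) and isinstance(b, tuple):
        ar, ai = a
        br, bi = b
        # (ar + ai*i) * (br + bi*i) = (ar*br - ai*bi) + (ar*bi + ai*br)*i
        return (ar * br - ai * bi, ar * bi + ai * br)
    return a * b


scalar_product = mul(2.0, 3.0)
complex_product = mul((1.0, 2.0), (3.0, 4.0))
```

Callers then use one name for every dtype instead of carrying a separate complex branch.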
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152466
Approved by: https://github.com/dcci
ghstack dependencies: #152443
2025-04-30 04:45:47 +00:00
ee2d104c05 [cutlass backend] Add (limited) bmm dynamic shape support (#152393)
Differential Revision: D73626732

In this PR, we add support for bmm dynamic shape, provided that the batch stride is the biggest in the stride for A, B, and D. For example, for A of size `(B, M, K)`, we support stride `(M*K, K, 1)` and `(M*K, 1, M)`. With this assumption, we can infer the batch stride from existing arguments.
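
A small sketch of the layout assumption: for a `(B, M, K)` operand whose inner matrix is dense, the batch stride is `M*K` for both supported layouts, so it does not need to be passed separately. `infer_batch_stride` is a hypothetical helper written for illustration, not code from the cutlass backend.

```python
def infer_batch_stride(size, stride):
    """Return the batch stride for a (B, M, K) operand, or None if unsupported."""
    _, m, k = size
    batch_stride, row_stride, col_stride = stride
    row_major = (row_stride, col_stride) == (k, 1)
    col_major = (row_stride, col_stride) == (1, m)
    if not (row_major or col_major):
        return None
    # With a dense inner matrix, the batch stride must be M*K.
    expected = m * k
    return expected if batch_stride == expected else None


row_major_case = infer_batch_stride((8, 4, 3), (12, 3, 1))
col_major_case = infer_batch_stride((8, 4, 3), (12, 1, 4))
unsupported_case = infer_batch_stride((8, 4, 3), (1, 24, 8))
```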

The reason is we don't want to add 2-3 more runtime params. The concerns are complexity and possible perf regression, though we didn't verify the latter.

We can revisit this if there is a need for that.

We also remove `B = 1` for normal mm and addmm. We tested it and didn't see perf regression. But open to revisiting this as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152393
Approved by: https://github.com/ColinPeppler
2025-04-30 04:36:24 +00:00
e5ea7911ea [ez] Make relaxed constraint error message more user friendly (#151407)
Fixes #151356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151407
Approved by: https://github.com/Skylion007
2025-04-30 03:51:50 +00:00
c01bcc5efb [MPS][BE] Delete unused lerp functors (#152443)
For `lerp.Scalar_out` weight (aka alpha) is not an optional argument, so no point in having those specializations.
But move `alpha=1.0` ahead of dispatching to Metal shaders, as plain copy of tensor should still be faster a1a4fee3b8/aten/src/ATen/native/mps/operations/BinaryOps.mm (L285-L290)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152443
Approved by: https://github.com/Skylion007
2025-04-30 03:32:52 +00:00
4a63cab624 [cudagraphs] Fix issue in collecting static_input_idxs (#152287)
related to https://github.com/pytorch/pytorch/issues/152275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152287
Approved by: https://github.com/bdhirsh, https://github.com/eellison

Co-authored-by: Brian Hirsh <hirsheybar@fb.com>
2025-04-30 03:24:05 +00:00
bce7f0a216 Fix additional inputs to error on inconsistent constants (#151970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151970
Approved by: https://github.com/pianpwk
2025-04-30 01:38:17 +00:00
4bead7b85e use cutlass native BroadcastPtrArray in scaled group gemm (#152404)
After cutlass update to 3.9 we can use BroadcastPtrArray instead of a local copy with small changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152404
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-04-30 01:17:28 +00:00
eqy
cc072af74a [CUDA][MXFP8] bump tolerances for test_blockwise_mxfp8_nvfp4_numerics (#151811)
got a slightly lower sqnr on a smaller GPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151811
Approved by: https://github.com/albanD
2025-04-30 01:12:51 +00:00
bea7d428bc [export] Preserve custom metadata for tensor constants (#152241)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/151476
The `custom_meta` collected from `mod` has keys that follow the names of nodes in `mod`, which are inconsistent with the node names after the naming pass. For example, a constant `b` will become `c_b`.

Test Plan: buck2 run caffe2/test:test_export -- -r test_run_decompositions_keep_tensor_constant_metadata

Differential Revision: D73703068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152241
Approved by: https://github.com/angelayi
2025-04-30 00:30:35 +00:00
d36b09ca58 [aten] Enable vectorized 8byte copy for fp16/bf16 for index select kernel (#152380)
## Summary

Enable aligned vector loading for 2 bytes data types for index select. Specifically:

- **4 element fp16/bf16 packing**: added 8-byte vector load/store to move 4 half values at once.
- **warp-wide predicate (__all_sync)**: decide fast vs fallback path per warp, eliminating lane level divergence
- **alignment guard**: fast or vectorized path only executes when src and dst are 8 byte aligned, preventing misaligned address faults.
- **Safe for loop fallback**: for misaligned, stride > 1, or tail elements we recompute offsets per element to avoid memory corruption.
- **Bound checks**: fast or vectorized path is skipped when fewer than 4 elements remain, guaranteeing bounded access.
- **Stride remapping**: Redirect calls to inner contiguous dim which has stride = 1 so copies occur along memory coalesced axes.
- **AMD support**: Ensured portability and correctness across CUDA and HIP platforms.
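
The path selection in the list above can be modelled on the host in plain Python. This is a simplified sketch, not the kernel: addresses are simulated from element offsets, the warp-wide `__all_sync` vote is not modelled, and `can_use_vector_path` and `copy_row` are hypothetical names.

```python
VEC = 4         # fp16/bf16 elements per 8-byte pack
ELEM_BYTES = 2  # size of one fp16/bf16 element


def can_use_vector_path(src_addr, dst_addr, stride, remaining):
    """Alignment guard, stride check, and bound check for the fast path."""
    aligned = src_addr % 8 == 0 and dst_addr % 8 == 0
    return aligned and stride == 1 and remaining >= VEC


def copy_row(src, dst, src_off, dst_off, n, stride=1):
    """Copy n elements; return how many moved via the 4-wide fast path."""
    vectorized = 0
    i = 0
    while i < n:
        src_addr = (src_off + i * stride) * ELEM_BYTES
        dst_addr = (dst_off + i) * ELEM_BYTES
        if can_use_vector_path(src_addr, dst_addr, stride, n - i):
            # Fast path: move one pack of four elements at once.
            dst[dst_off + i:dst_off + i + VEC] = src[src_off + i:src_off + i + VEC]
            vectorized += VEC
            i += VEC
        else:
            # Scalar fallback: recompute the offsets for this single element.
            dst[dst_off + i] = src[src_off + i * stride]
            i += 1
    return vectorized


src = list(range(16))
dst_aligned = [None] * 10
moved_aligned = copy_row(src, dst_aligned, 0, 0, 10)

dst_misaligned = [None] * 8
moved_misaligned = copy_row(src, dst_misaligned, 1, 0, 8)
```

In the aligned case the first eight elements use the fast path and the two-element tail falls back to scalar copies; with a source offset of one element, no position is aligned on both sides, so every element takes the fallback.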

## Perf testing
We note a 2.5x improvement in memory bandwidth after this change when the tensor dim is a multiple of 4 for 2 byte data types (fp16/bf16).

<img width="625" alt="image" src="https://github.com/user-attachments/assets/909b04a3-98f2-4c30-8c29-c36e1beeea0f" />

With input tensor dimension not being a multiple of 4, we see a smaller improvement (~1.2x) due to warp divergence.
<img width="624" alt="image" src="https://github.com/user-attachments/assets/f3ed16f4-b091-48bd-9889-093f6a90688d" />

## Perf testing code
```
# pyre-strict
from typing import List, Optional, Tuple

import click
import pandas as pd

import torch

# @manual=//triton:triton
import triton

@click.command()
@click.option("--data-type", type=str, default="bf16")
@click.option("--return-result", type=bool, default=False)
def main(
    data_type: str,
    return_result: bool,
) -> Optional[Tuple[List[triton.testing.Benchmark], List[pd.DataFrame]]]:
    torch.backends.cudnn.allow_tf32 = True
    torch.backends.cuda.matmul.allow_tf32 = True
    data_types = {"fp32", "fp16", "bf16"}
    if data_type not in data_types:
        raise ValueError(f"Unsupported data type: {data_type}.")

    dtype = {
        "fp32": torch.float32,
        "fp16": torch.float16,
        "bf16": torch.bfloat16
    }[data_type]

    D1 = 192
    D2 = 156
    configs: List[triton.testing.Benchmark] = [
        triton.testing.Benchmark(
            x_names=["B"],
            x_vals=[24],
            line_arg="provider",
            line_vals=[
                "repeat_interleave",
                "repeat_interleave_int32",
            ],
            line_names=["repeat_interleave", "repeat_interleave_int32"],
            styles=[("red", "-"), ("purple", "-")],
            ylabel="ms",
            plot_name=f"torch-repeat_interleave-D1-{D1}-D2-{D2}-dtype-{dtype}",
            args={
                "D1": D1,
                "D2": D2,
                "dtype": dtype,
            },
        )
    ]

    @triton.testing.perf_report(configs)
    def bench_repeat_interleave(
        B: int,
        D1: int,
        D2: int,
        dtype: torch.dtype,
        provider: str,
    ) -> float:
        warmup = 20
        rep = 100
        torch.manual_seed(42)
        torch.cuda.manual_seed(42)

        a = torch.randn(24, D1, D2)
        a = a.to(dtype).to("cuda")

        input_bytes = a.numel() * a.element_size()

        repeats = torch.randint(low=100, high=1600, size=(24,), device="cuda")
        # The output holds elements of `a`, so size them with a.element_size().
        output_bytes = (
            repeats.sum() * a.shape[1] * a.shape[2] * a.element_size()
        )
        total_bytes = input_bytes + output_bytes

        def torch_repeat_interleave(
            input_tensor: torch.Tensor, repeats: torch.Tensor
        ) -> torch.Tensor:
            res = input_tensor.repeat_interleave(repeats, dim=0)
            return res

        def torch_repeat_interleave_int32(
            input_tensor: torch.Tensor, repeats: torch.Tensor
        ) -> torch.Tensor:
            dim = 0
            if torch.is_tensor(repeats):
                idx64 = torch.repeat_interleave(
                    torch.arange(
                        0,
                        input_tensor.shape[dim or 0],
                        device=input_tensor.device,
                    ),
                    repeats,
                    dim=0,
                )
            else:
                idx64 = (
                    torch.arange(
                        input_tensor.shape[dim or 0] * repeats,
                        device=input_tensor.device,
                    )
                    .reshape(-1, repeats)
                    .flatten()
                )

            idx32 = idx64.to(torch.int32)
            res = torch.index_select(input_tensor, 0, idx32)
            return res

        def expand_flatten(input_tensor: torch.Tensor) -> torch.Tensor:
            return input_tensor[:, None].expand(-1, 4, -1).flatten(0, 1)

        if provider == "repeat_interleave":
            fn = lambda: torch_repeat_interleave(a, repeats)  # noqa E731
            ms = triton.testing.do_bench(fn, warmup=warmup, rep=rep)
            bw = total_bytes / (ms * 1e6)
            # print("Bandwidth[GB/s]: ", total_bytes / (ms * 1e6))
            return bw.item()
        elif provider == "repeat_interleave_int32":
            fn = lambda: torch_repeat_interleave_int32(a, repeats)
            ms = triton.testing.do_bench(fn, warmup=warmup, rep=rep)
            bw = total_bytes / (ms * 1e6)
            # print("Bandwidth[GB/s]: ", total_bytes / (ms * 1e6))
            return bw.item()
        elif provider == "expand_flatten":
            fn = lambda: expand_flatten(a)
            ms = triton.testing.do_bench(fn, warmup=warmup, rep=rep)
            bw = total_bytes / (ms * 1e6)
            # print("Bandwidth[GB/s]: ", total_bytes / (ms * 1e6))
            return bw.item()
        else:
            raise ValueError(f"unsupported provider: {provider}")

    df = bench_repeat_interleave.run(print_data=True, return_df=True)

    if return_result:
        return configs, df

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152380
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-04-29 23:54:52 +00:00
c6d3b8f861 add xfail for distributed tests on Jetson (#152224)
We are hitting distributed import failures on Jetson in test/export/test_export.py tests in NVIDIA internal testing with the recent additions of https://github.com/pytorch/pytorch/pull/146050 and https://github.com/pytorch/pytorch/pull/147417. Instead of simply skipping these tests for Jetson, we are introducing an xfailIfDistributedNotSupported to get better signaling for this kind of failure in the long run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152224
Approved by: https://github.com/nWEIdia, https://github.com/eqy
2025-04-29 23:48:40 +00:00
6f8023a35f [PowerPC] Fix vec256 for complex float and double in Power system (#152402)
The Power system build is failing with the error below.

It started failing after this commit:
912102b4ec

Fix the build error along with the test cases that are failing for the complex double and complex float data types.

Build Failure Logs:
```
vec_base.h:790:6: error: use of deleted function ‘at::vec::DEFAULT::ComplexDbl& at::vec::DEFAULT::Vectorized<c10::complex >::operator’
790 | c[i] = a[i] * b[i];
| ~^
error: use of deleted function ‘at::vec::DEFAULT::ComplexDbl& at::vec::DEFAULT::Vectorized<c10::complex >::operator’
802 | c[i] = a[i] / b[i];
| ~^

error: use of deleted function ‘at::vec::DEFAULT::ComplexFlt& at::vec::DEFAULT::Vectorized<c10::complex >::operator’
790 | c[i] = a[i] * b[i];
| ~^

error: use of deleted function ‘at::vec::DEFAULT::ComplexFlt& at::vec::DEFAULT::Vectorized<c10::complex >::operator’
802 | c[i] = a[i] / b[i];
| ~^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152402
Approved by: https://github.com/malfet
2025-04-29 23:45:49 +00:00
c8f48eb185 [PT2] Port replace_lce_with_matmul / replace_first_lce_with_fused_matmul_lce to PT2 pre_grad passes (#152450)
Summary:
Port over replace_lce_with_matmul and replace_first_lce_with_fused_matmul_lce to PT2 pre_grad pass.
Original dper pass diffs: D67884534, D68123479, D68384238

Test Plan:
Test 1. Covers replace_lce_with_matmul and case 1 of replace_first_lce_with_fused_matmul_lce
```
CUDA_VISIBLE_DEVICES=6 TORCH_LOGS=+inductor,aot TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt-split-dwarf   mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/669809193/0/gpu_lowering/input.predictor.disagg.gpu.merge  --lower-backend="AOT_INDUCTOR" --add_passes="use_matmul_fuse_lce_replace_first_LCE,use_contiguous_linear_reduction_replace_linear_reduction" --batch-size=3072 --gpu-trace --disable_acc_tracer=true 2>&1 | tee ~/logs/disable_acc_tracer/aoti_cmf_ctr_triton_669809193_0_diable_acc.log
```
Log: P1798246938

Test 2. Covers replace_lce_with_matmul and case 2 of replace_first_lce_with_fused_matmul_lce
```
CUDA_VISIBLE_DEVICES=7 TORCH_LOGS=+inductor,aot TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt-split-dwarf   mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/677734158/9/gpu_lowering/input.predictor.disagg.gpu.merge  --lower-backend="AOT_INDUCTOR" --add_passes="use_matmul_fuse_lce_replace_first_LCE,use_matmul_lce_replace_normal_LCE" --batch-size=3072 --gpu-trace --disable_acc_tracer=true 2>&1 | tee ~/logs/disable_acc_tracer/aoti_cmf_ctr_triton_677734158_9_diable_acc.log
```
Log: P1798246675

Seeing logs like
`[Pre grad(predispatch IR)] Apply use_matmul_fuse_lce_replace_first_LCE pass, save before/after graph to /tmp/tmp8lyzoh79, graph before/after are the same = False`

Reviewed By: huxintong

Differential Revision: D71358949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152450
Approved by: https://github.com/huxintong
2025-04-29 23:45:20 +00:00
e872bf8f88 Avoid linking multiple OMP runtimes in libtorch_cpu.so if BLAS used is OpenBLAS. (#147725)
When PyTorch is built with OpenBLAS support and libopenblas is directly linked with libgomp.so, libtorch_cpu.so ends up with multiple OMP runtimes linked against it. This may result in unexpected runtime behaviour or regressions. This patch fixes this by avoiding linking against libomp.so if OpenBLAS is linked against libgomp.so.

Fixes #146603

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147725
Approved by: https://github.com/albanD
2025-04-29 23:39:48 +00:00
a1a4fee3b8 Native channel shuffle floating point exception (#144010)
Fixes #142453

Added TORCH_CHECKS to prevent the user from using the native_channel_shuffle function incorrectly and getting a "Floating point exception (core dumped)"
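
A "Floating point exception" crash of this kind typically comes from an integer division or modulo by zero. The sketch below shows the sort of argument validation that turns it into a readable error; the exact conditions checked in the PR are not listed above, so both checks and the name `check_channel_shuffle_args` are assumptions.

```python
def check_channel_shuffle_args(channels, groups):
    """Raise a readable error instead of dividing by zero later."""
    if groups <= 0:
        raise ValueError(f"groups must be positive, got {groups}")
    if channels % groups != 0:
        raise ValueError(
            f"channels ({channels}) must be divisible by groups ({groups})"
        )
    # Safe now: groups is non-zero and divides channels evenly.
    return channels // groups


channels_per_group = check_channel_shuffle_args(8, 2)
```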

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144010
Approved by: https://github.com/albanD
2025-04-29 23:38:54 +00:00
8f420a500a Save/load op profiles (#151817)
Add ability to save/load op profiles into a yaml file:
```python
op_profile = self.get_sample_op_profile()

# Save
save_op_profiles(op_profile, "op_profile.yaml")
# Load
loaded = load_op_profiles("op_profile.yaml")

assert op_profile == loaded
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151817
Approved by: https://github.com/zou3519
2025-04-29 23:11:32 +00:00
8358eca2ce [Cutlass] Only run EVT tests on sm90 (#151713)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151713
Approved by: https://github.com/masnesral
ghstack dependencies: #152305, #152306, #150905, #151405
2025-04-29 23:06:01 +00:00
a1f6d85b36 [Cutlass] Fixes for e2e compilation in arg rendering (#151405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151405
Approved by: https://github.com/eellison
ghstack dependencies: #152305, #152306, #150905
2025-04-29 23:06:01 +00:00
a0ce5ce6e4 [Cutlass] Implement cutlass epilogue visitor python codegen (#150905)
This PR implements the second codegen task of CUTLASS EVT: translating inductor epilogue nodes into python code that will be traced by the EVT infra.

Details:
The implementation uses a simple ops wrapper which only supports add and mul pointwise ops today (to be extended in the future). This ops wrapper generates python code from inner_fn of the epilogue nodes in the format EVT expects. The main caveat is that one of the outputs needs to be named "D" and the accumulator input needs to be named "acc". Reads/writes are named according to the inductor buffer names otherwise.
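
The ops-wrapper idea can be sketched as follows. This is a toy model rather than the inductor implementation: `EpilogueCodegen`, the `tmp_N` naming, and the buffer names `buf0`/`buf1` are hypothetical, while the `acc` input and `D` output names follow the convention described above.

```python
class EpilogueCodegen:
    """Tiny ops wrapper: each op records one line of code and returns its name."""

    def __init__(self):
        self.lines = []
        self.counter = 0

    def _emit(self, expr):
        name = f"tmp_{self.counter}"
        self.counter += 1
        self.lines.append(f"{name} = {expr}")
        return name

    def add(self, a, b):
        return self._emit(f"{a} + {b}")

    def mul(self, a, b):
        return self._emit(f"{a} * {b}")

    def render(self, result, inputs):
        # The accumulator is always the first argument and the output is named D.
        args = ", ".join(["acc", *inputs])
        body = [f"    {line}" for line in self.lines]
        return "\n".join(
            [f"def fn({args}):", *body, f"    D = {result}", "    return D"]
        )


ops = EpilogueCodegen()
scaled = ops.mul("acc", "buf0")  # buf0, buf1: hypothetical buffer names
out = ops.add(scaled, "buf1")
source = ops.render(out, ["buf0", "buf1"])
```

The generated `source` is an ordinary Python function, so it can be handed to a tracer or executed directly to sanity-check the epilogue.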

Previously merged:
* #150904
* #150903
* #150346
* #150345
* #150344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150905
Approved by: https://github.com/eellison
ghstack dependencies: #152305, #152306
2025-04-29 23:05:55 +00:00
72273bef9e [Cutlass] Fix int check in example tensor creation (#152306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152306
Approved by: https://github.com/Skylion007, https://github.com/eellison
ghstack dependencies: #152305
2025-04-29 23:05:47 +00:00
4293a6095d [Cutlass] Remove unused dtype conversion map (#152305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152305
Approved by: https://github.com/Skylion007
2025-04-29 23:05:41 +00:00
a4a771648a [pt2d] Add reorder_comms_preserving_peak_memory pass (#146562)
This is a new pass to replace the pre-existing passes.  It has the same
basic goal, to achieve communication overlap (latency hiding), but also
constrains the solution to not increase peak memory.

The principles of operation are detailed in code comments, but
summarized here:
- never reorder collectives relative to each other (TBD if we should
  relax this later)
- before performing reordering, push all comm and wait nodes as late as possible, respecting data dependencies
- estimate peak memory and current memory at each scheduler node
- move collective nodes forward one position at a time, if the move does
  not increase curr memory beyond peak memory
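
The last step can be sketched on a toy schedule. This is not the code in `torch/_inductor/comms.py`: the `Node` class, the `mem_delta` memory model, and reading "forward" as "earlier in the schedule" are all assumptions for illustration, and the initial "push as late as possible" step is omitted.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class Node:
    name: str
    mem_delta: int                 # memory allocated (+) or freed (-) by the node
    is_collective: bool = False
    deps: frozenset = frozenset()  # names of nodes this node reads from


def running_memory(order):
    """Current memory after each node, as a running sum of deltas."""
    return list(accumulate(n.mem_delta for n in order))


def hoist_collectives(order):
    order = list(order)
    peak = max(running_memory(order))
    for node in [n for n in order if n.is_collective]:
        i = order.index(node)
        while i > 0:
            prev = order[i - 1]
            # Never pass another collective or a node we depend on.
            if prev.is_collective or prev.name in node.deps:
                break
            candidate = order[: i - 1] + [node, prev] + order[i + 1 :]
            # Reject the move if current memory would exceed the original peak.
            if max(running_memory(candidate)) > peak:
                break
            order = candidate
            i -= 1
    return order


schedule = [
    Node("A", +1),
    Node("B", +3),
    Node("C", -3),
    Node("D", +1),
    Node("AG", +1, is_collective=True, deps=frozenset({"A"})),
]
new_names = [n.name for n in hoist_collectives(schedule)]
```

In this toy schedule the collective moves past `D` and then stops, because moving past `C` would raise current memory above the original peak.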

The pass logs a summary table for each graph to TORCH_LOGS=overlap.

e.g. (exact format may have been tweaked but this shows the idea).

```
rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] Collective node                                                                                                                                                initial exposed    final exposed    improvement  limiting factor        moves
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] -----------------------------------------------------------------------------------------------------------------------------------------------------------  -----------------  ---------------  -------------  -------------------  -------
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op2')  (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[2256, 256], stride=[256, 1]) (buf2) (12142 ns)               12141.6          6514.53       5627.08   prefetch limit            75
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op6')  (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[282, 256], stride=[256, 1]) (buf7) (32266 ns)                 32265.8         28429.2        3836.61   data dependency           78
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op9')  (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[256], stride=[1]) (buf11) (10801 ns)                         10800.6         10732.3          68.254  peak memory                1
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op14')  (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[32], stride=[1]) (buf17) (10810 ns)                          10809.5         10809.5           0      data dependency            4
[rank
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146562
Approved by: https://github.com/eellison
ghstack dependencies: #152060, #146561
2025-04-29 22:51:31 +00:00
e35e31697e Revert "[MPS][BE] Delete unused lerp functors (#152443)"
This reverts commit 0a2d3206a82c4a5c923938cf0a0ebc0f47aa17dd.

Reverted https://github.com/pytorch/pytorch/pull/152443 on behalf of https://github.com/wdvr due to failing MPS test: test/test_optim.py::TestOptimRenewedMPS::test_can_load_from_to_named_state_dict_is_named_optim0_False_is_named_optim1_False_Adafactor_mps_float32 ([comment](https://github.com/pytorch/pytorch/pull/152443#issuecomment-2840405966))
2025-04-29 22:50:23 +00:00
fecaa60c3c Revert "Add detailed triton kernel logging to tlparse (#152197)"
This reverts commit 8303860de779da840316dd95ce3051e0a4119174.

Reverted https://github.com/pytorch/pytorch/pull/152197 on behalf of https://github.com/wdvr due to failing     python test/dynamo/test_structured_trace.py StructuredTraceTest.test_cudagraphs on trunk ([comment](https://github.com/pytorch/pytorch/pull/152197#issuecomment-2840400839))
2025-04-29 22:47:48 +00:00
471025c489 Revert "[AOTI][reland] Remove typedef for half and bfloat16 (#151109)"
This reverts commit a0d440a26a555c34e87b90bef3bff960b34bb180.

Reverted https://github.com/pytorch/pytorch/pull/151109 on behalf of https://github.com/wdvr due to causing AOTI test failures - discussed with author ([comment](https://github.com/pytorch/pytorch/pull/151109#issuecomment-2840386483))
2025-04-29 22:37:16 +00:00
accffef504 Run link checks on modified files on push too (#152464)
https://github.com/pytorch/pytorch/issues/152439
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152464
Approved by: https://github.com/huydhn
2025-04-29 22:08:40 +00:00
89c0c3ca80 Add private config to broadcast rank0 decision from the partitioner to all ranks (#152264)
Summary: This PR adds a private configuration to the partitioner that ensures that the decision taken is the same across all ranks. This is a temporary workaround, as when size_hints are also taken into account in compiler collectives this workaround will not be needed anymore.

Test Plan:
This has been tested on some internal models, but I haven't added any tests in PyTorch (yet?)
T

Differential Revision: D73666017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152264
Approved by: https://github.com/bdhirsh
2025-04-29 21:27:57 +00:00
28efeb1522 Remove unused Manylinux2014 Docker files and builds (#152428)
Related to Manylinux 2.28 migration: https://github.com/pytorch/pytorch/issues/123649
Cleanup old Docker files and `manylinuxaarch64-builder:cpu-aarch64` image which has been replaced by `manylinux2_28_aarch64-builder:cpu-aarch64`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152428
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-04-29 20:57:29 +00:00
c039cb1a06 submodules: point gloo to new home in pytorch/ (#152438)
Gloo moved to the PyTorch GitHub org. This updates PyTorch to point to the new location.

https://github.com/pytorch/gloo

Test plan:

CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152438
Approved by: https://github.com/fduwjj
2025-04-29 20:42:24 +00:00
0a2d3206a8 [MPS][BE] Delete unused lerp functors (#152443)
For `lerp.Scalar_out` weight (aka alpha) is not an optional argument, so no point in having those specializations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152443
Approved by: https://github.com/Skylion007
2025-04-29 20:42:21 +00:00
1d8cdf373b [dynamo] Guard serialization for NAME_MATCH (#152332)
Differential Revision: [D73780430](https://our.internmc.facebook.com/intern/diff/D73780430/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152332
Approved by: https://github.com/jansel
ghstack dependencies: #152325, #152326, #152327, #152328, #152329, #152330, #152331
2025-04-29 20:16:00 +00:00
5c297b2846 [dynamo] Guard serialization for DISPATCH_KEY_SET_MATCH (#152331)
Differential Revision: [D73780433](https://our.internmc.facebook.com/intern/diff/D73780433/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152331
Approved by: https://github.com/jansel
ghstack dependencies: #152325, #152326, #152327, #152328, #152329, #152330
2025-04-29 20:16:00 +00:00
4cb75d7afc [dynamo] Guard serialization for ID_MATCH (#152330)
Differential Revision: [D73780431](https://our.internmc.facebook.com/intern/diff/D73780431/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152330
Approved by: https://github.com/jansel
ghstack dependencies: #152325, #152326, #152327, #152328, #152329
2025-04-29 20:16:00 +00:00
0b39124ea3 [dynamo] Guard serialization for NONE_MATCH. (#152329)
Differential Revision: [D73780435](https://our.internmc.facebook.com/intern/diff/D73780435/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152329
Approved by: https://github.com/jansel
ghstack dependencies: #152325, #152326, #152327, #152328
2025-04-29 20:16:00 +00:00
ab4091a9fa [dynamo] Guard serialization for BOOL_MATCH. (#152328)
Differential Revision: [D73780434](https://our.internmc.facebook.com/intern/diff/D73780434/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152328
Approved by: https://github.com/jansel
ghstack dependencies: #152325, #152326, #152327
2025-04-29 20:16:00 +00:00
c521c45a8a [dynamo] Guard serialization for DICT_CONTAINS (#152327)
Adding serialization for DICT_CONTAINS

Differential Revision: [D73780432](https://our.internmc.facebook.com/intern/diff/D73780432/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152327
Approved by: https://github.com/jansel
ghstack dependencies: #152325, #152326
2025-04-29 20:16:00 +00:00
52202525b9 [dynamo] Guard serialization for DICT_VERSION (#152326)
I think we shouldn't support DICT_VERSION for 2 reasons:
1. dict version is not well defined across processes
2. they are pretty rare (only with pytree calls)

Differential Revision: [D73780437](https://our.internmc.facebook.com/intern/diff/D73780437/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152326
Approved by: https://github.com/jansel
ghstack dependencies: #152325
2025-04-29 20:16:00 +00:00
df663b9e72 [dynamo] Guard serialization for TYPE_MATCH (#152325)
Adding guard serialization for TYPE_MATCH

Differential Revision: [D73780438](https://our.internmc.facebook.com/intern/diff/D73780438/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152325
Approved by: https://github.com/jansel
2025-04-29 20:16:00 +00:00
a04f4622e1 [conda] Remove conda from lint-autoformat.yml (#152433)
Installs setuptools since I get
https://github.com/pytorch/pytorch/actions/runs/14736804186/job/41364832984#step:5:60
```
+ python3 -m tools.generate_torch_version --is_debug=false
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/ec2-user/actions-runner/_work/pytorch/pytorch/tools/generate_torch_version.py", line 9, in <module>
    from setuptools import distutils  # type: ignore[import]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named 'setuptools'
```
It should be a no op in the normal lint workflow since setuptools is in the docker image

Switched from using python3.10 to system python, which should be python3.9

Use venv to put deps not in the base?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152433
Approved by: https://github.com/huydhn
2025-04-29 20:14:21 +00:00
2cfc1faa27 [PT2]: fix add_passes and remove_passes naming issue (#152386)
Summary:
When defining pre_grad passes, they are initially defined as empty functions, then overridden in [customized_triton_kernel_passes.py](https://www.internalfb.com/code/fbsource/[b4eea3dcd7f22421e68a3c1533fd09a4281bc291]/fbcode/caffe2/torch/_inductor/fx_passes/fb/customized_triton_kernel_passes.py?lines=71-73). This causes issues for add_passes and remove_passes because `p.__name__` now may be prefixed by _.

This diff removes the leading _ to match the pass name.

Test Plan: Tested together with the next diff in the stack.

Reviewed By: oniononion36

Differential Revision: D73809937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152386
Approved by: https://github.com/huxintong
2025-04-29 20:07:15 +00:00
e58c73be44 Add latex settings (#152350)
- Fixes #147027
- Only lualatex can build our 3K pages PDF with reasonable quality, xelatex runs out of memory and pdflatex just fails.
- Move notes under the same toctree as python-api which is needed for the PDF but doesn't change how the HTML is generated.

This is the produced PDF:
[pytorch.pdf](https://github.com/user-attachments/files/19945450/pytorch.pdf)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152350
Approved by: https://github.com/albanD
2025-04-29 19:28:43 +00:00
e6e1ca1996 [easy] Fix test_dynamo_timed (#152387)
Summary: I'm just trying to fix the test again. It's out of date because it's disabled and some dynamo_timed-related fields are gone now.

Test Plan: `python test/dynamo/test_utils.py -k dynamo_timed`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152387
Approved by: https://github.com/anijain2305
2025-04-29 19:22:56 +00:00
8e2e06b7ea Fix shadow local variables (#152429)
Summary: Fixing shadow local variables error: P1798875650

Test Plan: CI

Differential Revision: D73853605

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152429
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-04-29 18:50:18 +00:00
a3123dd3ab Run link linters on modified files only or on everything when scheduled (#152377)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152377
Approved by: https://github.com/huydhn
2025-04-29 18:30:40 +00:00
8303860de7 Add detailed triton kernel logging to tlparse (#152197)
This PR adds detailed logging of each triton kernel we compile, and its autotune result, to every kernel we compile with triton. We add these results to a global variable that we then clear after each triton kernel compile.

We can't keep these objects around after compile time, so we can't record the autotune cache save or coordinate descent tuning, unfortunately, but we can log at least:
- The duration of compilation
- Whether or not autotune cache hit
- The best autotuning config, if there's only one.

Example triton kernel info: https://gist.github.com/jamesjwu/493bdd0f36b0b7e3ca327f87bd6c2c75

See internal diff for an example log for internal model.

Differential Revision: [D73674443](https://our.internmc.facebook.com/intern/diff/D73674443)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152197
Approved by: https://github.com/oulgen, https://github.com/eellison
2025-04-29 18:16:56 +00:00
d35e900c74 [MPSInductor] Make sure sizevars are computed (#152436)
Before calling the kernel

This fixes `GPUTests.test_float_repr_dynamic_shapes_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152436
Approved by: https://github.com/dcci
ghstack dependencies: #152363, #152430
2025-04-29 17:53:29 +00:00
835f95490f [MPSInductor] Fix type promotion in _print_Max (#152430)
Ran into this problem while re-enabling `test_float_repr_dynamic_shapes`, where `_print_Max` was called with an integer and a long argument, which resulted in the following compilation error:
```
error: call to 'max' is ambiguous
        out_ptr0[x0 + x1*metal::max(1, ks0)] = static_cast<float>(tmp26);
                         ^~~~~~~~~~
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/32023/Libraries/lib/clang/32023.619/include/metal/metal_integer:2477:16: note: candidate function
METAL_FUNC int max(int x, int y)
               ^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/32023/Libraries/lib/clang/32023.619/include/metal/metal_integer:3686:17: note: candidate function
METAL_FUNC long max(long x, long y)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152430
Approved by: https://github.com/dcci
ghstack dependencies: #152363
2025-04-29 17:53:29 +00:00
cce8b5d8d7 Refactor TritonTemplate.generate and move codgen part to generate_and_load (#151764)
Splitting https://github.com/pytorch/pytorch/pull/149267/ .
This first PR just refactors the code without adding any caching functionality.
The logic of generating the code and loading it is moved to generate_and_load() + some typing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151764
Approved by: https://github.com/drisspg, https://github.com/eellison
2025-04-29 17:44:46 +00:00
3962b8f1e0 Revert "[OpenReg] Add _lazy_init and rng_state support for OpenReg (#151914)"
This reverts commit 64a55b531f4f4ae2b35175ab5d9a30a856b0d6ef.

Reverted https://github.com/pytorch/pytorch/pull/151914 on behalf of https://github.com/malfet due to Looks like breaks number of ROCM jobs, see 797768cd90/1 ([comment](https://github.com/pytorch/pytorch/pull/151914#issuecomment-2839691038))
2025-04-29 17:36:12 +00:00
797768cd90 [Graph Partition] reorder for minimal number of partitions (#151968)
This pr adds an optimal reordering for minimizing #partitions.

## Optimal reordering for minimizing #partitions

A bfs could minimize #partitions (ignore peak memory for now):
1. For each node, compute node_to_indegree: dict[node, int].
2. Maintain 2 queues: cudagraphable_nodes, and non_cudagraphable_nodes. Iterate through all nodes and add nodes to one of these 2 queues if node_to_indegree[node] == 0.
3. While non_cudagraphable_nodes is not empty: Pop 1 node, schedule it, update the indegree of all its successors, and add its successor nodes to one of the queues if node_to_indegree[successor] == 0.
4. While cudagraphable_nodes is not empty: Pop 1 node, schedule it, update the indegree of all its successors, and add its successor nodes to one of the queues if node_to_indegree[successor] == 0.
5. Repeat step 3 & 4 until all nodes have been scheduled.

We call this strategy `reorder_for_minimizing_partition`.
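
Steps 1 to 5 can be sketched in plain Python as follows. The function signature, the `successors` mapping, and the `is_cudagraphable` predicate are assumptions for illustration; the real pass operates on scheduler nodes.

```python
from collections import deque


def reorder_for_minimizing_partition(nodes, successors, is_cudagraphable):
    # Step 1: indegree of every node.
    indegree = {n: 0 for n in nodes}
    for n in nodes:
        for s in successors.get(n, ()):
            indegree[s] += 1

    # Step 2: two queues, seeded with the nodes that have no predecessors.
    cudagraphable, non_cudagraphable = deque(), deque()

    def enqueue(n):
        (cudagraphable if is_cudagraphable(n) else non_cudagraphable).append(n)

    for n in nodes:
        if indegree[n] == 0:
            enqueue(n)

    schedule = []

    def drain(queue):
        # Steps 3 and 4: schedule nodes from one queue until it is empty,
        # releasing successors into whichever queue they belong to.
        while queue:
            n = queue.popleft()
            schedule.append(n)
            for s in successors.get(n, ()):
                indegree[s] -= 1
                if indegree[s] == 0:
                    enqueue(s)

    # Step 5: alternate between the two queues until everything is scheduled.
    while cudagraphable or non_cudagraphable:
        drain(non_cudagraphable)
        drain(cudagraphable)
    return schedule


order = reorder_for_minimizing_partition(
    nodes=["non_cg1", "cg2", "non_cg3"],
    successors={"non_cg1": ["cg2", "non_cg3"]},
    is_cudagraphable=lambda n: n.startswith("cg"),
)
```

Here both `cg2` and `non_cg3` become ready after `non_cg1`, and the non-cudagraphable queue is drained first, so the two non-cudagraphable nodes end up adjacent.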

**Q: Why is this optimal?**

Suppose this is not optimal, we have a counter example with 2 non_cudagraphable regions:

```
[non_cudagraphable1, cudagraphable2, non_cudagraphable3]
```

where we can reorder to only 1 non_cudagraphable region:

```
[non_cudagraphable1, non_cudagraphable3, cudagraphable2]
```

This reorder means non_cudagraphable3 does not depend on cudagraphable2. So after we scheduled non_cudagraphable1, both non_cudagraphable3 and cudagraphable2 have in_degree as 0. If this is true, Step 3 should have already scheduled non_cudagraphable3 before cudagraphable2 such that the counter example cannot exist.

This shows we cannot find such a counter example and the bfs is optimal on minimizing #partitions.

## Minimize peak memory

`reorder_for_peak_memory` currently uses topological_sort_dfs, topological_sort_lpmf, and topological_sort_bfs, where the latter 2 are bfs. ILP brings small benefits and it can hardly scale to more than 100 nodes, according to @xuanzhang816. So ILP is not used for peak memory reorder in the inductor.

Heuristics strategy:
- Conduct reorder_for_peak_memory as the default order
- Conduct reorder_for_minimal_partitions and get results as list[tuple[partition, bool]], where partition: list[BaseSchedulerNode] and bool for cudagraphable.
- If the reorder increases peak memory too much, we use the default order.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151968
Approved by: https://github.com/eellison
2025-04-29 17:17:16 +00:00
a77a44761b [BE] Remove dangling # in contributing.md (#152259)
I frequently come to CONTRIBUTING.md to copy paste the below snippet to rebuild pytorch which in zsh gives this error because zsh interprets # as a command. These comments add nothing so just removing

```
error: pathspec 'sync' did not match any file(s) known to git
error: pathspec 'the' did not match any file(s) known to git
error: pathspec 'submodules' did not match any file(s) known to git
Building wheel torch-2.8.0a0+git9c01c87
invalid command name '#'
```

```
git submodule update --init --recursive # very important to sync the submodules
python setup.py develop                 # then try running the command again
git submodule update --init --recursive
python setup.py develop
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152259
Approved by: https://github.com/janeyx99
2025-04-29 17:07:19 +00:00
de20d76622 [conda] Remove conda usage from upload test stats while running workflow (#152431)
The original uses python 3.10 and the base is 3.9 but I think that's ok
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152431
Approved by: https://github.com/atalman
2025-04-29 16:16:54 +00:00
f84062f78d [conda] Remove conda usage from TD llm retriever job (#152338)
Remove conda usage from TD llm retriever job

python3 in the base is python3.9 right now.  I'm not sure what the best way to deal with a potentially different python version would be, dnf install?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152338
Approved by: https://github.com/huydhn
2025-04-29 15:17:50 +00:00
663bcb68ba Implement metal kernel for basic MPS arithmetic ops using TensorIterator (#147644)
Add metal kernels for add, subtract, & lerp ops using TensorIterator. Should help resolve: https://github.com/pytorch/pytorch/issues/143874
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147644
Approved by: https://github.com/malfet
2025-04-29 14:24:49 +00:00
2fb62f8288 [Dynamo][Typing] Enable typing hints for tx in misc.py (#152412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152412
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-04-29 13:54:35 +00:00
49cbe0ffe9 [AOTInductor] Propagate ConstantType for main graph. (#152272)
Summary:
We need to make sure all named_parameters and named_buffers be
propagated if we use runtime constant folding.

Test Plan:
python test/inductor/test_aot_inductor.py -k test_constant_type_propagation


Pull Request resolved: https://github.com/pytorch/pytorch/pull/152272
Approved by: https://github.com/22quinn

Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-29 12:42:17 +00:00
64a55b531f [OpenReg] Add _lazy_init and rng_state support for OpenReg (#151914)
As the title stated.

**Changes**:
- Add get_rng_state & set_rng_state support for OpenReg
- Add _lazy_init support for OpenReg
- Remove redundant code for cuda/Module.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151914
Approved by: https://github.com/albanD
2025-04-29 11:18:12 +00:00
5c01302cc8 Remove 3.13 hack when installing TIMM (#152399)
A Docker build failure shows up at this step, triggered by the landing of https://github.com/pytorch/pytorch/pull/152362.  Here are example logs from https://github.com/pytorch/pytorch/actions/runs/14718029881/job/41305891896:

```
#37 29.72 + as_jenkins conda run -n py_3.13 pip install --progress-bar off --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu124
#37 29.72 + sudo -E -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env PATH=/opt/conda/envs/py_3.13/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64 conda run -n py_3.13 pip install --progress-bar off --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu124
#37 49.50 ERROR: Cannot install torch and torchvision==0.22.0.dev20250226+cu124 because these package versions have conflicting dependencies.
```

This happens because we stopped building 12.4 nightlies some time ago.  This hack doesn't apply anymore, so let's just remove it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152399
Approved by: https://github.com/cyyever, https://github.com/wdvr, https://github.com/malfet
2025-04-29 08:22:37 +00:00
eb69f4e609 Add lr_lambda type check in MultiplicativeLR (#151973)
Fixes #81554

## TestResult

### Before

```python
In [3]: import torch
   ...: class SimpleLinearModel(torch.nn.Module):
   ...:     def __init__(self):
   ...:         super(SimpleLinearModel, self).__init__()
   ...:         self.linear = torch.nn.Linear(10, 1)
   ...:
   ...:     def forward(self, x):
   ...:         return self.linear(x)
   ...:
   ...: net = SimpleLinearModel()
   ...: optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
   ...: scheduler = torch.optim.lr_scheduler.MultiplicativeLR(optimizer, 0.95)
   ...: for i in range(10):
   ...:     print(i, scheduler.get_last_lr())
   ...:     scheduler.step()
TypeError: 'float' object is not callable
```

### After

```python
   ...: scheduler = torch.optim.lr_scheduler.MultiplicativeLR(optimizer, 0.95)
TypeError: lr_lambda should be a function, but got float
```
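
The check itself is not shown in this message. A minimal sketch of such a validation, written as a standalone helper (`validate_lr_lambda` is a hypothetical name, and the handling of lists is an assumption), could be:

```python
def validate_lr_lambda(lr_lambda):
    """Fail early, at construction time, if a non-callable is passed."""
    # Assumption: a single function or a list/tuple of functions is accepted.
    lambdas = lr_lambda if isinstance(lr_lambda, (list, tuple)) else [lr_lambda]
    for fn in lambdas:
        if not callable(fn):
            raise TypeError(
                f"lr_lambda should be a function, but got {type(fn).__name__}"
            )
    return list(lambdas)


# The intended usage passes a function of the epoch, not a bare factor,
# e.g. MultiplicativeLR(optimizer, lr_lambda=lambda epoch: 0.95).
lambdas = validate_lr_lambda(lambda epoch: 0.95)

try:
    validate_lr_lambda(0.95)
    message = None
except TypeError as exc:
    message = str(exc)
```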

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151973
Approved by: https://github.com/janeyx99
2025-04-29 08:21:41 +00:00
dcd9a444b3 Add pack support and use micro gemm for Half flex attention on CPU (#151530)
Add pack support and use micro gemm for the second gemm to improve the performance for Half flex attention on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151530
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-04-29 07:24:00 +00:00
cyy
41bd0c900a [1/N] Deprecate c10::string_view and at::string (#151972)
The calls of `c10::string_view` in the code base are replaced by `std::string_view`. The calls of `at::string` are replaced by `std::string`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151972
Approved by: https://github.com/malfet
2025-04-29 07:23:52 +00:00
a6d19fcfac Revert "[cudagraphs] Fix issue in collecting static_input_idxs (#152287)"
This reverts commit 75a564608ab289edd5ba0e30a3acf544b90b5769.

Reverted https://github.com/pytorch/pytorch/pull/152287 on behalf of https://github.com/wdvr due to causing ao failures - discussed with author ([comment](https://github.com/pytorch/pytorch/pull/152287#issuecomment-2837686127))
2025-04-29 06:57:06 +00:00
62f1d0ea78 Log information about suppressed data dependent errors (#151041)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151041
Approved by: https://github.com/bobrenjc93
2025-04-29 06:08:07 +00:00
520366e102 Fix StringCoordView::substr after D73379178 / #151810 (#152304)
Received complaint that we broke something. After a bunch of debugging, landed on this test + fix.

Differential Revision: [D73754877](https://our.internmc.facebook.com/intern/diff/D73754877/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D73754877/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152304
Approved by: https://github.com/Skylion007
2025-04-29 06:00:38 +00:00
ad11d6378c Don't run NCCL/gloo distributed test without GPUs (#150764)
If there aren't any GPUs the WORLD_SIZE would be zero which does not work.
So skip those backends completely in that case.

Fix after https://github.com/pytorch/pytorch/pull/137161

It might make sense to still run the (CPU-) part of the tests by using something like `world_size = max(3, gpu_count)` or `num_gpus if num_gpus else 3` instead of skipping them all
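
The two suggested expressions are not equivalent, which a small sketch makes visible (`pick_world_size` and `cpu_fallback` are hypothetical names used only here):

```python
def pick_world_size(num_gpus, cpu_fallback=3):
    # Use one rank per GPU when GPUs exist, else a fixed CPU-only size.
    return num_gpus if num_gpus else cpu_fallback


no_gpu = pick_world_size(0)
two_gpus = pick_world_size(2)

# max(3, gpu_count) also avoids a zero world size, but on a 2-GPU host it
# asks for 3 ranks, more than the number of available GPUs.
via_max = max(3, 2)
```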

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150764
Approved by: https://github.com/kwen2501
2025-04-29 05:27:23 +00:00
99c42722f6 [MPS] fix memory leak in sdpa float32 (#152371)
Fixes #152344

The leak seems to be on the MPS Graph side: even though there is an identity tensor, it no longer seems to be enough to bypass the SDPA sequence, which appears to leak memory.

Even adding 0.0f seems to be optimized away, so the SDPA sequence is still taken (that's the reason for adding 1e-20).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152371
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-29 04:51:10 +00:00
46419c7899 Revert "[Relandx2] Rewrite the guts of torch::jit::Lexer to speed it up (#152372)"
This reverts commit 7ce6f632142b65849fa33f325c90a24bace2c130.

Reverted https://github.com/pytorch/pytorch/pull/152372 on behalf of https://github.com/malfet due to Looks like it broke distributed this time around, see f05d3e5019/1 ([comment](https://github.com/pytorch/pytorch/pull/152372#issuecomment-2837426497))
2025-04-29 04:37:40 +00:00
f05d3e5019 [torch-xpu-ops] Update torch-xpu-ops commit pin. (#152321)
Update the torch-xpu-ops commit to [655fa9bc7f88ab5bd3766b5f2fd5b43989c2caca](655fa9bc7f), including:

- Fixes batch_norm numeric error by adding additional boundary check
- Enable two operators: fft & jagged_to_padded_dense
- XCCL relevant changes:
- Cache cclStream to improve performance.
- Add support for complex datatypes in allgather and broadcast.
- Support coalescing operations and batch_isend_irecv.
- Introduce additional logging; use export TORCH_CPP_LOG_LEVEL=INFO.
- Fix #152296
- Fix #152020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152321
Approved by: https://github.com/EikanWang, https://github.com/Skylion007
2025-04-29 04:00:09 +00:00
119cdcc926 Add rich support to torch.distributed.tensor.debug.visualize_sharding (#152027)
Fixes https://github.com/pytorch/pytorch/issues/151857

Please verify this PR by running the following command on a computer with at least 4 GPUs.

```shell
torchrun --nproc_per_node=4 /w/pytorch/torch/distributed/tensor/examples/visualize_sharding_example.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152027
Approved by: https://github.com/wanchaol, https://github.com/wconstab
2025-04-29 03:51:32 +00:00
9c7b902cb2 [MPSInductor][BE] Make all reductions cacheable (#152363)
By moving the actual implementation to `_reduction_nocache` and making `reduction` a caching wrapper
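
A minimal sketch of that wrapper pattern, assuming a cache keyed on the call arguments (the class, the argument list, and the emitted strings are invented for illustration; only the two method names come from the message):

```python
class ReductionCodegen:
    def __init__(self):
        self._cache = {}
        self.emitted = []  # lines of "generated code"

    def _reduction_nocache(self, dtype, reduction_type, value):
        # The actual work: emit code once and return the result variable.
        var = f"tmp{len(self.emitted)}"
        self.emitted.append(f"{var} = {reduction_type}<{dtype}>({value})")
        return var

    def reduction(self, dtype, reduction_type, value):
        # Caching wrapper: identical requests reuse the earlier result.
        key = (dtype, reduction_type, value)
        if key not in self._cache:
            self._cache[key] = self._reduction_nocache(dtype, reduction_type, value)
        return self._cache[key]


cg = ReductionCodegen()
first = cg.reduction("float", "sum", "x")
second = cg.reduction("float", "sum", "x")
other = cg.reduction("float", "max", "x")
```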

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152363
Approved by: https://github.com/dcci
2025-04-29 02:49:22 +00:00
5a9868b78c Do not log exception when recording is disabled or already recording (#151038)
I am not sure why we log all exceptions here and re-raise them, but at least when recording is disabled this should be
transparent. In particular, logging dde could be spammy.

before:
<img width="995" alt="Screenshot 2025-04-10 at 12 47 31 PM" src="https://github.com/user-attachments/assets/f90d4557-d958-4558-a917-0d687366cad1" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151038
Approved by: https://github.com/bobrenjc93
2025-04-29 02:48:20 +00:00
b22fda9e1c Remove conda refs in tools (#152368)
Fixes #152126

Did not find references in the two .ipynb files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152368
Approved by: https://github.com/atalman
2025-04-29 02:45:47 +00:00
c8b4a39d73 Add precedence to the infix printing done by sympy_str. (#151920)
Add precedence to the infix printing done by sympy_str.

Without this change sympy_str will print the same string for both `a+b*(c+d)` and `(a+b)*(c+d)`.

While there I also cleaned up the printing for `-a` and `a - b`.

Added some tests.
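
How `sympy_str` implements this is not shown here. As a generic illustration of precedence-aware infix printing, the sketch below works on plain tuples rather than sympy expressions; `PREC` and `infix` are names made up for this example.

```python
PREC = {"+": 1, "-": 1, "*": 2}


def infix(expr, parent_prec=0):
    """Print ("op", lhs, rhs) tuples, adding parentheses only where needed."""
    if isinstance(expr, str):
        return expr
    op, lhs, rhs = expr
    prec = PREC[op]
    # The right operand of "-" needs parentheses even at equal precedence,
    # so that a - (b + c) does not print as a-b+c.
    rhs_prec = prec + 1 if op == "-" else prec
    text = f"{infix(lhs, prec)}{op}{infix(rhs, rhs_prec)}"
    return f"({text})" if prec < parent_prec else text


s1 = infix(("+", "a", ("*", "b", ("+", "c", "d"))))
s2 = infix(("*", ("+", "a", "b"), ("+", "c", "d")))
s3 = infix(("-", "a", ("+", "b", "c")))
```

Without the precedence comparison, `s1` and `s2` would both come out as the same string, which is the ambiguity described above.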

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151920
Approved by: https://github.com/jansel
2025-04-29 00:58:58 +00:00
4b61564252 Include CollectiveKernel in inductor debug visualization (#146561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146561
Approved by: https://github.com/eellison
ghstack dependencies: #152060
2025-04-29 00:53:38 +00:00
22f179d77d Use almalinux docker files for building Magma (#152358)
Resolves https://github.com/pytorch/pytorch/issues/151707 for CUDA Nvidia Magma builds.
Removes deprecated cuda 12.4 build.

Using `pytorch/manylinux2_28-builder` image for magma build creates circular dependency.

For a while for magma builds we used `conda-builder` image since it does not have circular dependency:
https://github.com/pytorch/builder/blob/release/2.4/magma/Makefile#L13
However during migration to pytorch/pytorch: https://github.com/pytorch/pytorch/pull/139888 we introduced circular dependency using Manylinux 2.28 docker image.

Hence we use the almalinux image, which is supposed to be a general-purpose image.

Please note: Magma builds use Docker build (see https://github.com/pytorch/pytorch/blob/main/.ci/magma/README.md ); we can look into migrating them to Docker images as a follow-up BE change if needed.

TODO: Make the same change for rocm builds. I believe some more work is required for rocm, since magma-rocm requires rocm dev, utils and lib to be installed: https://github.com/pytorch/pytorch/blob/main/.ci/docker/common/install_rocm.sh

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152358
Approved by: https://github.com/nWEIdia, https://github.com/huydhn
2025-04-29 00:45:01 +00:00
7ce6f63214 [Relandx2] Rewrite the guts of torch::jit::Lexer to speed it up (#152372)
Reapplying with fix for linux-manylinux-2_28-py3-cpu-s390x / build
failure
(https://github.com/pytorch/pytorch/actions/runs/14716285820/job/41300304223#logs),
which is to just update a pair of static_assert constants I got wrong.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152372
Approved by: https://github.com/wdvr, https://github.com/malfet
2025-04-28 23:55:48 +00:00
e5f4356a25 [inductor][fix] enable dtype promotion for bucketize (#150634)
Summary:
bucketization involves comparing an input with border values. Without careful consideration of dtypes, this can cause dangerous implicit casting.

aten.bucketize resolves this via dtype promotion. We enable dtype promotion for the inductor bucketization pass so as to maintain alignment with the aten op.
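
As a rough illustration of why the comparison dtype matters (pure Python, not the inductor code), the sketch below buckets a float input against integer boundaries. The `bucketize` helper and the sample values are assumptions for the example; the truncating path is a hypothetical stand-in for an implicit cast.

```python
from bisect import bisect_left


def bucketize(value, boundaries):
    """Index of the first boundary >= value (the left-closed search used here)."""
    return bisect_left(boundaries, value)


boundaries = [1, 2, 3]  # integer boundaries
value = 1.7             # floating-point input

# Comparison in the promoted (float) domain: 1 < 1.7 <= 2, so bucket 1.
promoted = bucketize(value, boundaries)

# Hypothetical implicit cast of the input to the boundaries' integer dtype:
# 1.7 truncates to 1, which lands in bucket 0 instead.
truncated = bucketize(int(value), boundaries)

print(promoted, truncated)
```

The two results differ, which is the kind of silent mismatch that dtype promotion avoids.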

Test Plan:
```
python3 test/inductor/test_torchinductor.py -k "bucketize"
```

Fixes #145929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150634
Approved by: https://github.com/davidberard98, https://github.com/eellison
2025-04-28 23:44:26 +00:00
119f64d0eb Add 'step' counter to visualize_overlap log (#152060)
Example of log after the change:

```
[rank0]:V0227 15:07:20.704000 1594243 torch/_inductor/comms.py:621] [0/0] [__overlap] ==== Visualize overlap after reordering pass <function group_copy_collective at 0x7f41c1922050> (ran in 0.026380538940429688 sec)====
[rank0]:V0227 15:07:20.705000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      0: GroupedSchedulerNode(name='op6_op7')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.705000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      1: GroupedSchedulerNode(name='op55_op56')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.705000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      2: GroupedSchedulerNode(name='op75_op76')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.706000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      3: GroupedSchedulerNode(name='op121_op122')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.706000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      4: GroupedSchedulerNode(name='op141_op142')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.706000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      5: GroupedSchedulerNode(name='op187_op188')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.706000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      6: GroupedSchedulerNode(name='op207_op208')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.707000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      7: GroupedSchedulerNode(name='op253_op254')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.707000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      8: GroupedSchedulerNode(name='op273_op274')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.707000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap]      9: GroupedSchedulerNode(name='op319_op320')  (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152060
Approved by: https://github.com/eellison
2025-04-28 23:23:21 +00:00
a6d38051ee [CUDA][CUTLASS] CUTLASS 3.9 submodule upgrade (#151253)
Originally authored by Jack Kosaian, likely needs #ifdefs if we want to preserve compat with 3.8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151253
Approved by: https://github.com/Skylion007, https://github.com/henrylhtsang

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-28 23:10:14 +00:00
75a564608a [cudagraphs] Fix issue in collecting static_input_idxs (#152287)
related to https://github.com/pytorch/pytorch/issues/152275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152287
Approved by: https://github.com/bdhirsh, https://github.com/eellison
2025-04-28 23:07:52 +00:00
63790a0c43 Speed-up time spent in generating shaped str keys (#152202)
Replaces the janky approach of creating an NSArray from the IntArrayRef and asking it to provide its contents in string format with the use of stringstream.

This speeds up the call for getting the key string for caching (or reading from cache) for shaped inputs by ~5x. While the actual wall time, depending on the number of input tensors, is only a few microseconds, it represents a non-negligible chunk of the overall time spent preparing to dispatch work to the GPU. And since this function gets called every time a (cacheable) operation in MPS is used, it should be a small but broadly impactful time saver.

Using mps_linear as an example. Note this is before PR https://github.com/pytorch/pytorch/pull/152199 so it only captures the CPU time spent in the op call:

Before the change:
```
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x1108f07d0>
func(*args, **kwargs)
  Median: 22.75 us
  IQR:    0.87 us (22.50 to 23.38)
  8361 measurements, 1 runs per measurement, 1 thread
```

After the change:
```
torch.linear time: <torch.utils.benchmark.utils.common.Measurement object at 0x108875350>
func(*args, **kwargs)
  Median: 18.67 us
  IQR:    0.46 us (18.50 to 18.96)
  10342 measurements, 1 runs per measurement, 1 thread
```

Which aligns with the observed change for getTensorStringKeys() taking ~1us instead of ~5us  in mps_linear op I got from a point measurement sandwiching the function call with `std::chrono::high_resolution_clock`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152202
Approved by: https://github.com/Skylion007
2025-04-28 23:06:10 +00:00
c81d8c231c Fix CosineAnnealingWarmRestarts reset T_cur (#151289)
Fixes #88791

## Test Result

```python
pytest test/optim/test_lrscheduler.py -k test_CosineAnnealingWarmRestarts
```

![image](https://github.com/user-attachments/assets/75ad238c-f319-47dc-bf2d-da05b0879b84)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151289
Approved by: https://github.com/janeyx99
2025-04-28 23:02:55 +00:00
0d99b4e9e2 ROCm: Enable tf32 testing on test_nn (#148945)
Add tf32 support for ROCm tests.
test command: python test/test_nn.py -v

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148945
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-04-28 23:01:04 +00:00
f3ef46e5fa [Dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/iter.py (#151789)
Part of #147913

Replace `unimplemented` with`unimplemented_v2` in `torch/_dynamo/variables/iter.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151789
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
2025-04-28 22:56:39 +00:00
d79e06723d Provide list of files to link linters if desired (#152352)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152352
Approved by: https://github.com/huydhn
2025-04-28 22:48:34 +00:00
c8540984a2 [inductor] set correct precompile start time (#152284)
Fixes #148777

With num_worker set to 1, ran script in #148777

before:
```
Precompiling benchmark choice TritonTemplateCaller took 0.19s
Precompiling benchmark choice TritonTemplateCaller took 0.38s
Precompiling benchmark choice TritonTemplateCaller took 0.53s
Precompiling benchmark choice TritonTemplateCaller took 0.90s
Precompiling benchmark choice TritonTemplateCaller took 1.29s
Precompiling benchmark choice TritonTemplateCaller took 20.78s
Precompiling benchmark choice TritonTemplateCaller took 25.42s
Precompiling benchmark choice TritonTemplateCaller took 25.92s
Precompiling benchmark choice TritonTemplateCaller took 27.21s
Precompiling benchmark choice TritonTemplateCaller took 48.76s
Precompiling benchmark choice TritonTemplateCaller took 53.66s
Precompiling benchmark choice TritonTemplateCaller took 63.12s
Precompiling benchmark choice TritonTemplateCaller took 69.53s
Precompiling benchmark choice TritonTemplateCaller took 71.24s
Precompiling benchmark choice TritonTemplateCaller took 75.57s
Precompiling benchmark choice TritonTemplateCaller took 97.58s
Precompiling benchmark choice TritonTemplateCaller took 107.71s
Precompiling benchmark choice TritonTemplateCaller took 117.27s
Precompiling benchmark choice TritonTemplateCaller took 126.30s
FX codegen and compilation took 133.733s
```

after:
```
Precompiling benchmark choice TritonTemplateCaller took 0.18s
Precompiling benchmark choice TritonTemplateCaller took 0.18s
Precompiling benchmark choice TritonTemplateCaller took 0.14s
Precompiling benchmark choice TritonTemplateCaller took 0.35s
Precompiling benchmark choice TritonTemplateCaller took 0.39s
Precompiling benchmark choice TritonTemplateCaller took 19.54s
Precompiling benchmark choice TritonTemplateCaller took 4.69s
Precompiling benchmark choice TritonTemplateCaller took 0.52s
Precompiling benchmark choice TritonTemplateCaller took 1.28s
Precompiling benchmark choice TritonTemplateCaller took 20.96s
Precompiling benchmark choice TritonTemplateCaller took 4.81s
Precompiling benchmark choice TritonTemplateCaller took 9.40s
Precompiling benchmark choice TritonTemplateCaller took 6.34s
Precompiling benchmark choice TritonTemplateCaller took 1.93s
Precompiling benchmark choice TritonTemplateCaller took 4.39s
Precompiling benchmark choice TritonTemplateCaller took 21.91s
Precompiling benchmark choice TritonTemplateCaller took 10.10s
Precompiling benchmark choice TritonTemplateCaller took 9.55s
Precompiling benchmark choice TritonTemplateCaller took 9.15s
FX codegen and compilation took 133.246s
```

Also tested async triton compile path by setting num_workers > 1
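
The "before" log is consistent with every duration being measured from one shared start time, so the reported values keep growing. That reading is an assumption; the sketch below only simulates the two measurement styles with a fake clock, and `time_tasks` is a hypothetical helper, not inductor code.

```python
def time_tasks(durations, shared_start):
    """Return the elapsed time reported per task on a simulated clock.

    durations: seconds each task runs, executed one after another.
    shared_start: if True, measure every task from one common start time
    (stand-in for the old behaviour); otherwise from the task's own start.
    """
    now = 0.0
    batch_start = now
    reported = []
    for duration in durations:
        task_start = now
        now += duration  # the task runs to completion
        start = batch_start if shared_start else task_start
        reported.append(round(now - start, 2))
    return reported


print(time_tasks([0.2, 0.3, 0.5], shared_start=True))   # cumulative values
print(time_tasks([0.2, 0.3, 0.5], shared_start=False))  # per-task values
```

With a shared start the last task appears to take as long as the whole batch, similar to the 126.30s entry above sitting next to a 133.7s total.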

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152284
Approved by: https://github.com/Skylion007, https://github.com/henrylhtsang
2025-04-28 22:30:35 +00:00
e7c19f4f69 Revert "Reapply "Rewrite the guts of torch::jit::Lexer to speed it up (#151850)" (#152250)"
This reverts commit e407ea1e5e22a41d14ce141295bf391cd46f2677.

Reverted https://github.com/pytorch/pytorch/pull/152250 on behalf of https://github.com/malfet due to Breaks s390, may be time to move build back to opt-in 2667cb69d9/1 ([comment](https://github.com/pytorch/pytorch/pull/152250#issuecomment-2836833030))
2025-04-28 22:05:12 +00:00
2667cb69d9 [inductor] align replicationpad on processing bool dtype with eager (#147666)
Fixes #143779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147666
Approved by: https://github.com/jansel
2025-04-28 21:54:31 +00:00
86b0271b00 Add CUDA 12.8 almalinux image, remove CUDA 12.4 almalinux (#152362)
This is general purpose image located in: https://hub.docker.com/r/pytorch/almalinux-builder
Updating it to match our supported CUDA matrix

Adding this build to use as general purpose image and use for Magma build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152362
Approved by: https://github.com/malfet
2025-04-28 21:15:05 +00:00
eqy
34b0de50a3 [TF32][CUDA] account for TF32 in test_linear_autograd (#152216)
Abate some more noise seen on blackwell

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152216
Approved by: https://github.com/Skylion007
2025-04-28 21:00:17 +00:00
ddff3d4f6b [inductor][invoke_subgraph] Run joint graph passes for inference (#152062)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152062
Approved by: https://github.com/eellison
ghstack dependencies: #151409, #151633, #151477, #151957, #151961
2025-04-28 20:42:55 +00:00
99b6c426a9 [Graph Partition] fix extra reference in runner.partitions to cudagraphify functions (#152066)
When CompiledFxGraph is deallocated, its cudagraphified fn (i.e., `current_callable`) is expected to also be deallocated.
Without graph partition, this is true since the cudagraphified fn is only referred to by compiled_fx_graph.current_callable.

However, with graph partition, runner.partitions holds the cudagraphified fns while compiled_fx_graph.current_callable holds runner.call. Thus the cudagraphified fn may not be deallocated when CompiledFxGraph is deallocated. This leads to errors in several unit tests (e.g., test_unaligned_static_input_no_cudagraphs and test_unaligned_static_input_non_trees).

In this PR, we also clean up runner.partitions when CompiledFxGraph is deallocated. This fixes the issue.
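
The general mechanism can be sketched in plain Python with `weakref`. The classes below are hypothetical stand-ins, and the sketch assumes the runner outlives the compiled graph; it is not the actual inductor code.

```python
import gc
import weakref


class CudagraphifiedFn:
    """Stand-in for a cudagraphified callable (hypothetical)."""


class Runner:
    """Stand-in for the partition runner; keeps references in `partitions`."""

    def __init__(self, fns):
        self.partitions = list(fns)


def is_freed_after_release(clear_partitions):
    """Drop the graph's reference and report whether the fn was deallocated."""
    fn = CudagraphifiedFn()
    probe = weakref.ref(fn)  # lets us observe deallocation without owning fn
    runner = Runner([fn])    # second owner that outlives the graph in this sketch
    graph = {"current_callable": fn}
    del fn
    graph.clear()            # "deallocate" the compiled graph
    if clear_partitions:
        runner.partitions.clear()  # the extra cleanup step
    gc.collect()
    return probe() is None


print(is_freed_after_release(clear_partitions=False))
print(is_freed_after_release(clear_partitions=True))
```

Only the variant that also clears `partitions` lets the object go away, which mirrors the cleanup added here.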

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152066
Approved by: https://github.com/eellison
2025-04-28 20:38:26 +00:00
728a6dd51c [Graph Partition] support ForeachKernelSchedulerNode (#152148)
ForeachKernelSchedulerNode misses outputs_by_name when created with previous nodes. This PR fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152148
Approved by: https://github.com/eellison
2025-04-28 20:38:22 +00:00
8e65310d49 [caffe2/c10/util/TypeIndex] Add '__CUDA_ARCH_LIST__' check (#152030)
Summary:
We suspect that switching the NVCC host compiler from GCC to Clang, while targeting multiple architectures, is causing issues because only `__CUDA_ARCH_LIST__` is being passed, without `__CUDA_ARCH__`.

To resolve this c10 compilation error, we should first fix the problem and then switch the NVCC host compiler from GCC to Clang. Once this is done, the errors no longer occur.

Test Plan: CI

Reviewed By: zhuhan0

Differential Revision: D73383236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152030
Approved by: https://github.com/cyyever, https://github.com/ZainRizvi
2025-04-28 20:31:23 +00:00
fcebaedebc Add a label to skip URL lint if needed (#152340)
Some URLs may be down due to server side issues we can't control
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152340
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-04-28 20:29:40 +00:00
33766de2d3 [Security] Advise against loading untrusted TorchScripts (#152336)
As a TorchScripted model is a Turing-complete program
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152336
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-04-28 20:18:56 +00:00
00ebbbb701 [cutlass backend] add addmm and bmm for cutlass backend benchmark (#152163)
Copying what @kadeng did.

```
FINAL results...

Experiment group: bmm (BS: 8, 1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 44.454172253608704 |  3.0991086587309837  |         NA          |
|        triton         | 44.06978189945221  | 0.07496077567338943  | -0.8646890374284049 |
| triton_persistent_tma | 43.598245829343796 | 0.06154991965740919  | -1.9254130284597197 |
|  cutlass_lvl_default  | 39.91834074258804  | 0.056073310784995556 | -10.20338762612423  |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: bmm (BS: 8, 1024x1024, 1024x1024) torch.bfloat16
+-----------------------+-------------------+----------------------+---------------------+
|         name          | forward_time (us) | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+-------------------+----------------------+---------------------+
|         aten          | 49.05610531568527 |  0.160279156640172   |         NA          |
|        triton         | 43.97720843553543 |  0.0660805031657219  | -10.353241145961718 |
| triton_persistent_tma | 43.94153505563736 | 0.061738294549286366 | -10.425960697724962 |
|  cutlass_lvl_default  | 40.2066633105278  | 0.034127906896173954 | -18.039430460713596 |
+-----------------------+-------------------+----------------------+---------------------+

Average edge over aten (max(-edge, 0), higher is better):
triton: 5.608965091695062 (from 2 valid values)
triton_persistent_tma: 6.175686863092341 (from 2 valid values)
cutlass_lvl_default: 14.121409043418913 (from 2 valid values)
```
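
The summary numbers can be reproduced from the tables with a few lines of Python. The values are copied from the output above; the `perf_over_aten_pct` formula is inferred from the table values rather than taken from the benchmark source, and both function names are made up for this sketch.

```python
from statistics import mean

# perf_over_aten (%) per experiment group (float16, bfloat16), copied from the tables.
perf_over_aten = {
    "triton": [-0.8646890374284049, -10.353241145961718],
    "triton_persistent_tma": [-1.9254130284597197, -10.425960697724962],
    "cutlass_lvl_default": [-10.20338762612423, -18.039430460713596],
}


def perf_over_aten_pct(forward_us, aten_us):
    """Relative forward-time difference versus aten, in percent (inferred formula)."""
    return (forward_us - aten_us) / aten_us * 100


def average_edge(values):
    """Mean of max(-edge, 0) over the experiment groups."""
    return mean(max(-v, 0.0) for v in values)


for name, values in perf_over_aten.items():
    print(f"{name}: {average_edge(values):.6f}")
```

For example, the float16 triton row gives `perf_over_aten_pct(44.06978189945221, 44.454172253608704)`, which is roughly -0.86, in line with the table.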

Differential Revision: [D73625766](https://our.internmc.facebook.com/intern/diff/D73625766/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152163
Approved by: https://github.com/jingsh
2025-04-28 20:16:17 +00:00
5f4c8e4c89 [inductor][tests] don't test for cpu if you want to use triton backend (#152227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152227
Approved by: https://github.com/clee2000
2025-04-28 19:43:56 +00:00
e407ea1e5e Reapply "Rewrite the guts of torch::jit::Lexer to speed it up (#151850)" (#152250)
Almost-exact reapply of #151850 (adding minor reviewer nits) . AFAICT it was reverted unnecessarily.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152250
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-04-28 19:33:40 +00:00
6b1acfa41b Fix redistribute new_local_tensor be None case (#152303)
as titled, we can just set new_local_tensor to be the local tensor and
remove the None check, as there would be cases where there's no
transformation needed (i.e. src_placements and dst_placements are the same,
and we still want to return the original local_tensor)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152303
Approved by: https://github.com/awgu
2025-04-28 19:00:17 +00:00
d3f8aa4378 [ez] Don't always pass HF token to fsspec (#151464)
Summary: The HF storage reader/writer component can work for any back-end in theory, so we shouldn't enforce the token to be passed into fsspecreader/writer, because the specific fsspec implementation may not handle tokens. Specifically, manifold doesn't accept a token arg, but we always pass one in, which is throwing an error.
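
The pattern can be sketched as follows. `build_fs_kwargs` and `NoTokenFileSystem` are hypothetical names for illustration only, not the checkpoint or fsspec API.

```python
def build_fs_kwargs(token=None, **extra):
    """Collect keyword arguments for a filesystem constructor.

    `token` is only included when the caller supplied one, so back-ends whose
    constructor has no `token` parameter never receive an unexpected argument.
    """
    kwargs = dict(extra)
    if token is not None:
        kwargs["token"] = token
    return kwargs


class NoTokenFileSystem:
    """Hypothetical back-end whose constructor does not accept `token`."""

    def __init__(self, root="/"):
        self.root = root


# Works because no token key is forwarded when none was given.
fs = NoTokenFileSystem(**build_fs_kwargs(root="/tmp"))
print(fs.root)
```

Passing `token="..."` unconditionally to `NoTokenFileSystem` would raise a `TypeError`, which is the failure mode this change avoids.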

Test Plan: signals

Differential Revision: D73130679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151464
Approved by: https://github.com/Skylion007
2025-04-28 18:52:20 +00:00
41a0c23c7c Skip test requiring MKL (#152322)
`test_reproduce_121253_issue_addmm_fusion_check` checks for "mkl._mkl_linear" being found in the generated source which cannot be there when MKL isn't available.
Add skip marker similar to other tests in this file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152322
Approved by: https://github.com/Skylion007
2025-04-28 18:29:24 +00:00
686dff0098 Fix an incorrect link markup (#152239)
Remove extra whitespace so the link works correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152239
Approved by: https://github.com/soulitzer
2025-04-28 18:28:08 +00:00
fcbbb03d48 Extend vec backend with BF16 SVE intrinsics (#143666)
- Following the work in https://github.com/pytorch/pytorch/pull/119571, BF16 SVE intrinsics are added to the Vectorized class, providing ~1.7x speedup on `silu` and `softmax`.
- Added bf16 detection in CMake
- Added a guard for native NEON code to prevent compilation errors

@aditew01 @maajidkhann please have a look

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143666
Approved by: https://github.com/malfet, https://github.com/aditew01, https://github.com/nikhil-arm

Co-authored-by: Aditya Tewari <aditya.tewari@arm.com>
2025-04-28 18:25:44 +00:00
0c52ee1b35 [DTensor] Error on illegal view op during sharding prop (#149764)
Adds explicit error checking during sharding propagation for view ops
rather than relying on runtime errors during local op execution.

Before:
An error is thrown by aten.view op called by DTensor dispatch, because
the local shard size is incompatible with the (incorrectly calculated)
args to the view op.

`RuntimeError: shape '[384]' is invalid for input of size 512`

After:
We raise more specific errors for cases of incompatible view operations
during sharding propagation, before getting to runtime dispatch.

`RuntimeError: Attempted to flatten an unevenly sharded dimension, which would require resharding the input. Please explicitly redistribute the tensor instead.`

Change Summary:

- add 'strict_view' kwarg to the helper methods that implement
  view/reshape op shard prop rules, so it can be decided op-by-op whether
  to raise these new errors
- enabled errors just for the 'view' op in this PR
- added two specific checks/errors that can occur during view ops.

Details:

- View ops are never allowed to flatten a dimension that is unevenly
  sharded, since that would likely change the size/content of the
  local_tensor and require redistribute
- View ops are also never allowed to flatten two dims if the rightmost
  dim is a Shard() placement, because it would cause contiguity errors
  without redistribution

Notes:

- Disables support for several ops in test_dtensor_ops.py test, which
  decompose to an illegal view that only works by performing a
  redistribution: cartesian_prod, flatten, ravel, reshape, reshape_as, view, view_as, take_along_dim, kron

Follow Ups:
- triage other view-like ops (besides aten::view) for using strict_view
- look for other gaps where view-like ops could still perform
  redistribution (ban them all, and document this)

Fixes #143372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149764
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
ghstack dependencies: #152045
2025-04-28 18:21:49 +00:00
efeed720a6 [DTensor] make test_dtensor_ops report dtensor_args (#152045)
Before:
Does not report DTensor args, and you can't tell which combination of
sharding/replication is used for that particular iteration

```
RuntimeError: failed to run: torch.flatten, with (*[tensor([[[-6.1074e-01,  1.1260e+00,  1.7686e+00, -7.8216e+
         [ 8.8558e-01, -3.0949e+00, -5.4584e+00, -8.5322e+00],
         [-2.9770e-01, -3.2814e+00, -7.5875e+00, -8.1269e+00],
         [-6.0136e+00, -5.1712e+00, -4.2667e+00, -4.2142e+00]],
        [[-7.5171e+00,  5.3900e+00, -7.9208e+00,  6.1000e+00],
         [-1.7350e+00, -3.6188e-03, -7.1592e+00,  9.2951e-02],
         [ 5.7143e+00, -3.0805e+00,  7.6227e+00, -7.4862e+00],
         [ 4.3167e-01, -4.9678e+00, -1.2441e+00, -2.3042e+00]],
        [[-7.4280e+00, -2.7754e+00, -5.2989e+00, -6.1920e+00],
         [-2.5225e+00, -5.2520e+00,  6.5686e+00, -6.0350e+00],
         [-5.1740e+00, -1.6405e+00, -4.4463e+00, -5.1884e+00],
         [ 3.9581e+00, -6.3151e-01, -3.3223e+00,  4.0546e+00]],
        [[-2.8112e+00,  3.8742e+00, -4.4612e+00, -5.0016e+00],
         [ 7.0568e+00, -2.0951e-01, -8.0049e+00, -4.1438e+00],
         [ 3.1207e+00, -7.6518e+00,  7.1084e+00, -1.0500e+00],
         [ 8.8823e+00, -1.1178e+00,  4.8485e+00, -8.8593e+00]]],
       requires_grad=True)], **{})
```

After:
You can see the particular DTensor spec that failed

```
RuntimeError: failed to run: torch.flatten, with (*[DTensor(local_tensor=tensor([[[-6.0136, -5.1712, -4.2667,
        [[ 0.4317, -4.9678, -1.2441, -2.3042]],
        [[ 3.9581, -0.6315, -3.3223,  4.0546]],
        [[ 8.8823, -1.1178,  4.8485, -8.8593]]], requires_grad=True),
        device_mesh=DeviceMesh('cpu', [0, 1, 2,3]), placements=(Shard(dim=1),))], **{})
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152045
Approved by: https://github.com/XilunWu
2025-04-28 18:21:48 +00:00
bb90f66e70 [CUDA][conv3d] bump tolerances for test_variant_consistency_eager conv3d complex64 (#152203)
~1/1000 1.5e-5 mismatch on A100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152203
Approved by: https://github.com/Skylion007, https://github.com/soulitzer
2025-04-28 17:59:37 +00:00
79e8dc7d53 Pin to SHA for actions outside of PyTorch (#152110)
Pin actions from repos external to the PyTorch project to their shasums for security. This is a best practice as Git tags are not immutable.

https://openssf.org/blog/2024/08/12/mitigating-attack-vectors-in-github-workflows/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152110
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi
2025-04-28 17:57:32 +00:00
2246cb6e14 Fix common_distributed.py to NOT set root logger (#152319)
Using `logging.basicConfig` to set the root logger's level is not good behavior, because it affects downstream third-party testing plugins. Fix common_distributed.py to set the level for the current logger only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152319
Approved by: https://github.com/Skylion007
2025-04-28 17:51:32 +00:00
8ce3d4a541 test(Conv3d): use correct class for test_Conv3d_module_same_padding (#152187)
The test for the class `Conv3d` is calling `Conv2d`. This PR just ensures that we are testing the correct module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152187
Approved by: https://github.com/Skylion007
2025-04-28 16:59:12 +00:00
c869862875 Remove cuda dependencies from non cuda buids (#152333)
These dependencies were added to fix a poetry issue on PyPI. However, including them creates an issue with poetry on download.pytorch.org, because poetry reads the first available wheel on the index for METADATA requirements. Hence the metadata requirements for CPU wheels can't list any cuda dependencies.

Injecting these dependencies via prep for pypi will need to be done via:
https://github.com/pytorch/test-infra/blob/main/release/pypi/prep_binary_for_pypi.sh

Ref: https://github.com/pytorch/pytorch/issues/152121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152333
Approved by: https://github.com/jeanschmidt, https://github.com/malfet
2025-04-28 16:46:44 +00:00
cbf8e0fb1a use statically known true instead of guard size oblivious in bmm and mm inductor decompositions . (#148893)
This was discussed with @eellison, and he recommended using statically_known_true here. The intuition is that we already have 0/1 specializations in place; if we reach those checks with dynamic shapes that are not already specialized,
then we do not want to specialize them: "a recompilation here is not justified".
These are all non-semantics-changing optimizations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148893
Approved by: https://github.com/eellison
2025-04-28 16:44:25 +00:00
6e5e9dc321 [benchmarking] Inc aarch64 bench shards to 15 (#152324)
The benchmark frequently times out with 12 shards, and it also feels like the shards are somewhat unbalanced.
For example, in https://github.com/pytorch/pytorch/actions/runs/14696840776/job/41239776679
shard 12 takes 3.6 hours, while shard 11 takes only 40 min.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152324
Approved by: https://github.com/janeyx99, https://github.com/atalman
2025-04-28 16:08:39 +00:00
4bdecd94ea [modefile free][long tail] selectify fbcode/caffe2/defs.bzl (#148925)
Summary:
replace read_config with select

For more info, please refer to the [doc](https://docs.google.com/document/d/1e0Hvht8WEHhcRvlCAodq_R9xnAtKBrAhdyvxcAqQjCw/edit?tab=t.hl8j18gza0cv)

Test Plan: CI

Reviewed By: malfet

Differential Revision: D70267850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148925
Approved by: https://github.com/malfet
2025-04-28 16:04:28 +00:00
9c864f9b0f Revert "[Inductor UT] Generalize device-bias code in test_flex_attention.py (#151937)"
This reverts commit 443840080265ce6133121c91d258b619eae151bb.

Reverted https://github.com/pytorch/pytorch/pull/151937 on behalf of https://github.com/malfet due to Broke ASAN tests, probably by enabling too many tests https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=asan&mergeEphemeralLF=true ([comment](https://github.com/pytorch/pytorch/pull/151937#issuecomment-2835151532))
2025-04-28 12:56:49 +00:00
0b6ea0b959 [xla hash update] update the pinned xla hash (#151210)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151210
Approved by: https://github.com/pytorchbot
2025-04-28 11:45:09 +00:00
7cae7902a2 Add scripts to check xrefs and urls (#151844)
Traverses the docs and code to find any broken links
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151844
Approved by: https://github.com/huydhn
2025-04-28 09:30:07 +00:00
7e8b9b3f51 ReducedPrecisionFloatGemvFastPathKernel: Correctly type parallel_for lambda arguments as int64_t (#152233)
This plus the previous irangeification PR seem like a better fix for #150637 than #150949 to me -- should make sure we are using 64-bit math for indexing everywhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152233
Approved by: https://github.com/Skylion007, https://github.com/cyyever
ghstack dependencies: #152232
2025-04-28 07:19:26 +00:00
3b7d6bbe8b irangeify ReducedPrecisionFloatGemvKernel.cpp (#152232)
We should be using irange, especially because we had 32-bit overflow issues in this file recently.
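
To show the kind of overflow at stake, the sketch below emulates 32-bit and 64-bit signed index arithmetic with `ctypes`. The helper names and the sample sizes are assumptions for illustration; this is not the kernel code.

```python
import ctypes


def offset_int32(row, stride):
    """Flat offset as if computed in 32-bit signed arithmetic (wraps on overflow)."""
    return ctypes.c_int32(row * stride).value


def offset_int64(row, stride):
    """Flat offset computed in 64-bit signed arithmetic."""
    return ctypes.c_int64(row * stride).value


row, stride = 50_000, 50_000  # product is 2_500_000_000, above 2**31 - 1
print(offset_int32(row, stride))  # wraps around to a negative offset
print(offset_int64(row, stride))  # the intended offset
```

Keeping loop indices and offsets in 64-bit types avoids this wraparound for large matrices.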

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152232
Approved by: https://github.com/Skylion007
2025-04-28 07:19:26 +00:00
ce00ec7ecf Enable max autotune for AOTInductor benchmark (#149309)
With this PR, AOTInductor can optionally run in max-autotune mode when benchmarking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149309
Approved by: https://github.com/desertfire

Co-authored-by: Gabriel Ferns <gabeferns@meta.com>
2025-04-28 06:54:26 +00:00
13966d0bf5 [BE] Migrate dtype_abbrs into one location (#152229)
Namely `torch.utils._dtype_abbrs.dtype_abbrs`

Before that it was defined in various forms of completeness in
c02edba863/torch/fx/graph.py (L215),
c02edba863/torch/testing/_internal/common_utils.py (L5226)
 and c02edba863/torch/testing/_internal/logging_tensor.py (L17)

TODO:
 - Add linter that `torch.testing._internal` module is not referenced from any of the public facing APIs, as it can have extra dependencies such as `expect_test`

Fixes https://github.com/pytorch/pytorch/issues/152225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152229
Approved by: https://github.com/clee2000, https://github.com/Skylion007
2025-04-28 03:52:47 +00:00
899eec665c [MPS] col2im kernel implementation (#152282)
Fixes #151820
Also requested in #141287

Mainly based on the cuda kernel implementations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152282
Approved by: https://github.com/malfet
2025-04-28 03:48:41 +00:00
2503843673 Add check for 2-dim mask to COO mask computation (#151940)
Follow up on discussion on https://github.com/pytorch/pytorch/pull/151794 Related to all fixes for https://github.com/pytorch/pytorch/issues/151351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151940
Approved by: https://github.com/Skylion007
2025-04-28 03:40:46 +00:00
4438400802 [Inductor UT] Generalize device-bias code in test_flex_attention.py (#151937)
@EikanWang @etaf @guangyey please take a look

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151937
Approved by: https://github.com/liangan1, https://github.com/drisspg
2025-04-28 03:07:23 +00:00
98bd2bd1ab Do not generate long log messages for suppressed data dependent errors. (#151023)
TORCH_LOGS="all" python test/test_dynamic_shapes.py -k test_guard_or_true

 before:
<img width="1065" alt="Screenshot 2025-04-10 at 9 55 27 AM" src="https://github.com/user-attachments/assets/3ee20de0-2902-4eb1-8ab0-80f1b974fb78" />

after:
<img width="1124" alt="Screenshot 2025-04-10 at 9 54 35 AM" src="https://github.com/user-attachments/assets/4e7e1f0c-856c-417f-8763-bfe183e2450d" />

Note: we actually do not expect to see a log at all, this is an orthogonal issue in recording where it logs each error seen
even when recording is not enabled? I will follow up with PR for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151023
Approved by: https://github.com/bobrenjc93
2025-04-28 00:39:52 +00:00
cyy
70d7638b0d Fix clang-tidy suppression in torch/csrc/jit (#152271)
Remove some clang-tidy suppression in torch/csrc/jit by applying fixes or refactoring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152271
Approved by: https://github.com/Skylion007, https://github.com/malfet

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-27 21:18:39 +00:00
c02edba863 Revert "Update OpenBLAS commit (#151547)"
This reverts commit c4b085475062270946eeec854aa54d0739c7a0c9.

Reverted https://github.com/pytorch/pytorch/pull/151547 on behalf of https://github.com/malfet due to It breaks all aarch64 tests ([comment](https://github.com/pytorch/pytorch/pull/151547#issuecomment-2833593427))
2025-04-27 18:58:35 +00:00
cyy
b34146a093 Fix initGdsBindings declaration (#152277)
Move initGdsBindings into the correct namespace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152277
Approved by: https://github.com/Skylion007
2025-04-27 17:04:56 +00:00
861945100e [Kineto] Enable OOM observer (#152160)
Summary:
# Context:
When a memory leak happens, it usually triggers the OOM in later iterations. A snapshot of the full iteration will be huge and hard to interpret.
On the CUDA side, there is an OOM observer that generates a snapshot with the latest 1,500,000 entries when an OOM happens, for debugging.

In this diff, we want to implement the same feature on the MTIA side.

Test Plan:
Run this test with last diff in the stack.
```
buck run @//mode/opt  kineto/libkineto/fb/mtia/integration_tests:mtia_memory_auto_trace_test
```

As shown, the memory_snapshot is generated when an OOM happens
Log: P1794792326
Snapshot: https://fburl.com/pytorch_memory_visualizer/lx73y6s3 {F1977402355}

Differential Revision: D71993315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152160
Approved by: https://github.com/sraikund16
2025-04-27 15:56:44 +00:00
c4b0854750 Update OpenBLAS commit (#151547)
Motivation: Update OpenBLAS and change the build script to enable SBGEMM kernels. Update pytorch `jammy` builds for aarch64 to use `install_openblas.sh` instead of `conda_install`

Link to full [TorchInductor Performance Dashboard AArch64](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2016%20Apr%202025%2009%3A35%3A26%20GMT&stopTime=Thu%2C%2017%20Apr%202025%2009%3A35%3A26%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(aarch64)&lBranch=adi/update_openblas&lCommit=90701ab81bf61fd864d31e0aa7e88d97a1a8676c&rBranch=main&rCommit=40ce4fb24a536d175348df876f61956d4945778e)

1. This shows a promising speedup across most of the HF models in benchmark, specifically giving a significant boost to SDPA layers.
2. Overall torch-bench pass-rate increased `[87%, 65/75 → 96%, 72/75]`
<img width="676" alt="Screenshot 2025-04-17 at 10 32 10" src="https://github.com/user-attachments/assets/a92dce0c-ecee-4466-8175-065df664dd71" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151547
Approved by: https://github.com/malfet
2025-04-27 15:55:42 +00:00
bb680b5a87 [MPSInductor] Fix masked_fill decomp (#152268)
By adding `mps` to the list of accelerators that can work with CPU scalars

Fixes `GPUTests.test_masked_fill_promotion_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152268
Approved by: https://github.com/kulinseth, https://github.com/dcci, https://github.com/Skylion007
ghstack dependencies: #152266
2025-04-27 15:50:46 +00:00
cbcf677223 [Dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/lists.py (#151873)
Part of #147913

Replace `unimplemented` with `unimplemented_v2` in `torch/_dynamo/variables/lists.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151873
Approved by: https://github.com/williamwen42

Co-authored-by: William Wen <william.wen42@gmail.com>
2025-04-27 11:59:45 +00:00
0423a7b322 [Dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/nn_module.py (#151895)
Part of #147913

Replace `unimplemented` with `unimplemented_v2` in `torch/_dynamo/variables/nn_module.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151895
Approved by: https://github.com/williamwen42

Co-authored-by: William Wen <william.wen42@gmail.com>
2025-04-27 11:54:42 +00:00
e2f9759bd0 Fix broken URLs (#152237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152237
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-04-27 09:56:42 +00:00
cbcc03c2ad [MPSInductor][BE] Only include headers when needed (#152266)
Store headers used by shader in `MetalKernel.headers`
Add headers when function depending on it gets invoked
Generate the majority of special ops from a template
Delete two unused functors: `entr` and `xlog1py` as they are decomposed by inductor anyway

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152266
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/dcci, https://github.com/cyyever
2025-04-27 05:09:50 +00:00
a0d440a26a [AOTI][reland] Remove typedef for half and bfloat16 (#151109)
Summary: Reland https://github.com/pytorch/pytorch/pull/150657

typedef is prone to name collision. Explicitly spell out the actual aten types, needed for the libtorch-free codegen.

Differential Revision: [D72878456](https://our.internmc.facebook.com/intern/diff/D72878456)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151109
Approved by: https://github.com/angelayi
2025-04-26 23:17:35 +00:00
225742838b Add an additional check to trigger graph break for sparse tensor (#151897)
Fixes #151522

This PR fixes the issue that Dynamo fails to trigger a graph break for sparse tensors in certain code paths. I added an additional check to handle this case, and it resolves the original problem.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151897
Approved by: https://github.com/jansel
2025-04-26 21:02:32 +00:00
e4a1a16bef Check integrity of bytes in AppendingByteSerializer (#152139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152139
Approved by: https://github.com/zou3519
2025-04-26 18:10:58 +00:00
9480ed4cd3 Fix typos in multiple files (#152254)
Fix typos in multiple files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152254
Approved by: https://github.com/Skylion007
2025-04-26 17:18:39 +00:00
6a62356857 [BE][Easy]: Change typing to DimsType in dim_reduction (#151677)
Use prims_common DimsType to reduce duplication of DType

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151677
Approved by: https://github.com/albanD
2025-04-26 16:59:32 +00:00
203201255f [dynamo] remove dead code for DATA_PTR_MATCH (#152206)
Summary: Seems this guard is not created anywhere

Test Plan: CI

Differential Revision: D73682084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152206
Approved by: https://github.com/anijain2305, https://github.com/jansel
2025-04-26 15:25:01 +00:00
ee8166e94f Correctly handle duplicated arguments when merging input views. (#146275)
Fix: #135099

This PR changes how we map the original inputs into the new set of
inputs that take in the tensor input's base instead of their aliases.

**Problem:** in order to create this mapping, we had a dictionary that
mapped the hashed arguments into their respective indices. However, if
there's a group of equal arguments, we will have only one mapping for
such an argument. This breaks the assumption that there will be one
mapping for each argument.

**Solution:** map the hashed arguments into a list of indices. Then, we
will be able to correctly reconstruct the parameters for the new calling
convention.
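
The difference between the two mappings can be sketched in plain Python. This is a toy illustration under simplifying assumptions, not the actual AOTAutograd code: strings stand in for tensor arguments, and `index_by_value` and `index_lists_by_value` are hypothetical helper names.

```python
from collections import defaultdict


def index_by_value(args):
    # Old approach: one index per distinct argument, so duplicates overwrite each other.
    return {arg: i for i, arg in enumerate(args)}


def index_lists_by_value(args):
    # New approach: every position of a repeated argument is kept.
    positions = defaultdict(list)
    for i, arg in enumerate(args):
        positions[arg].append(i)
    return dict(positions)


args = ["x", "y", "x"]  # "x" is passed twice
single = index_by_value(args)
multi = index_lists_by_value(args)

# Positions that can be recovered from each mapping.
covered_single = sorted(single.values())
covered_multi = sorted(i for idxs in multi.values() for i in idxs)
print(covered_single)
print(covered_multi)
```

With the single-index mapping one of the three positions is lost, while the list-valued mapping covers every original argument position.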
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146275
Approved by: https://github.com/bdhirsh
2025-04-26 14:50:16 +00:00
580913290c [Easy] The event_id of torch.cuda.Event and torch.xpu.Event always is 0 (#151226)
Although torch.cuda.Event and torch.xpu.Event have cuda_event and sycl_event fields respectively, the event_id exposed from the base class torch.Event is always 0, which can confuse users.

The memory of torch.Event is not useful to torch.cuda.Event and torch.xpu.Event, but we still need to inherit from torch.Event because CPython will check it.

Repro with cuda:
```
>>> import torch
>>> event = torch.cuda.Event()
>>> event.cuda_event
0
>>> event.event_id
0
>>> event.record()
>>> event.cuda_event
127982096
>>> event.event_id
0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151226
Approved by: https://github.com/albanD, https://github.com/guangyey
ghstack dependencies: #151404, #151221, #151411
2025-04-26 14:18:22 +00:00
2ce9d2e9aa [MPS/inductor] Adjust test_to_dtype_mps so that it works on the backend. (#152230)
float64 isn't supported for MPS, but we can still test the functionality with another type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152230
Approved by: https://github.com/malfet, https://github.com/jansel
2025-04-26 13:54:53 +00:00
0f9b02c839 [Easy][torch.Event] Fix and improve the docs of torch.Event (#151411)
**Changes:**
- add detailed function or class signature
- fix the wrong display of torch.Event.wait and torch.Event.record
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151411
Approved by: https://github.com/albanD
ghstack dependencies: #151404, #151221
2025-04-26 13:52:38 +00:00
bd7dc1b17d [Easy] Fix the function signature of torch.Event (#151221)
As the title stated.

The difference between declaration and implementation.
declaration:
d5a19e4525/torch/_C/__init__.pyi.in (L157-L162)

Implementation:
d5a19e4525/torch/csrc/Event.cpp (L30-L32)

**Question**: Which one should we choose?
- Change enable_timing to False to be consistent with torch.cuda.Event
- Change enable_timing to True to avoid BC-break
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151221
Approved by: https://github.com/albanD
ghstack dependencies: #151404
2025-04-26 13:51:56 +00:00
4a46ee96d2 [Inductor Remote Cache] Raise an exception if redis module is required but not available (#151779)
If we need redis but redis is not available, it is better to tell the user to install redis instead of continuing silently.
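
A minimal sketch of the fail-loudly pattern, assuming a generic optional dependency. This is not the Inductor remote cache code; `require_module` and its parameters are hypothetical names used only for illustration.

```python
import importlib.util


def require_module(name: str, feature: str) -> None:
    """Raise a clear error if an optional dependency is missing."""
    if importlib.util.find_spec(name) is None:
        raise RuntimeError(
            f"{feature} requires the '{name}' module, but it is not installed. "
            f"Install it (for example with `pip install {name}`) or disable {feature}."
        )


# 'json' ships with Python, so this call returns without raising.
require_module("json", "demo feature")
```

Checking up front and raising gives the user an actionable message, instead of a cache that is silently never used.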

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151779
Approved by: https://github.com/aorenste
2025-04-26 11:21:54 +00:00
8d427e9e76 [AOTInductor] Inherit Buffer if not being updated (#152092)
Summary: Inherit buffer from original constants buffer if it's not being updated.

Test Plan: TBD

Differential Revision: D73571260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152092
Approved by: https://github.com/kflu, https://github.com/jingsh
2025-04-26 04:28:23 +00:00
d22c4cc353 Add option to use mempool on OOM (#151487)
MemPool is a separate pool of memory handled by the caching allocator. This PR adds the option to let the caching allocator try to use this pool as a last resort instead of OOMing by associating a use_on_oom bool with each MemPool.

Usage:
Users can optionally specify a ``use_on_oom`` bool (which is False by default) during MemPool creation. If true, then the CUDACachingAllocator will be able to use memory in this pool as a last resort instead of OOMing.

```
pool = torch.cuda.MemPool(allocator, use_on_oom=True)
with torch.cuda.use_mem_pool(pool):
    a = torch.randn(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")
del a
# at the memory limit, this will succeed by using pool's memory in order to avoid the oom
b = torch.randn(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")
```

Testing:
```
python test/test_cuda.py -k test_mempool_limited_memory_with_allocator
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151487
Approved by: https://github.com/eqy, https://github.com/syed-ahmed, https://github.com/ngimel
2025-04-26 04:04:57 +00:00
cyy
65b845f82b Remove useless options for third-party ONNX build (#147616)
Treat ONNX CMake targets properly and remove unneeded options.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147616
Approved by: https://github.com/malfet
2025-04-26 02:34:08 +00:00
d9d306e8e9 Fix inductor test_linear_with_in_out_buffer (#151548)
Without MKL there is only 1 epilogue, not 2 because `addmm` is used instead of `packed_linear/_mkl_linear`.
This fails first at `TestSelectAlgorithmCPU.test_linear_with_in_out_buffer_batch_size_8_in_features_3_in_features2_192_image_size_224_out_features_64_bias_True_cpu_float32`

Instead of skipping the whole test just adjust the count for the single check.

Final numbers of `test/inductor/test_cpu_select_algorithm.py` without MKL:
```
Ran 1337 tests
OK (skipped=1211)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151548
Approved by: https://github.com/jansel
2025-04-26 01:53:34 +00:00
0e015ef116 [ROCm][Windows] Fix HIP Caffe2 Tests (#152014)
Solves the following problems of caffe2 HIP tests building on Windows:
1. HIP tests now use `hip_add_executable` to be built with custom_command invoking hip compiler, due to lack of cmake support for HIP in 3.18 (currently used).
2. failing with "Command line too long" which resulted from `hip_add_executable` adding the same flags over and over on top of `HIP_HIPCC_FLAGS` with every test added.
3. Disables `HasSameArgTypes` test on Windows, as `at::native::modern::detail` is nowhere to be found in the codebase (I think it must be a legacy thing). Perhaps the whole test should be removed/rewritten?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152014
Approved by: https://github.com/jeffdaily
2025-04-26 01:35:46 +00:00
3ef6d6924a [BE] Switch TestConsistency to MPS device (#147893)
Which will eventually allow moving more decorators away from `common_mps.py`

Adjust tolerances accordingly. XFAIL a bunch of tests on MacOS-13, which is going to be deprecated anyway

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147893
Approved by: https://github.com/atalman
ghstack dependencies: #152204
2025-04-26 01:19:21 +00:00
73f11e3365 [BE] Do not allow PyTorch codebase to use c10::optional (#150464)
Extensions can still rely on it, and we should decorate it with deprecated, but it is a C++20 feature.
XPU still uses it, so exclude XPU builds  until https://github.com/intel/torch-xpu-ops/pull/1615 is merged

Test plan:
 - 0def9b4acc should fail MPS builds
 ```
/Users/ec2-user/runner/_work/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm:975:44: error: no template named 'optional' in namespace 'c10'; did you mean 'std::optional'?
                                           c10::optional<int64_t> extra) {
                                           ^~~~~~~~~~~~~
                                           std::optional
```
 - a769759dd4 should fail CUDA builds
 ```
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/CUDASymmetricMemoryOps.cu(530): error: namespace "c10" has no member "nullopt"
        input, c10::nullopt, reduce_op, group_name, out);
                    ^

1 error detected in the compilation of
```

Fixes https://github.com/pytorch/pytorch/issues/150313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150464
Approved by: https://github.com/atalman
2025-04-26 01:15:53 +00:00
4647658247 [PT2] - Allowlist should have precedence (#151942)
Summary: When working on List[List[int]], the ints were being considered Constants regardless of their inclusion on the allowlist.

Test Plan:
CI + new test

https://www.internalfb.com/intern/testinfra/testrun/5066549856504774

Differential Revision: D73137631

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151942
Approved by: https://github.com/laithsakka
2025-04-26 00:58:43 +00:00
fa1b4ef649 Revert "Rewrite the guts of torch::jit::Lexer to speed it up (#151850)"
This reverts commit 47d34261e06e2416e7a1e7d51a3d428e4ea51f9d.

Reverted https://github.com/pytorch/pytorch/pull/151850 on behalf of https://github.com/ZainRizvi due to This codev PR is breaking  on it's internal counterpart diff D73129443.  For codev PRs like this one, please always make sure the internal diff is green and then land the diff internally. The Github PR will be automatically merged ([comment](https://github.com/pytorch/pytorch/pull/151850#issuecomment-2831686141))
2025-04-26 00:44:11 +00:00
47d34261e0 Rewrite the guts of torch::jit::Lexer to speed it up (#151850)
The trie-based approach was, apparently, not efficient. This incidentally fixes a bug where "not inp" and "is note" were lexed incorrectly; see test_lexer.cpp update.
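
A toy Python sketch of the kind of mis-lexing mentioned here. It assumes the bug came from matching the two-word operators `not in` and `is not` as prefixes without checking that the second word ends at a word boundary; that cause is an assumption, and the regex-based `lex` function below is hypothetical and unrelated to the real C++ lexer.

```python
import re

# Multi-word operators are tried first, then plain identifiers.
# The trailing \b means "not in" only matches when "in" is a whole word.
TOKEN = re.compile(r"\s*(is\s+not\b|not\s+in\b|[A-Za-z_]\w*)")


def lex(source: str) -> list[str]:
    source = source.rstrip()
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        # Normalize internal whitespace so "is   not" becomes "is not".
        tokens.append(" ".join(match.group(1).split()))
        pos = match.end()
    return tokens


print(lex("not inp"))
print(lex("x is not y"))
```

Without the `\b` checks, `not inp` would be split into the operator `not in` followed by a stray `p`.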

Differential Revision: [D73129443](https://our.internmc.facebook.com/intern/diff/D73129443/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151850
Approved by: https://github.com/Skylion007
ghstack dependencies: #151801, #151802, #151803, #151804, #151805, #151806, #151807, #151810, #151849
2025-04-25 23:49:35 +00:00
0f765773e3 Revert "[BE] Do not allow PyTorch codebase to use c10::optional (#150464)"
This reverts commit 490ef768cff448080083a46f362053e025f6b95b.

Reverted https://github.com/pytorch/pytorch/pull/150464 on behalf of https://github.com/clee2000 due to broke xpu [GH job link](https://github.com/pytorch/pytorch/actions/runs/14674243034/job/41187443432) [HUD commit link](490ef768cf)? ([comment](https://github.com/pytorch/pytorch/pull/150464#issuecomment-2831608162))
2025-04-25 23:34:56 +00:00
6aa92806db [CP] Use TorchFunctionMode to dispatch SDPA for CP (#147902)
While we prefer not to use monkey patching to dispatch SDPA, TorchFunctionMode is currently not compatible with selective activation checkpointing (https://github.com/pytorch/pytorch/issues/147995). This PR adds `TorchFunctionMode` to the CP code and makes it configurable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147902
Approved by: https://github.com/XilunWu
2025-04-25 23:33:48 +00:00
e28864fc0f [MPS/inductor] Fix the approximation of polygamma for n == 0. (#152214)
Fixes #152205

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152214
Approved by: https://github.com/malfet
2025-04-25 22:42:45 +00:00
cf101d66ee Add simple direct C++ tests for torch::jit::Lexer (#151849)
We have test_jit.py, but given that I'm working on
significant changes to the lexer, it seems nice to have direct C++
tests. (Also, writing the tests caught a pair of related bugs; see the
two tests with "Bug" in their name. The rewrite will fix them.)

Differential Revision: [D73402367](https://our.internmc.facebook.com/intern/diff/D73402367/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151849
Approved by: https://github.com/malfet
ghstack dependencies: #151801, #151802, #151803, #151804, #151805, #151806, #151807, #151810
2025-04-25 22:39:49 +00:00
490ef768cf [BE] Do not allow PyTorch codebase to use c10::optional (#150464)
Extensions can still rely on it, and we should decorate it with deprecated, but it is a C++20 feature

Test plan:
 - 0def9b4acc should fail MPS builds
 ```
/Users/ec2-user/runner/_work/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm:975:44: error: no template named 'optional' in namespace 'c10'; did you mean 'std::optional'?
                                           c10::optional<int64_t> extra) {
                                           ^~~~~~~~~~~~~
                                           std::optional
```
 - a769759dd4 should fail CUDA builds
 ```
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/CUDASymmetricMemoryOps.cu(530): error: namespace "c10" has no member "nullopt"
        input, c10::nullopt, reduce_op, group_name, out);
                    ^

1 error detected in the compilation of
```

Fixes https://github.com/pytorch/pytorch/issues/150313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150464
Approved by: https://github.com/atalman
2025-04-25 22:03:48 +00:00
9e50c21e27 Fix xrefs (#151888)
Fix existing cross references and remove old ones

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151888
Approved by: https://github.com/eqy, https://github.com/huydhn, https://github.com/svekars
2025-04-25 21:27:27 +00:00
1aa971a3bb [ROCm] Implemented dropout usage for RNN with MIOpen backend (#144572)
This PR fixes https://github.com/pytorch/pytorch/issues/107183 for ROCm.

Implemented the usage of new RNN descriptor for MIOpen backend that takes into account dropout rate value using dropout descriptor. This fixes associated test_RNN_dropout_state test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144572
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-25 21:06:45 +00:00
2c5c793085 [Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404)
As the title stated

**Changes:**
- Add **record**, **query** and **enable_timing** check
- Add related tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151404
Approved by: https://github.com/albanD
2025-04-25 20:15:04 +00:00
91c590f048 [ONNX] add converters for sym_min, sym_max (#152196)
Conversion of Phi4-multimodal-instruct fails because of missing converters for torch.sym_max and torch.sym_min.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152196
Approved by: https://github.com/justinchuby
2025-04-25 20:01:05 +00:00
9336608307 BM FM FlashAttention Test (#151974)
Reviewed By: joebos

Differential Revision: D72880307

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151974
Approved by: https://github.com/yoyoyocmu, https://github.com/Skylion007, https://github.com/malfet
2025-04-25 19:24:25 +00:00
8542d55f0c [logging] Clean up dynamo_timed usages in cudagraph_trees (#152136)
Summary: I'm investigating differences in total torch.compile overhead in our two main internal sources: dynamo_compile and pt2_compile_events. One source of discrepancy is due to cudagraphs overheads. Currently, we have a context manager that optionally attributes a dynamo_timed region to a cudagraph-related column logged to dynamo_compile, but _all_ dynamo_timed regions show up in pt2_compile_events (hence the discrepancy; pt2_compile_events is overcounting). We could filter out these specific events from pt2_compile_events when measuring overall overhead. But I'm going to argue that those timed regions that we DO NOT consider as a compiler-related overhead don't have much value in logging in the first place. So I'm suggesting we just remove those instances.

Here's the production job with the discrepancy:
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/3604eypl
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/c2dv8sty

Test Plan:
torchbench nanogpt:
* tlparse: https://fburl.com/h1n2ascc
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/u37yrynp
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/s7avd0di

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152136
Approved by: https://github.com/BoyuanFeng
2025-04-25 19:18:12 +00:00
1bc0e2579d [aarch64] Fixes to build with ArmPL's cblas.h (#151126)
Summary:
Various fixes to make fbcode work w/ ArmPL's cblas header:
1) Avoid re-declaring prototypes for internal blas methods which ArmPL already declares.
2) Fix `std::complex` conversion when using these methods.
3) Drop `extern "C"` around include of `cblas.h`.

Test Plan: CI

Differential Revision: D72808561

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151126
Approved by: https://github.com/Skylion007
2025-04-25 19:02:28 +00:00
56190d2577 [MPS] Fix ICE for entr bool instantiation on M1/M2 (#152204)
By instantiating it implicitly, otherwise attempts to run something like
```
% python3 -c "import torch; print(torch.special.entr(torch.testing.make_tensor(10, dtype=torch.bool, device='mps')))"
```
will fail with
```
Failed to created pipeline state object, error: Error Domain=AGXMetalG14X Code=3 "Compiler encountered an internal error"
```

Similar in spirit to https://github.com/pytorch/pytorch/pull/149123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152204
Approved by: https://github.com/dcci
2025-04-25 19:00:49 +00:00
d7eb3a492c [Typing] Enable torch.types.IntLikeType / FloatLikeType / BoolLikeType (#152157)
### Changes

Replace `Union[SymInt, int]` and `Union[int, SymInt]` with `IntLikeType`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152157
Approved by: https://github.com/Skylion007
2025-04-25 19:00:10 +00:00
85bfaf8cc5 Package const folded graph's cubin file (#152145)
Summary: We need to package the const folded graph's cubin file into the final .pt2 package.

Fix https://github.com/pytorch/pytorch/issues/152067

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r test_constant_folding_cuda
```

Differential Revision: D73626480

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152145
Approved by: https://github.com/henrylhtsang, https://github.com/desertfire
2025-04-25 18:38:32 +00:00
a5f2fd1017 Unskip index_put in cudagraphs (#152186)
The repro from the original skip in https://github.com/pytorch/pytorch/pull/105439 no longer fails, so unskip it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152186
Approved by: https://github.com/Skylion007
2025-04-25 18:15:49 +00:00
bcf1031cb8 [ROCm] Fixes to enable VM-based MI300 CI runners (#152133)
New VM-based MI300 CI runners tested in https://github.com/pytorch/pytorch/pull/151708 exposed some issues in CI that this PR fixes:

* HSAKMT_DEBUG_LEVEL is a debug env var that was introduced to debug driver issues. However, in the new MI300 runners being tested, since they run inside a VM, the driver emits a debug message `Failed to map remapped mmio page on gpu_mem 0` when calling `rocminfo` or doing other GPU-related work. This results in multiple PyTorch unit tests failing when doing a string match on the stdout vs expected output.

* HSA_FORCE_FINE_GRAIN_PCIE was relevant for rccl performance improvement, but is not required now.

* amdsmi doesn't return metrics like [power_info](https://rocm.docs.amd.com/projects/amdsmi/en/latest/reference/amdsmi-py-api.html#amdsmi-get-power-cap-info) and [clock_info](https://rocm.docs.amd.com/projects/amdsmi/en/latest/reference/amdsmi-py-api.html#amdsmi-get-clock-info) in a VM ("Guest") environment. Return 0 as the default in cases where amdsmi returns "N/A"

* amdsmi throws an exception when calling `amdsmi.amdsmi_get_clock_info` on the VM-based runners. Temporarily skipping the unit test for MI300 until we find a resolution.
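
The "return 0 when amdsmi reports N/A" fallback can be sketched as a small helper. This is only an illustration: it does not call amdsmi, and the dictionary keys and the `metric_or_zero` name are hypothetical.

```python
def metric_or_zero(info: dict, key: str) -> int:
    """Return info[key] as an int, or 0 when the value is missing or "N/A"."""
    value = info.get(key, "N/A")
    if value == "N/A":
        return 0
    return int(value)


# Hypothetical readings: a bare-metal host reports a number, a VM guest reports "N/A".
host_info = {"power_cap": 750}
guest_info = {"power_cap": "N/A"}
print(metric_or_zero(host_info, "power_cap"))
print(metric_or_zero(guest_info, "power_cap"))
```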

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152133
Approved by: https://github.com/jeffdaily
2025-04-25 18:06:48 +00:00
0dae27d75b Turn on static cuda launcher in OSS (#151691)
After a few small bugfixes on tests (to make it so we throw/catch similar exceptions to triton), I think we're ready to flip the switch and turn StaticCudaLauncher on by default in OSS.

Initial round of benchmarks look good, with average compilation time going down by a few percent:
<img width="828" alt="image" src="https://github.com/user-attachments/assets/cad03e09-b4d6-49a7-a9e5-6068d1c0bd5c" />

With no changes to runtime perf:
<img width="823" alt="image" src="https://github.com/user-attachments/assets/3fcd435e-1057-43f4-878b-8d66a3812a10" />

There are a few noisy models I want to double check, though, so will run some more tests before accepting review.

Full benchmark results, showing a ~5% compile time improvement across the board:
https://hud.pytorch.org/benchmark/huggingface/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Wed%2C%2016%20Apr%202025%2002%3A31%3A12%20GMT&stopTime=Wed%2C%2023%20Apr%202025%2002%3A31%3A12%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/jamesjwu/139/orig&lCommit=cc45c8667fa23dec16ca50002d9504a34688ca5c&rBranch=main&rCommit=2a9afdae81d0dde98e96d7e3c9ca840e241e5405
<img width="1482" alt="image" src="https://github.com/user-attachments/assets/6e6a7f39-7f44-459f-9845-9a37f084ea82" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151691
Approved by: https://github.com/oulgen, https://github.com/jansel, https://github.com/EikanWang
2025-04-25 17:48:53 +00:00
c03359de2d Revert "[Inductor] Record Triton’s Base32 Cache Key in .best_config for Debugging (#148981)"
This reverts commit fc6e37ceb23f99808265c11a37368078d5f982b8.

Reverted https://github.com/pytorch/pytorch/pull/148981 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @davidberard98 can you please help get these changes validated? Details in D73628297. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148981#issuecomment-2831044810))
2025-04-25 17:45:13 +00:00
4ea2e093ca [inductor][BE] Clean up use_mixed_mm and mixed_mm_choice usage inside pytorch (#152071)
Differential Revision: [D73551912](https://our.internmc.facebook.com/intern/diff/D73551912/)

Decided to leave the mixed_mm tests alive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152071
Approved by: https://github.com/eellison
2025-04-25 17:25:55 +00:00
67f75244ea Revert "[Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404)"
This reverts commit c91acad73a11825c366c51fb1e91d7e1a47d3f9e.

Reverted https://github.com/pytorch/pytorch/pull/151404 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @albanD can you please help it get relanded? To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/151404#issuecomment-2830829368))
2025-04-25 16:08:27 +00:00
d4a8e4e30c [dynamo] Guard serialization for HASATTR (#151349)
Adding guard serialization for type HASATTR

Differential Revision: [D73059073](https://our.internmc.facebook.com/intern/diff/D73059073/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151349
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #151318, #151343
2025-04-25 14:16:30 +00:00
558f45190e [dynamo] Guard serialization for NOT_PRESENT_IN_GENERIC_DICT (#151343)
Adding guard serialization for type NOT_PRESENT_IN_GENERIC_DICT

Differential Revision: [D73057304](https://our.internmc.facebook.com/intern/diff/D73057304/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151343
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #151318
2025-04-25 14:16:30 +00:00
a34c28e0d2 [dynamo] Add guard serialization for tensor matches. (#151318)
This is a proof-of-concept of how we could serialize a guard and deserialize it back from the bytes.

The main behavioral change introduced in this diff is on CheckFunctionManager:

```
check_fn_manager = CheckFunctionManager(code, output_graph, guards_serialization_mode="save")

guards_state: bytes = check_fn_manager.guards_state
```

Once `guards_serialization_mode` is set to `save`, CheckFunctionManager will return an additional `bytes` object called `guards_state`, which should contain all the information needed for deserializing guards later.

When we load the guards state back, we set `guards_serialization_mode` to `load`:

```
output_graph_state = pickle.loads(guards_state)
check_fn_manager = CheckFunctionManager(code, output_graph_state, guards_serialization_mode="load")
```

# TENSOR_MATCH

Since we have many types of guards to support, we will break the work into small diffs instead of a single diff that supports every guard.

We kick off the work with TENSOR_MATCH in this diff.

# Testing

For each type of guard we will test it like the following:
1. Use guard_filter_fn to select 1 type of guard each time.
2. Call InstructionTranslator directly on an example function to get OutputGraph and CheckFunctionManager (reference guard manager)
3. Serialize->deserialize the output graph state and re-build the guards with a new CheckFunctionManager (loaded guard manager)
4. Feed a set of example inputs to both the reference and the loaded guard manager to check that their behavior matches.
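
The round-trip check in the steps above can be sketched in plain Python. This is only an illustration under stated assumptions: `ToyTensorMatchGuard`, `save_guards`, `load_guards`, and `evaluate` are hypothetical names, and the toy guard compares dtype and shape only. It does not use CheckFunctionManager or any dynamo internals.

```python
import pickle


class ToyTensorMatchGuard:
    """Hypothetical stand-in for a TENSOR_MATCH guard (dtype and shape only)."""

    def __init__(self, dtype, shape):
        self.dtype = dtype
        self.shape = tuple(shape)

    def check(self, value):
        # A guard passes only if the input still looks like what was traced.
        return value["dtype"] == self.dtype and tuple(value["shape"]) == self.shape


def save_guards(guards):
    """'save' direction: reduce the guards to plain data and pickle it to bytes."""
    return pickle.dumps([{"dtype": g.dtype, "shape": g.shape} for g in guards])


def load_guards(state):
    """'load' direction: rebuild fresh guard objects from the pickled bytes."""
    return [ToyTensorMatchGuard(**entry) for entry in pickle.loads(state)]


def evaluate(guards, inputs):
    """Run every guard against every example input."""
    return [all(g.check(x) for g in guards) for x in inputs]


reference = [ToyTensorMatchGuard("float32", (2, 3))]
loaded = load_guards(save_guards(reference))

example_inputs = [
    {"dtype": "float32", "shape": (2, 3)},  # matches the guard
    {"dtype": "float32", "shape": (3, 2)},  # wrong shape
    {"dtype": "int64", "shape": (2, 3)},    # wrong dtype
]

reference_results = evaluate(reference, example_inputs)
loaded_results = evaluate(loaded, example_inputs)
```

The test passes when `reference_results` and `loaded_results` agree on every example input, including the inputs that are expected to fail the guard.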

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151318
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-04-25 14:16:23 +00:00
6e8602b558 Relax tolerance on test_aot_autograd_exhaustive_matmul_cpu_float32 without MKL (#152106)
When e.g. OpenBLAS is used instead of MKL, the differences get too large:
> Greatest absolute difference: 5.91278076171875e-05 at index (7,) (up to 1e-05 allowed)
> Greatest relative difference: 3.468156592134619e-06 at index (7,) (up to 1.3e-06 allowed)

I traced some of the matmul operations and there are differences of around 8e-6 between MKL and OpenBLAS but I haven't found where exactly the backward pass is calculated which is where the actual differences arise. So I couldn't check if there is some difference in the low-level BLAS function used by the autograd.
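
To make the quoted failure concrete, here is a small sketch of the usual closeness criterion, `|actual - expected| <= atol + rtol * |expected|`, applied to the reported numbers. The magnitude of the expected value is reconstructed from the two reported differences, and the relaxed `atol` of `1e-4` is a hypothetical value chosen for illustration, not the tolerance this PR sets.

```python
def is_close(actual, expected, rtol, atol):
    """Closeness criterion: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Values reported in the failure message above.
abs_diff = 5.91278076171875e-05
rel_diff = 3.468156592134619e-06

# The relative difference is abs_diff / |expected|, so |expected| is about 17.
expected = abs_diff / rel_diff
actual = expected + abs_diff

# Default float32 tolerances quoted in the message: rtol=1.3e-6, atol=1e-5.
passes_default = is_close(actual, expected, rtol=1.3e-6, atol=1e-5)

# Hypothetical relaxed absolute tolerance, for illustration only.
passes_relaxed = is_close(actual, expected, rtol=1.3e-6, atol=1e-4)
```

With the default tolerances the allowed error is roughly `1e-5 + 1.3e-6 * 17`, about `3.2e-5`, which is below the observed `5.9e-5`, so the comparison fails.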

However it seems odd that there is a difference at all: For the MKL case it seems to be zero up to the accuracy shown by Python.

So it seems the AOT compilation has some differences when MKL is not available.

Maybe this is also the reason why it fails for ARM and hence the test is skipped there. Maybe @zou3519 knows more as he introduced those skip markers in https://github.com/pytorch/pytorch/pull/85565

Is there any documentation how and where `matmul_backward(_out)` is generated and how AOT transforms it with and without MKL?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152106
Approved by: https://github.com/zou3519
2025-04-25 14:03:37 +00:00
c1c8c1f8d6 [Quant][X86] add an op to compute uint8 pointwise mul (#151112)
**Summary**
Add a new op, `onednn.qmul.tensor`, for int8 elementwise mul, which accepts inputs on CPU device (instead of QuantizedCPU).
The new op is implemented by AVX512 instructions and it provides similar or better performance, depending on shape, than its counterpart for QuantizedCPU device `quantized.mul`.
The new op supports output dtypes other than uint8 (fp32, fp16 and bf16 are supported).

**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k test_int8_mul_onednn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151112
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2025-04-25 12:52:54 +00:00
ad81eeb7c7 Refactor to use torch.accelerator.device_index instead of torch.cuda.device for generic device context manager (#148880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148880
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #148864
2025-04-25 09:45:25 +00:00
33c75cae0a Add torch.accelerator.device_index as accelerator's device switch context (#148864)
# Motivation
We propose adding support for the Python with statement on `torch.accelerator.device_index` to enable device switching functionality. This enhancement would simplify writing device-agnostic code and provide benefits across all accelerators. Its device-specific counterparts include [`torch.cuda.device`](00199acdb8/torch/cuda/__init__.py (L482)) and  [`torch.cuda._DeviceGuard`](00199acdb8/torch/cuda/__init__.py (L469)).

**Design Philosophy**
It accepts either an `Int` or `None` as input. When `None` is passed, no device switch is performed. Supporting `None` is important for compatibility, as it's possible to encounter `None` values from `torch.device.index`.

Therefore, with this PR, we can write:

```python
src = 0
dst = 1
# Set src to current device
torch.accelerator.set_device_index(src)
with torch.accelerator.device_index(dst):
    # Inside with statement, we set dst to current device
    assert torch.accelerator.get_device_index() == dst
# Here the current device should be src
assert torch.accelerator.get_device_index() == src
```
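
The `None` handling described under Design Philosophy can be modeled without any accelerator. The sketch below is a pure-Python stand-in: the module-level `_current_index` and the three functions are hypothetical models of the behavior described, not the `torch.accelerator` implementation.

```python
from contextlib import contextmanager

_current_index = 0  # stand-in for the accelerator's current device index


def get_device_index():
    return _current_index


def set_device_index(index):
    global _current_index
    _current_index = index


@contextmanager
def device_index(index):
    """Switch to `index` inside the block; do nothing when `index` is None."""
    if index is None:
        yield
        return
    previous = get_device_index()
    set_device_index(index)
    try:
        yield
    finally:
        # Restore the previous device even if the body raises.
        set_device_index(previous)


with device_index(1):
    inside = get_device_index()
after = get_device_index()

with device_index(None):
    inside_none = get_device_index()
```

Treating `None` as a no-op means a caller can pass `some_device.index` straight through without first checking whether it is set.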

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148864
Approved by: https://github.com/albanD
2025-04-25 09:45:25 +00:00
f38dae76ee [Proposal] Drop legacy CUDA support to slim down the wheels (#152069)
Proposal of dropping legacy CUDA support to slim down the Windows wheels.

With the latest release of 2.7.0 and the new Blackwell support we've seen yet another rise in size to the wheel, going from ~2.5GB with Pytorch 2.6.0 all the way to ~3.1GB with pytorch 2.7.0 CUDA 12.8 on Python 3.12 and ~3.3GB with Python 3.13.

Python 3.12, Pytorch 2.7.0 Cuda 12.8
![image](https://github.com/user-attachments/assets/78a5bbcb-027e-4139-84f0-57bfae9f594e)

Python 3.13, Pytorch 2.7.0, Cuda 12.8
![image](https://github.com/user-attachments/assets/7f256860-46e3-41f6-81b3-65bd3ee5aa77)

These .CI changes imply removing support for many GPUs that are now about 8 years old or older, including the GTX960M, 950M, 940M, 930M and some Quadro GPUs dating back to April 2016, such as the Quadro M500M, as per [Nvidia's Documentation](https://developer.nvidia.com/cuda-gpus).

This change would also save on our bandwidth 😅

@seemethere
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152069
Approved by: https://github.com/seemethere, https://github.com/eqy, https://github.com/atalman
2025-04-25 08:20:00 +00:00
a811d3351b [ONNX] Implement sym_not (#152111)
Implement onnx support for sym_not. Replaces https://github.com/pytorch/pytorch/pull/147472

Fix https://github.com/pytorch/pytorch/issues/136572
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152111
Approved by: https://github.com/titaiwangms
2025-04-25 07:50:37 +00:00
6120cc8ccd [executorch hash update] update the pinned executorch hash (#151728)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151728
Approved by: https://github.com/pytorchbot
2025-04-25 05:33:09 +00:00
a936d596f6 [Cutlass] Implement EVT example tensor creation (#150904)
This PR implements a translation layer from inductor IR to "example tensors" the expected arguments of the EVT tracer. These tensors basically store the name, shape, stride, and dtype of the tensor and allow an ast-based python parse to generate the EVT C++.

updates to example tensor creation

Previously merged:
* https://github.com/pytorch/pytorch/pull/150903
* https://github.com/pytorch/pytorch/pull/150346
* https://github.com/pytorch/pytorch/pull/150345
* https://github.com/pytorch/pytorch/pull/150344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150904
Approved by: https://github.com/eellison
2025-04-25 04:43:37 +00:00
dda0c952e7 [audio hash update] update the pinned audio hash (#152149)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152149
Approved by: https://github.com/pytorchbot
2025-04-25 04:20:06 +00:00
e2c7ae52d5 [ONNX] Add group_norm support from opset 21 (#152138)
I didn't run the model in test because ORT doesn't have the op yet. Nevertheless it should be leveraged for newer opset versions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152138
Approved by: https://github.com/titaiwangms, https://github.com/shubhambhokare1, https://github.com/cyyever
2025-04-25 03:30:07 +00:00
1a6d50d407 Reducer: add check on received data to avoid segfault (#152143)
When ncclCommAbort is called it may return invalid/corrupted data to the reducer. This adds a check so we don't read past the end of the tensors leading to a segfault.

While this looks like it could be a security issue it actually isn't since we only read past the end of the buffer, not write.

Fixes #149418

Test plan:

https://gist.github.com/d4l3k/b47c2c95cf9c37e78069e19f1b6ed2c6

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152143
Approved by: https://github.com/fduwjj, https://github.com/fegin
2025-04-25 02:16:44 +00:00
7f28c03fac Adding fbgemm to whitelist (#152079)
Adding `torch.ops.fbgemm` to GraphPickler's allowlist. Otherwise, the fx graph module containing `fbgemm` node will return "Unable to pickle non-standard op" error.

The validation is done on the model and the difference appears only on the graph name not the node.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152079
Approved by: https://github.com/aorenste
2025-04-25 01:13:51 +00:00
8313bc27f2 Revert "Add OIDC permissions to bazel workflow (#151456)"
This reverts commit 5fc1eb85fc1b9d605939830d3be3506762b3df27.

Reverted https://github.com/pytorch/pytorch/pull/151456 on behalf of https://github.com/seemethere due to This is causing downstream failures on PRs, see examples in PR comment ([comment](https://github.com/pytorch/pytorch/pull/151456#issuecomment-2829130319))
2025-04-25 00:37:15 +00:00
75c71ab371 [Break XPU] generalize newly introduced device bias code in Inductor UT. (#151926)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151926
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-04-25 00:03:23 +00:00
d70490ecfe [Inductor][CPP] Optimize the epilogue for int8 GEMM Template (#152000)
**Summary**
For int8 GEMM Template, the micro GEMM will calculate in u8s8s32 and we will do the scale/zp compensation in the epilogue. In general,  it will be calculated as:
```
temp = micro_gemm_output * x_scale * w_scale
temp = temp - (x_scale * w_scale * x_zp) * sum(w, 0)
```
For case when `x_scale, w_scale, x_zp` are constant, we can pre-calculate the compensation to save runtime calculation.
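
As a worked illustration of the compensation above, the following pure-Python sketch compares the epilogue formula against dequantizing first and multiplying afterwards. It assumes symmetric weights (no weight zero point) and per-tensor scales; the matrices and constants are made-up toy values, and `matmul` is a helper written for this example.

```python
def matmul(a, b):
    """Plain nested-list matrix multiply: (M x K) @ (K x N)."""
    return [
        [sum(a[m][k] * b[k][n] for k in range(len(b))) for n in range(len(b[0]))]
        for m in range(len(a))
    ]


# Toy quantized operands (assumed values): x is uint8, w is int8.
x_q = [[3, 7], [10, 2]]          # M x K
w_q = [[1, -2, 4], [-3, 5, 0]]   # K x N
x_scale, w_scale, x_zp = 0.5, 0.25, 4

# Reference: dequantize both operands, then multiply in floating point.
x_fp = [[x_scale * (v - x_zp) for v in row] for row in x_q]
w_fp = [[w_scale * v for v in row] for row in w_q]
reference = matmul(x_fp, w_fp)

# Epilogue form: integer micro GEMM, then scale and subtract the compensation.
acc = matmul(x_q, w_q)
col_sum = [sum(row[n] for row in w_q) for n in range(len(w_q[0]))]  # sum(w, 0)

# With constant x_scale, w_scale and x_zp this vector can be computed once.
compensation = [x_scale * w_scale * x_zp * s for s in col_sum]

output = [
    [acc[m][n] * x_scale * w_scale - compensation[n] for n in range(len(col_sum))]
    for m in range(len(acc))
]
```

Because `compensation` depends only on the weights and the three constants, it does not need to be recomputed per call, which is the saving the summary describes.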

**Performance**
Test with 4 cores of XEON-5 and shapes from VIT model
Before
```
GEMM(M=197,N=768,K=768) compile: 0.0939 ms (2.48 TOPS, 18.13 GB/s)
GEMM(M=197,N=3072,K=768) compile: 0.4275 ms (2.17 TOPS, 13.90 GB/s)
GEMM(M=197,N=768,K=3072) compile: 0.2677 ms (3.47 TOPS, 22.20 GB/s)
GEMM(M=1,N=1000,K=768) compile: 0.0148 ms (0.10 TOPS, 99.10 GB/s)
```

After
```
GEMM(M=197,N=768,K=768) compile: 0.0597 ms (3.90 TOPS, 28.53 GB/s)
GEMM(M=197,N=3072,K=768) compile: 0.2126 ms (4.37 TOPS, 27.95 GB/s)
GEMM(M=197,N=768,K=3072) compile: 0.2282 ms (4.07 TOPS, 26.04 GB/s)
GEMM(M=1,N=1000,K=768) compile: 0.0149 ms (0.10 TOPS, 98.71 GB/s)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152000
Approved by: https://github.com/Xia-Weiwen, https://github.com/CaoE, https://github.com/jansel
2025-04-24 23:36:00 +00:00
2089b22c76 [xpu] set aot device flags in cpp_extension (#149459)
If PyTorch is compiled with only AOT text strings starting with "dg2", the `_get_sycl_arch_list()` function will pass an empty string to `-device` argument of `ocloc` and then cause a compilation crash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149459
Approved by: https://github.com/guangyey, https://github.com/dvrogozh, https://github.com/malfet

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
2025-04-24 22:55:52 +00:00
fc6e37ceb2 [Inductor] Record Triton’s Base32 Cache Key in .best_config for Debugging (#148981)
This is a follow-up PR of the reverted one https://github.com/pytorch/pytorch/pull/147019 :

Modified TorchInductor’s autotuning flow so that each best_config JSON file also includes the Triton “base32” (or base64) cache key.

Motivation

Debugging & Analysis: With this change, we can quickly identify which compiled binary and IRs belong to a given best config.
The impact is minimal since it is only an extra field in .best_config. It can help advanced performance tuning or kernel-level debugging.

Also, since Triton already stores cubin/hsaco in its cache, developers/researchers can avoid setting store_cubin = True: they can get the cubin/hsaco from the Triton cache, and with the code provided in this PR they can easily match the best_config with the right Triton cache directory for the "best" kernel.
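
For illustration only, a base32 key of this kind can be derived from a hex digest as sketched below. The helper name `hex_to_base32` and the example digest are hypothetical, and stripping the `=` padding is an assumption, so treat this as a sketch rather than Triton's exact scheme.

```python
import base64


def hex_to_base32(hex_digest):
    """Encode a hex digest as an unpadded base32 string."""
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii").rstrip("=")


# Placeholder digest: the hex encoding of b"hello", not a real cache hash.
example_key = hex_to_base32("68656c6c6f")
```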

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148981
Approved by: https://github.com/davidberard98
2025-04-24 21:28:53 +00:00
0413358a77 Non-deterministic alert in histc_cuda for floating types only (#151701)
The note about atomic add only applies for floating point. The
implementation is deterministic for integer data types.
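
The distinction follows from floating-point addition not being associative, while integer addition is. A minimal sketch, using a hypothetical `ordered_sum` helper to stand in for accumulation in different orders (as concurrent atomic adds may produce):

```python
def ordered_sum(values):
    """Accumulate left to right, like a sequence of adds into one bin."""
    total = 0
    for v in values:
        total += v
    return total


floats = [0.1, 0.2, 0.3]
float_forward = ordered_sum(floats)
float_backward = ordered_sum(reversed(floats))

ints = [1, 2, 3]
int_forward = ordered_sum(ints)
int_backward = ordered_sum(reversed(ints))
```

Here the two float totals differ in the last bit, whereas the integer totals are identical regardless of order.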

fixes: #151610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151701
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2025-04-24 21:16:46 +00:00
6ced5e6840 Python 3.11 and 3.13 support for Windows Arm64 (#152109)
This PR adds Python 3.11 and 3.13 support for Windows Arm64 wheels and creates the necessary jobs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152109
Approved by: https://github.com/malfet
2025-04-24 21:09:14 +00:00
eqy
d78d2af4e3 [CUDA][TF32] Account for TF32 in test_corrcoef (#151830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151830
Approved by: https://github.com/Skylion007
2025-04-24 21:06:07 +00:00
8a9c66bb70 Improve stable library apis per Scott's feedback (#152040)
Following 3 suggestions:
1. inline at::Tensor arg
2. use uniq ptr of array vs std::vector
3. document the `std::optional<S>()` case

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152040
Approved by: https://github.com/swolchok, https://github.com/albanD
2025-04-24 20:51:03 +00:00
dccc41581a Include other accelerators in capturable docstr for optimizers (#149770)
Fixes #149722

@ILCSFNO is this better?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149770
Approved by: https://github.com/albanD
2025-04-24 20:38:42 +00:00
bd09d87fdb add Out Notes (#151306)
Fixes #150181
@albanD Could you please have a check?

Build locally without pytorch build:

![Developer-FAQ](https://github.com/user-attachments/assets/351a7e0b-588e-48ae-ad0a-03f427c86e89)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151306
Approved by: https://github.com/albanD
2025-04-24 20:25:09 +00:00
92f125e622 [export] improve error message for deserializing custom triton op (#152029)
In https://github.com/pytorch/pytorch/issues/151746, users ran into an error where a custom triton op cannot be resolved into an operator from string target. We improve the error message by reminding users to register the same custom operator at de-serialization time.

Now the error looks like this:
```python
torch._export.serde.serialize.SerializeError: We failed to resolve torch.ops.triton_kernel.add.default to an operator. If it's a custom op/custom triton op, this is usally because the custom op is not registered when deserializing. Please import the custom op to register it before deserializing. Otherwise, please file an issue on github. Unsupported target type for node Node(target='torch.ops.triton_kernel.add.default', inputs=[NamedArgument(name='x', arg=Argument(as_tensor=TensorArgument(name='linear')), kind=1), NamedArgument(name='y', arg=Argument(as_tensor=TensorArgument(name='mul')), kind=1)], outputs=[Argument(as_tensor=TensorArgument(name='add'))], metadata={'stack_trace': 'File "/data/users/yidi/pytorch/test.py", line 50, in forward\n    output = triton_add(dense_output, bias)', 'nn_module_stack': 'L__self__,,__main__.SimpleModel', 'torch_fn': 'add.default_1;OpOverload.add.default'}, is_hop_single_tensor_return=None): <class 'str'>.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152029
Approved by: https://github.com/jingsh
2025-04-24 20:22:05 +00:00
24bda01a93 Pin theme to a branch (#152046)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152046
Approved by: https://github.com/albanD
2025-04-24 20:20:21 +00:00
eqy
6efc572221 [CUDA][CPU] Bump system memory requirement for test_cross_entropy_large_tensor (#151812)
`/usr/bin/time` seems to show max resident pages at 119GiB

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151812
Approved by: https://github.com/colesbury
2025-04-24 19:25:29 +00:00
b1d055fd6a Revert "[dynamo] Add guard serialization for tensor matches. (#151318)"
This reverts commit 81c4369d813facf39313dfd481adc71704cbc2c1.

Reverted https://github.com/pytorch/pytorch/pull/151318 on behalf of https://github.com/zhxchen17 due to macos test failing ([comment](https://github.com/pytorch/pytorch/pull/151318#issuecomment-2828638168))
2025-04-24 19:22:45 +00:00
b11c9e1808 [CI][docker] Use install_cusparselt when possible in docker image (#150600)
spot checked builds for line like `Found CUSPARSELT: /usr/local/cuda/lib64/libcusparseLt.so`.  I don't know if there's another way to do it

I am slowly trying to reduce the duplicated code in docker image installs
Pros:
* less dup code

Cons:
* more docker copies
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150600
Approved by: https://github.com/atalman
2025-04-24 18:52:10 +00:00
ff075d0815 Update docs dependencies for local build (#151796)
Fixes #151786

- Changed requirements.txt to a symlink to .ci/docker/requirements-docs.txt
- Updated README.md with better doc build instructions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151796
Approved by: https://github.com/malfet
2025-04-24 18:40:42 +00:00
81c4369d81 [dynamo] Add guard serialization for tensor matches. (#151318)
This is a proof-of-concept of how we could serialize a guard and deserialize it back from the bytes.

The main behavioral change introduced in this diff is on CheckFunctionManager:

```
check_fn_manager = CheckFunctionManager(code, output_graph, guards_serialization_mode="save")

guards_state: bytes = check_fn_manager.guards_state
```

Once `guards_serialization_mode` is set to `save`, CheckFunctionManager will return an additional `bytes` object called `guards_state`, which should contain all the information needed to deserialize the guards later.

When we load the guards state back, we set `guards_serialization_mode` to `load`:

```
output_graph_state = pickle.loads(guards_state)
check_fn_manager = CheckFunctionManager(code, output_graph_state, guards_serialization_mode="load")
```

# TENSOR_MATCH

Since we have many types of guards to support, we will break the work into small diffs instead of a single diff that supports every guard.

We kick off the work with TENSOR_MATCH in this diff.

# Testing

For each type of guard we will test it like the following:
1. Use guard_filter_fn to select 1 type of guard each time.
2. Call InstructionTranslator directly on an example function to get OutputGraph and CheckFunctionManager (reference guard manager)
3. Serialize->deserialize the output graph state and re-build the guards with a new CheckFunctionManager (loaded guard manager)
4. Feed a set of example inputs to both the reference and the loaded guard manager to check that their behavior matches.

Differential Revision: [D72987485](https://our.internmc.facebook.com/intern/diff/D72987485/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151318
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-04-24 18:07:01 +00:00
03970dfd4c Add functionality for installing free variables (#151134)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151134
Approved by: https://github.com/anijain2305
ghstack dependencies: #152036
2025-04-24 17:57:54 +00:00
402d19c0bd add basic unit tests and noop config (#152036)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152036
Approved by: https://github.com/anijain2305
2025-04-24 17:57:54 +00:00
9c1bc9ce46 [fake tensor] Cache None, integer and SymInts in the output (#151961)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151961
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
ghstack dependencies: #151409, #151633, #151477, #151957
2025-04-24 16:44:45 +00:00
0eb554e96a Better error msg for too big to optimize (#151855)
Summary: In the "too big to optimize" error message, tell the user that they should use the torch._inductor.config.aot_inductor.compile_wrapper_opt_level = 'O0' flag

Test Plan:
This is not added to unit test cases because it runs for a little longer time before the expected failure

```

    def test_runtime_checks_error_msg(self):

        with torch.library._scoped_library("mylib", "FRAGMENT") as lib:
            torch.library.define(
                "mylib::foo",
                "(Tensor a, Tensor b) -> Tensor",
                tags=torch.Tag.pt2_compliant_tag,
                lib=lib,
            )

            @torch.library.impl("mylib::foo", "cpu", lib=lib)
            def foo(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
                return a + b

            @torch.library.impl_abstract("mylib::foo", lib=lib)
            def foo_fake_impl(a, b):
                return a + b

            class Model(torch.nn.Module):
                def __init__(self) -> None:
                    super().__init__()

                def forward(self, x):
                    for i in range(10000):
                        x = torch.ops.mylib.foo(x, x)
                    return x

            inputs = (torch.ones(8, 8, 8), )
            model = Model()
            with self.assertRaisesRegex(Exception, "torch._inductor.config.aot_inductor.compile_wrapper_opt_level"):
                with torch.no_grad():
                    AOTIRunnerUtil.compile(
                        model,
                        inputs,
                    )
```

Differential Revision: D72323380

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151855
Approved by: https://github.com/desertfire
2025-04-24 16:35:19 +00:00
56e67badc3 Move verbose warning to warning_once (#152044)
It was printing 1000s of lines for me..
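
As a rough illustration of the warn-once pattern named in the title, the sketch below emits a given message only the first time it is seen. The `warning_once` function, the `ListHandler` class, and the logger name are hypothetical helpers written for this example; they are not the logging utility PyTorch itself uses.

```python
import logging


class ListHandler(logging.Handler):
    """Collects emitted messages in a list so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("warn_once_example")
logger.propagate = False  # keep the example's output out of the root logger
handler = ListHandler()
logger.addHandler(handler)

_seen_messages = set()


def warning_once(message):
    """Log `message` at WARNING level only the first time it is seen."""
    if message in _seen_messages:
        return
    _seen_messages.add(message)
    logger.warning(message)


# A hot loop that would otherwise print the same warning 1000 times.
for _ in range(1000):
    warning_once("verbose warning")
```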

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152044
Approved by: https://github.com/XilunWu
2025-04-24 16:18:34 +00:00
3a170a8ce6 Revert "[Cutlass] Implement EVT example tensor creation (#150904)"
This reverts commit 253059356fc93b51c7c53246a5922db3fb14e184.

Reverted https://github.com/pytorch/pytorch/pull/150904 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking the test_example_tensor_creation test internally. See D73519195 for more details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/150904#issuecomment-2828132914))
2025-04-24 16:00:25 +00:00
d743a7bd85 [invoke_subgraph] Cache fake tensor if no unbacked symint in the output (#151957)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151957
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
ghstack dependencies: #151409, #151633, #151477
2025-04-24 14:17:22 +00:00
1d73b644a8 [fake tensor cache] Support index with non bool/int8 indices (#151477)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151477
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
ghstack dependencies: #151409, #151633
2025-04-24 13:48:18 +00:00
41285f26e4 [invoke_subgraph][fake tensor] Add finalizer on subgraph instead of the functionalize ctx wrapper (#151633)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151633
Approved by: https://github.com/zou3519
ghstack dependencies: #151409
2025-04-24 13:32:08 +00:00
3278ddd50c [invoke_subgraph] Compile time traces (#151409)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151409
Approved by: https://github.com/zou3519
2025-04-24 13:20:50 +00:00
5e320eea66 [BE] follow autoformating and linter (#151507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151507
Approved by: https://github.com/Skylion007
2025-04-24 07:37:04 +00:00
5b368fa0b7 Add torch.cuda._compile_kernel() (#151484)
Follow-up work on top of https://github.com/pytorch/pytorch/pull/149480

Wrapper on top of nvrtc inspired by https://gist.github.com/malfet/2c9a25976dd7396430c38af603f791da from @malfet

Compiling toy kernels with this setup takes 0.01s vs 90s using `load_inline()` on my local H100. This was primarily motivated by the timeouts I was seeing in the popcorn leaderboard but would also be useful to integrate into KernelBench

This PR is in the same spirit as https://github.com/pytorch/pytorch/pull/148972 which was a similar UX for Metal

For now we are planning on landing this as a private function because we expect to iterate on both the user-facing API and the internal implementation. We will open a separate issue to discuss the path towards making this work public and to give a broader overview of the state of custom CUDA kernel authoring in PyTorch.

Future work, as a prereq to making the work public
* divup primitive
* support multiple kernels
* Expose _get_nvrtc_version from native code
* interop with torch.compile
* AMD support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151484
Approved by: https://github.com/malfet
2025-04-24 07:14:31 +00:00
78953ee122 [pytorch] reland of [cutlass backend] delay construction of cutlass presets to when called (#151875) (#152031)
Differential Revision: D73524978

reland of https://github.com/pytorch/pytorch/pull/151875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152031
Approved by: https://github.com/yangw-dev
2025-04-24 05:36:36 +00:00
2ea8653391 [vec128] Fix fmsub NEON defintion (#152075)
As reported in https://github.com/pytorch/pytorch/issues/149292, according to the manual, `vfmsq_f32` implements `c - a * b` rather than `a * b - c`, so its call must be prefixed with `vnegq_f32`
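
The sign issue can be checked with a scalar model. In the sketch below, `fms_model` and `neg_model` are hypothetical Python stand-ins for the semantics described above (`c - a * b` and negation); they are not the NEON intrinsics, and the two `fmsub_*` functions only mirror the before and after behavior.

```python
def fms_model(c, a, b):
    """Scalar model of the fused multiply-subtract described above: c - a * b."""
    return c - a * b


def neg_model(x):
    """Scalar model of negation."""
    return -x


def fmsub_before(a, b, c):
    # Uses the multiply-subtract result directly, which yields c - a * b.
    return fms_model(c, a, b)


def fmsub_after(a, b, c):
    # Negating the result gives -(c - a * b) == a * b - c, the intended fmsub.
    return neg_model(fms_model(c, a, b))


a, b, c = 2.0, 3.0, 1.0
before = fmsub_before(a, b, c)
after = fmsub_after(a, b, c)
```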

Also, adjust the tests to use OpMath for FMA computation to avoid accuracy error accumulation due to non-fused multiply-and-add over lower precision dtypes

Note that `Vectorized::fmsub` is not currently instantiated anywhere, so it could safely remain broken

TODO:
 - Enable C++ testing on MacOS and/or aarch64 platforms (right now Mac tests are built without C++ tests)

Fixes https://github.com/pytorch/pytorch/issues/149292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152075
Approved by: https://github.com/swolchok
ghstack dependencies: #151955
2025-04-24 05:10:45 +00:00
5e9bdc9b86 [MPS] layernorm forward kernel (#152010)
Implements layernorm forward pass as a metal kernel instead of MPSGraph ops. Speed ups are indicated on the chart below:
![Figure_1](https://github.com/user-attachments/assets/27a4d2ef-b3e4-4650-9ce3-b939c080321e)

Script for generating times, need to build torch with old/new codebase and then run this with different file name indicated at the end of the script
```python
import csv
import time

import numpy as np

import torch
import torch.nn.functional as F

matrix_sizes = [32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
batch_sizes = [1]
elementwise_affine = [False, True]
num_runs = 50
warmup_runs = 3

def create_input_tensor(n, batch_size):
    torch.manual_seed(42)
    return torch.randn(batch_size, n, dtype=torch.float32)

def run_layer_norm(A, normalized_shape, elementwise_affine):
    # NOTE: elementwise_affine is accepted but unused; F.layer_norm is called without weight or bias.
    torch.mps.synchronize()
    start = time.perf_counter()
    out = F.layer_norm(A, normalized_shape)
    torch.mps.synchronize()
    end = time.perf_counter()
    return out, end - start

results = {"N": [], "elementwise_affine": [], "batch_size": [], "mean_time": [], "std_time": []}

for el_aff in elementwise_affine:
    for n in matrix_sizes:
        for batch_size in batch_sizes:
            print(f"\nBenchmarking LayerNorm for input size N={n}, batch_size={batch_size}, elementwise_affine={el_aff}")

            try:
                A_cpu = create_input_tensor(n, batch_size)
                A_mps = A_cpu.to("mps")

                normalized_shape = (n,)

                for _ in range(warmup_runs):
                    _, _ = run_layer_norm(A_mps, normalized_shape, el_aff)

                times = []
                for _ in range(num_runs):
                    _, t = run_layer_norm(A_mps, normalized_shape, el_aff)
                    times.append(t)

                mean_time = np.mean(times)
                std_time = np.std(times)

                results["N"].append(n)
                results["elementwise_affine"].append(el_aff)
                results["batch_size"].append(batch_size)
                results["mean_time"].append(mean_time)
                results["std_time"].append(std_time)

                print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

            except RuntimeError as e:
                print(f"Error for N={n}, batch_size={batch_size}: {e}")
                continue

with open("layernorm_benchmark_times_new.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["N", "elementwise_affine", "batch_size", "mean_time", "std_time"])
    for i in range(len(results["N"])):
        writer.writerow(
            [
                results["N"][i],
                results["elementwise_affine"][i],
                results["batch_size"][i],
                results["mean_time"][i],
                results["std_time"][i],
            ]
        )

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152010
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-24 05:07:46 +00:00
a389835313 [MPS] Adjust test_sum_dtypes so it can run on MPS. (#152064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152064
Approved by: https://github.com/malfet, https://github.com/jansel

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-24 05:04:49 +00:00
2102b3b4c5 [FSDP1] print fqns when debug FlatParamHandle (#151336)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151336
Approved by: https://github.com/awgu, https://github.com/Skylion007
2025-04-24 04:49:24 +00:00
2a58d2a155 StringCordView: make iterator fast when there is only one piece (#151810)
This makes the StringCordView iterator a variant holding
either the existing implementation (when there is more than one piece)
or a simple `std::string_view::iterator` (when there is only one
piece). The latter seems to be significantly cheaper.

Differential Revision: [D73379178](https://our.internmc.facebook.com/intern/diff/D73379178/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151810
Approved by: https://github.com/Skylion007
ghstack dependencies: #151801, #151802, #151803, #151804, #151805, #151806, #151807
2025-04-24 04:43:34 +00:00
76cc379bec Fix missing moves in SchemaTypeParser::parseFakeAndRealType (#151807)
Was seeing a small amount of shared_ptr traffic from these.

The std::move(text) at the top is just a piggyback.

Differential Revision: [D73376720](https://our.internmc.facebook.com/intern/diff/D73376720/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151807
Approved by: https://github.com/zou3519, https://github.com/cyyever, https://github.com/Skylion007
ghstack dependencies: #151801, #151802, #151803, #151804, #151805, #151806
2025-04-24 04:43:34 +00:00
68454b9d17 Fix a missed c10::TypeFactory::create spot in function_schema_parser (#151806)
Looks like we are supposed to be using TypeFactory instead of direct creation everywhere that might run on mobile.

Differential Revision: [D73376716](https://our.internmc.facebook.com/intern/diff/D73376716/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151806
Approved by: https://github.com/Skylion007, https://github.com/iseeyuan
ghstack dependencies: #151801, #151802, #151803, #151804, #151805
2025-04-24 04:43:34 +00:00
b237211b42 Fix easy missing moves in function_schema_parser (#151805)
Just some straightforward not-moving-upon-return.

Differential Revision: [D73376718](https://our.internmc.facebook.com/intern/diff/D73376718/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151805
Approved by: https://github.com/malfet, https://github.com/cyyever
ghstack dependencies: #151801, #151802, #151803, #151804
2025-04-24 04:43:34 +00:00
89a85d0954 Add & use Token::text_view() (which returns a string_view unlike text()) (#151804)
Sadly, I can't just fix text() because that might cause lifetime issues in somebody's code.

Differential Revision: [D73376715](https://our.internmc.facebook.com/intern/diff/D73376715/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151804
Approved by: https://github.com/zou3519, https://github.com/cyyever, https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #151801, #151802, #151803
2025-04-24 04:43:34 +00:00
0559741d7f Fix return type of TypeFactoryBase<c10::DynamicType>::get (#151803)
getBaseType() actually returns a reference. This was causing shared_ptr copies.

Differential Revision: [D73376717](https://our.internmc.facebook.com/intern/diff/D73376717/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151803
Approved by: https://github.com/malfet, https://github.com/Skylion007
ghstack dependencies: #151801, #151802
2025-04-24 04:43:34 +00:00
fabbcddab1 Create and use DynamicTypes for check in DispatchKeyExtractor::makeBitsetForDispatchArgs (#151802)
On mobile, many but not all things in the JIT type subsystem start using DynamicType. Not using DynamicType was imposing a startup time cost here, as explained in the comment.

Differential Revision: [D73129442](https://our.internmc.facebook.com/intern/diff/D73129442/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151802
Approved by: https://github.com/malfet
ghstack dependencies: #151801
2025-04-24 04:43:34 +00:00
5de92e676a Don't copy DynamicType argument to DynamicType::create (#151801)
This improves performance of DynamicType::isSubtypeOfExt.

Differential Revision: [D73129449](https://our.internmc.facebook.com/intern/diff/D73129449/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151801
Approved by: https://github.com/malfet
2025-04-24 04:43:34 +00:00
43f1b60ded Revert "[MPS] Adjust test_sum_dtypes so it can run on MPS. (#152064)"
This reverts commit d703f062fe7e4ead362ec0473ef33579e84532ac.

Reverted https://github.com/pytorch/pytorch/pull/152064 on behalf of https://github.com/malfet due to Lint is not green ([comment](https://github.com/pytorch/pytorch/pull/152064#issuecomment-2826305781))
2025-04-24 04:04:49 +00:00
e2cf60ff18 [MPS] Fix test_neg_index_mps (#151966)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151966
Approved by: https://github.com/malfet, https://github.com/jansel

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-24 04:02:09 +00:00
2ee8de54b1 [dynamic shapes] user-code friendly statically_known_true, has_static_value (#151601)
Fixes #151480

Allows `statically_known_true` in user code, as well as introducing `has_static_value`, returning True if the input has a static bool/float/int value

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151601
Approved by: https://github.com/laithsakka, https://github.com/zou3519, https://github.com/jingsh
2025-04-24 02:53:59 +00:00
d703f062fe [MPS] Adjust test_sum_dtypes so it can run on MPS. (#152064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152064
Approved by: https://github.com/malfet, https://github.com/jansel
2025-04-24 02:32:36 +00:00
4ac2ee573d [sigmoid] memory planner C10 deps (#151275)
Summary: perf-sensitive util functions for use in our memory planner

Test Plan: CI

Differential Revision: D73002726

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151275
Approved by: https://github.com/georgiaphillips
2025-04-24 01:46:32 +00:00
c91acad73a [Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404)
As the title stated

**Changes:**
- Add **record**, **query** and **enable_timing** check
- Add related tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151404
Approved by: https://github.com/albanD
2025-04-24 01:28:09 +00:00
f39a1a43ee Fix typos in meta.rst (#151979)
### Fixes made:
- "allow you to the module" → corrected to "allows you to move the module"

- "allow" → changed to "allows" to agree with the singular subject "method"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151979
Approved by: https://github.com/colesbury
2025-04-24 01:25:09 +00:00
4e1d4333f7 [FlexAttention] Remove Old Constraint on lastdim strides (#151959)
Fixes: #148827

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151959
Approved by: https://github.com/Chillee
ghstack dependencies: #151846
2025-04-24 01:09:52 +00:00
2455ded502 [FlexAttention] Fix device test instantation (#151846)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151846
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng, https://github.com/mlazos
2025-04-24 01:09:52 +00:00
f2cfeb23e5 [Environment Variable][7/N] Use thread-safe getenv functions (#140211)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211
Approved by: https://github.com/ezyang, https://github.com/eqy
2025-04-24 01:06:29 +00:00
8172397025 Revert "Update torch-xpu-ops commit pin (#150827)"
This reverts commit 776aa682218bad4df7b6cd46ef2a0f1d8ca1194c.

Reverted https://github.com/pytorch/pytorch/pull/150827 on behalf of https://github.com/etaf due to Inductor UT regression ([comment](https://github.com/pytorch/pytorch/pull/150827#issuecomment-2825857903))
2025-04-24 00:41:06 +00:00
4d2d833976 [CI] Update sleef submodule to v3.8 (#151955)
Should help with RISC-V cross-compilation.
3.9.0 migration is blocked by sleef project switching to C++20
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151955
Approved by: https://github.com/atalman, https://github.com/wdvr, https://github.com/Skylion007
2025-04-23 23:56:05 +00:00
fd3d339e17 [dynamic shapes] be less aggressive with runtime assert CSE for bounds (#151590)
Fixes #150540
Fixes #147772

Stops trying to CSE bound expressions, only does exact deduplication for runtime asserts. Adds the test cases to check that AOTAutograd doesn't data-dependent error out when retracing due to not seeing the asserts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151590
Approved by: https://github.com/laithsakka
2025-04-23 23:07:00 +00:00
47ad351ff3 [DRAFT] Initial version of sticky export (#151047)
Summary: This makes torchnative demos and benchmarking of real models simpler by not requiring people to find example inputs first.

Test Plan: CI

Differential Revision: D72815584

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151047
Approved by: https://github.com/zhxchen17
2025-04-23 22:58:43 +00:00
bd191730ce [cutlass backend] Stop using GenerateSM80 for SM90 and SM100 (#150781)
Not urgent.

We don't use the GenerateSM80 ops I believe.

For SM100, we could skip SM90 as well. But I don't have data for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150781
Approved by: https://github.com/kadeng
2025-04-23 22:16:57 +00:00
dccb7a9cb2 [pytorch] use a mutex in initialize_torch_libraries (#151938)
Summary: The TORCH_LIBRARY_THREAD_UNSAFE_LAZY_INIT feature is thread unsafe for calling the initializers, but we want to allow the deferred initializer call to be safe from multiple threads. Add a mutex to ensure we have thread safe construction of the libraries post launch.
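
The locking pattern described here can be sketched in a few lines. This is an illustrative Python analogue, not the C++ change itself; `register_initializer`, `initialize_libraries` and the module-level registry are hypothetical names.

```python
import threading

_init_lock = threading.Lock()
_initialized = False
_pending_initializers = []  # hypothetical registry of deferred initializers


def register_initializer(fn):
    """Queue an initializer to run later (registration itself is not locked)."""
    _pending_initializers.append(fn)


def initialize_libraries():
    """Run every deferred initializer exactly once, even under concurrent calls."""
    global _initialized
    with _init_lock:
        if _initialized:
            return
        for fn in _pending_initializers:
            fn()
        _initialized = True
```

Holding the lock for the whole loop means a second caller blocks until initialization is complete, rather than returning while the libraries are half-constructed.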

Differential Revision: D73457714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151938
Approved by: https://github.com/swolchok, https://github.com/zou3519
2025-04-23 21:41:01 +00:00
562328501e Revert "Turn on static cuda launcher in OSS (#151691)"
This reverts commit e31e2d27c6739cad5327cc54e6ac9fd28a157cbf.

Reverted https://github.com/pytorch/pytorch/pull/151691 on behalf of https://github.com/malfet due to This breaks tests, see c1f51cf2c4/1 ([comment](https://github.com/pytorch/pytorch/pull/151691#issuecomment-2825427252))
2025-04-23 20:28:31 +00:00
98c53d8b39 Revert "[MPS] Fix test_neg_index_mps (#151966)"
This reverts commit 9422e24c472ccbaffc4cf3935e12d0a83f269560.

Reverted https://github.com/pytorch/pytorch/pull/151966 on behalf of https://github.com/malfet due to Looks like it broke halide testing, see https://github.com/pytorch/pytorch/actions/runs/14623941238/job/41034065229 ([comment](https://github.com/pytorch/pytorch/pull/151966#issuecomment-2825425305))
2025-04-23 20:25:49 +00:00
c1f51cf2c4 [map] defer importing AOTConfig and create_joint dependency (#151479)
Summary:
We reverted D72896450 due to a weird error that happens in a seemingly unrelated test: `buck2 run apf/data/tests:preproc_state_serializer_test -- --filter-text "test_load_artifact"`

I did some investigation and found that moving the imports of AOTConfig and create_joint inside create_fw_bw_graph delays importing the modules that AOTConfig and create_joint recursively import, from test construction time to test running time. The path.exists mock then gets called multiple times, due to the inspect.getsource calls in multiple places in torch.

Specifically, we set a breakpoint at the side effect of the mocked os.path.exists. P1787425831 shows the import stack trace before the change. P1787431638 shows the import stack trace after the change.

The notable difference is that in the second paste, constructing OnDiskPreprocStateSerializer triggers an os.path.exists call, because somewhere in triton inspect.getsourcelines is called, and that call gets recorded by the mock.

Looking at the test, it seems that what it actually wants to test is the deserialize step. So we call reset_mock before that step to avoid counting calls that happened at import time.
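
A minimal sketch of that reset-before-the-step pattern using only `unittest.mock`. `load_artifact` is a hypothetical stand-in for the deserialize step, and the earlier `os.path.exists` call simulates the one recorded during a delayed import.

```python
import os.path
from unittest import mock


def load_artifact(path):
    # Hypothetical stand-in for the deserialize step under test.
    return "loaded" if os.path.exists(path) else None


with mock.patch("os.path.exists", return_value=True) as exists_mock:
    # Simulate an unrelated call recorded earlier (e.g. during a lazy import).
    os.path.exists("/some/module.py")
    # Forget every call recorded so far; the configured return_value is kept.
    exists_mock.reset_mock()
    result = load_artifact("/artifact.bin")
    # Only the call made by the step under test is counted now.
    exists_mock.assert_called_once_with("/artifact.bin")
```

Without the `reset_mock()` call, `assert_called_once_with` would fail because the mock would have recorded two calls.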

Test Plan:
buck2 run apf/data/tests:preproc_state_serializer_test -- --filter-text "test_load_artifact"

and existing tests for map.

Differential Revision: D73138415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151479
Approved by: https://github.com/angelayi, https://github.com/zou3519
2025-04-23 19:16:40 +00:00
99ae7d4069 Reland fast gather and index implementation (#151917)
This PR reapplies #151490 and #151753 together, and adds some missing checks when applying the fast path.
Previously missed checks:
1) indexing path has the stride in the indexed dimension in bytes, gather path has the stride in the indexed dimension in elements. When checking if fast path is applicable, I didn't take this difference into account, and still multiplied the indexing stride by element size. Fixed and test added
2) We want to take fast path only when we are copying contiguous equally spaced slices of inputs + all the necessary alignment requirements. The effective tensor size should be 2d (after all possible flattening is applied), the index stride in the last dimension should be 0, and, since in the kernel we are not applying non-indexing-related offsets to src tensor, the src tensor stride in the second dimension should be 0. This automatically happens for gather with dim=0, so I didn't put in an explicit condition for this. Sometimes all conditions except first dim "effective" stride equal to 0 are satisfied for scatter on non-zero dim, when index size in the indexing dimension is 1 and thus it is collapsed (dimensions of size 1 are always collapsed), e.g.
```
        # test gather along 1st dim that can accidentally trigger fast path:
        # because the index size in the gather dim is 1,
        # an unexpected squashing in TensorIterator happens
        src = make_tensor((16, 2, 16), device=device, dtype=dtype)
        ind = torch.randint(2, (16, 1), device=device).view(16, 1, 1).expand(16, 1, 16)
        res = torch.gather(src, dim=1, index=ind)
        if res.device.type == "cuda":
            ref_cpu = torch.gather(src.cpu(), dim=1, index=ind.cpu())
            self.assertEqual(res.cpu(), ref_cpu, atol=0, rtol=0)
```
Note that if index size here was (16, 2, 16) instead of (16, 1, 16) then the middle dimension could not be collapsed and we wouldn't end up incorrectly taking fast path.
We could update the kernel to take this stride into account when computing offsets into src tensor, or we could specifically disallow non-zero stride on the first dimension. I took the second path for now.
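
The stride-unit mismatch from point 1 can be shown with plain integers. This is a hedged sketch, not the kernel's check: the helper names and the 16-byte alignment value are assumptions made for the illustration.

```python
def stride_in_bytes(stride, element_size, stride_is_in_bytes):
    """Normalize a stride to bytes exactly once.

    The indexing path already reports bytes, the gather path reports elements.
    """
    return stride if stride_is_in_bytes else stride * element_size


def slice_is_aligned(stride, element_size, stride_is_in_bytes, alignment=16):
    # alignment=16 is an illustrative value, not taken from the kernel.
    nbytes = stride_in_bytes(stride, element_size, stride_is_in_bytes)
    return nbytes % alignment == 0
```

For 4-byte elements, an indexing-path stride of 8 bytes is not 16-byte aligned, but scaling it by the element size a second time gives 32 and would wrongly pass the check.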

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151917
Approved by: https://github.com/eqy, https://github.com/malfet, https://github.com/Skylion007
2025-04-23 19:13:13 +00:00
69e41cee04 move find_hop_schema into _higher_order_ops/schema.py (#151147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151147
Approved by: https://github.com/zou3519
2025-04-23 18:26:37 +00:00
5acc3e286a [Inductor] Add Additional Configs for persistent+TMA version of Triton mm and addmm (#150587)
Summary:
This PR introduces additional autotuning configurations for the persistent+TMA version of Triton `mm` and `addmm` operations. The new configurations are as follows:
* `(128, 128, 64, 5, 8)`
* `(256, 128, 64, 4, 8)`
* `(128, 128, 64, 5, 4)`

These configurations were selected based on exhaustive autotuning performed on commonly used shapes from an internal foundational model.

While these new configs are generally more performant across the board, we see notable gains in a few specific cases:
* In scenarios where `n >> m, k`, the configurations `(128, 128, 64, 5, 8)` and `(256, 128, 64, 4, 8)` tend to produce an additional 5-10% speedup over the aten baseline compared to the original configurations.
* Similarly, the configuration `(128, 128, 64, 5, 4)` yields approximately an 8% improvement in scenarios where k >> m, n.

These enhancements are expected to provide performance benefits across diverse use cases, particularly when compared to the original set of configurations.

Test Plan:
contbuild & OSS CI

Reviewers: paulzhan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150587
Approved by: https://github.com/PaulZhang12, https://github.com/drisspg, https://github.com/eellison
2025-04-23 18:21:35 +00:00
3c1a17a08b [Dynamo] Use LazyVariableTracker in base VT (#151847)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151847
Approved by: https://github.com/StrongerXi
2025-04-23 18:18:01 +00:00
aa285e6512 Revert "[cutlass backend] delay construction of cutlass presets to when called (#151875)"
This reverts commit 8ca7953d510deb21cd99b92523f73beafa4588bf.

Reverted https://github.com/pytorch/pytorch/pull/151875 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/151875#issuecomment-2825030726))
2025-04-23 17:33:31 +00:00
5f63789dd2 [torchbind] fix error message when attr is a real tensor. (#151944)
Summary: Previously, when attr is defined, "if attr" would try to evaluate the data of attr, which is not intended, and we got an ugly error stack if attr was not evaluable (like a fake tensor) before the callable(attr) check.

Test Plan: Existing tests.

Reviewed By: yushangdi, henryoier

Differential Revision: D73460905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151944
Approved by: https://github.com/yushangdi
2025-04-23 17:32:11 +00:00
9344da8bd1 Revert "[fake tensor cache] Support index with non bool/int8 indices (#151477)"
This reverts commit bdb34f55a0c44f82d914dc9b41e785b2eed97675.

Reverted https://github.com/pytorch/pytorch/pull/151477 on behalf of https://github.com/wdvr due to reverting confusing ghstack state ([comment](https://github.com/pytorch/pytorch/pull/151477#issuecomment-2825023953))
2025-04-23 17:30:27 +00:00
348272e67e Revert "[invoke_subgraph][fake tensor] Add finalizer on subgraph instead of the functionalize ctx wrapper (#151633)"
This reverts commit 02dd096e5154867f6eb463d434b9eba0bdc85a64.

Reverted https://github.com/pytorch/pytorch/pull/151633 on behalf of https://github.com/wdvr due to reverting confusing ghstack state ([comment](https://github.com/pytorch/pytorch/pull/151633#issuecomment-2825007363))
2025-04-23 17:23:23 +00:00
2ab752d720 Make torch.jit.Error inherit from Exception (#151947)
Summary:
I can confirm that `torch.jit.Error.mro()` contains `Exception` in the inheritance hierarchy.

This avoids a bunch of `pyre-ignore`s in D73352417.

Test Plan: Sandcastle

Differential Revision: D73464544

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151947
Approved by: https://github.com/Skylion007
2025-04-23 17:19:25 +00:00
9422e24c47 [MPS] Fix test_neg_index_mps (#151966)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151966
Approved by: https://github.com/malfet
2025-04-23 17:06:28 +00:00
a560216abb Update description for torch.random.fork_rng (#151881)
As the title stated.

Related ISSUE:
https://github.com/pytorch/pytorch/issues/151784
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151881
Approved by: https://github.com/albanD
2025-04-23 16:59:29 +00:00
05114679b7 [ROCm] AtomicAdd specialization on AMD for fp64. (#151724)
Fixes https://github.com/pytorch/pytorch/issues/151039

Improve scatter add performance on MI250X.

Some numbers from the reporter's benchmark:
```
Before: dtype torch.float64 time =  3.577979326248169
After: dtype torch.float64 time =  0.0031385421752929688
```
No perf. improvement to MI300 or MI100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151724
Approved by: https://github.com/jeffdaily
2025-04-23 16:33:32 +00:00
e31e2d27c6 Turn on static cuda launcher in OSS (#151691)
After a few small bugfixes on tests (to make it so we throw/catch similar exceptions to triton), I think we're ready to flip the switch and use StaticCudaLauncher on by default in OSS.

Initial round of benchmarks look good, with average compilation time going down by a few percent:
<img width="828" alt="image" src="https://github.com/user-attachments/assets/cad03e09-b4d6-49a7-a9e5-6068d1c0bd5c" />

With no changes to runtime perf:
<img width="823" alt="image" src="https://github.com/user-attachments/assets/3fcd435e-1057-43f4-878b-8d66a3812a10" />

There are a few noisy models I want to double check, though, so will run some more tests before accepting review.

Full benchmark results, showing a ~5% compile time improvement across the board:
https://hud.pytorch.org/benchmark/huggingface/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Wed%2C%2016%20Apr%202025%2002%3A31%3A12%20GMT&stopTime=Wed%2C%2023%20Apr%202025%2002%3A31%3A12%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/jamesjwu/139/orig&lCommit=cc45c8667fa23dec16ca50002d9504a34688ca5c&rBranch=main&rCommit=2a9afdae81d0dde98e96d7e3c9ca840e241e5405
<img width="1482" alt="image" src="https://github.com/user-attachments/assets/6e6a7f39-7f44-459f-9845-9a37f084ea82" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151691
Approved by: https://github.com/oulgen
2025-04-23 15:43:24 +00:00
dcc32ff5bf [CUDA][cuBLAS][cuBLASLt] Opt-in unified cuBLAS + cuBLASLt workspaces (#151163)
opt-in version of https://github.com/pytorch/pytorch/pull/145130 as there was a lack of repro for the 70% forward issue
`TORCH_CUBLASLT_UNIFIED_WORKSPACE=1`

@izaitsevfb could you comment if it was repeatable per every forward pass, on startup, or something else?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151163
Approved by: https://github.com/ngimel
2025-04-23 15:24:22 +00:00
7310049c42 Revert "[FlexAttention] Fix device test instantation (#151846)"
This reverts commit b37fa20771a7aa1ddcfaf59df7e56683d3d0be3b.

Reverted https://github.com/pytorch/pytorch/pull/151846 on behalf of https://github.com/jithunnair-amd due to PR broke rocm workflow ([comment](https://github.com/pytorch/pytorch/pull/151846#issuecomment-2824607429))
2025-04-23 15:01:36 +00:00
21b0ef520d [Easy] Remove redundant code (#151883)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151883
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-04-23 14:25:19 +00:00
b32b002a6e [BE] Replace std::runtime_error with TORCH_CHECK [1/N] (#151880)
Part of: #148114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151880
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/cyyever
2025-04-23 11:14:35 +00:00
6d28d61323 [CI] Remove protobuf from docker image (#151933)
Pretty sure the source should be the one in third-party

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151933
Approved by: https://github.com/huydhn
2025-04-23 10:29:09 +00:00
5b9df57b50 [dynamo] context manager/decorator for dynamo config patching during tracing (#150586)
Implement traceable config patching for Dynamo: enables restricted patching of Dynamo config where user can use a context manager/decorator to change tracing behavior for parts of the code.

The new `dont_skip_tracing` decorator/context manager for ignoring most trace rules is easily implemented with this more generic traceable config patching feature.

Implementation:
- Create a new specialized context manager class representing a wrapper around torch._dynamo.config.patch
- Dynamo doesn't trace into the context manager but updates config at compile time
- Correctness is based on our correctness for handling supported context managers
- Implementation is inspired by how `GradModeVariable` is implemented.
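
For readers unfamiliar with objects that work as both a context manager and a decorator, here is a plain-Python sketch of that shape. It says nothing about how Dynamo traces such an object; `config` and `patch_config` are hypothetical stand-ins, not the torch._dynamo API.

```python
from contextlib import ContextDecorator

config = {"skip_tracing": True}  # hypothetical stand-in for a config module


class patch_config(ContextDecorator):
    """Temporarily override config entries, as a `with` block or a decorator."""

    def __init__(self, **changes):
        self.changes = changes
        self._saved = []  # stack, so the same object can be re-entered

    def __enter__(self):
        self._saved.append({key: config[key] for key in self.changes})
        config.update(self.changes)
        return self

    def __exit__(self, *exc):
        config.update(self._saved.pop())
        return False  # never swallow exceptions


@patch_config(skip_tracing=False)
def traced_fn():
    return config["skip_tracing"]
```

Calling `traced_fn()` sees the patched value, and the original value is restored once the call returns.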

Previous attempts: https://github.com/pytorch/pytorch/pull/148736 (decorator-only global approach) and https://github.com/pytorch/pytorch/pull/149439 (decorator-only traceback approach)

See https://docs.google.com/document/d/1vWNwKL_jpg-PLopifcaSa338wks3GqSVF4GHRguybGg/edit?tab=t.0 for more details on implementation - including previous approaches.

NOTE: this PR fixes a bug where skipped code objects were not tracked by convert_frame.py, leading to cases where code objects would be automatically skipped even after `torch._dynamo.reset()`. This exposed some latent dynamo-wrapped test failures in CI that previously passed in CI but not locally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150586
Approved by: https://github.com/jansel, https://github.com/zou3519, https://github.com/anijain2305
2025-04-23 09:12:13 +00:00
62b5649b76 [Inductor] Test ND block pointers with dynamic shapes (#151646)
With ND tiling, we can get multi-dimensional block pointers with dynamic shapes. This is an important capability, but I couldn't find any CI tests for it. This PR adds a couple of tests checking that we get the expected block pointers with dynamic shapes, both for pointwise and reduction kernels.

Example kernels:
```
@triton.jit
def triton_poi_fused_div_0(in_ptr0, out_ptr0, ks0, ks1, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    yoffset = (tl.program_id(1) + tl.program_id(2) * tl.num_programs(1)) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[:, None]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[None, :]
    xmask = xindex < xnumel
    x1 = xindex
    y0 = yindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[ks0, ks0], strides=[ks1, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), boundary_check=[0, 1])
    tmp1 = (tmp0 / tmp0)
    tl.store(tl.make_block_ptr(out_ptr0, shape=[ks0, ks0], strides=[ks0, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), tl.broadcast_to(tmp1, [YBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])

@triton.jit
def triton_red_fused_prod_0(in_ptr0, out_ptr0, ks0, ks1, xnumel, r0_numel, r1_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr, R1_BLOCK : tl.constexpr):
    xnumel = 1
    rnumel = r0_numel * r1_numel
    RBLOCK: tl.constexpr = R0_BLOCK*R1_BLOCK
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None]
    xmask = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], True, tl.int1)
    r0_base = tl.arange(0, R0_BLOCK)[None, :, None]
    r1_base = tl.arange(0, R1_BLOCK)[None, None, :]
    rbase = r1_base + r0_base*r1_numel
    block_ptr0 = tl.make_block_ptr(in_ptr0, shape=[ks0, ks0], strides=[ks1, 1], block_shape=[R0_BLOCK, R1_BLOCK], order=[1, 0], offsets=[0, 0])
    _tmp2 = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], 1, tl.float32)
    for r0_offset in range(0, r0_numel, R0_BLOCK):
        r0_index = r0_offset + r0_base
        r0_mask = r0_index < r0_numel
        for r1_offset in range(0, r1_numel, R1_BLOCK):
            r1_index = r1_offset + r1_base
            r1_mask = r1_index < r1_numel
            roffset = r1_offset + r0_offset*r1_numel
            rindex = r1_index + r0_index*r1_numel
            r0_0 = r0_index
            r1_1 = r1_index
            tmp0 = tl.load(block_ptr0, boundary_check=[0, 1], padding_option='zero', eviction_policy='evict_first')[None, :, :]
            tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK, R1_BLOCK])
            tmp3 = _tmp2 * tmp1
            _tmp2 = tl.where(r0_mask & r1_mask, tmp3, _tmp2)
            block_ptr0 = tl.advance(block_ptr0, [0, R1_BLOCK])
        block_ptr0 = tl.advance(block_ptr0, [R0_BLOCK, (-1)*R1_BLOCK*(triton_helpers.div_floor_integer((-1) + ks0 + R1_BLOCK,  R1_BLOCK))])
    tmp4 = tl.reshape(_tmp2, [XBLOCK, RBLOCK])
    tmp2 = triton_helpers.prod(tmp4, 1)[:, None, None]
    tl.store(out_ptr0 + (tl.full([XBLOCK, 1, 1], 0, tl.int32)), tmp2, None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151646
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/shunting314
2025-04-23 06:20:04 +00:00
ee81fe40c1 Support regexes in dynamic sources allowlist (#151766)
As requested by Shuai. I also included an additional refactor to capture
changes in the allowlist over time, since previously, once it was first
set, it was impossible to override when a new config was set.
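
A rough sketch of regex matching against an allowlist. It assumes the allowlist is a comma-separated string and that a pattern must match the whole source name; both are assumptions for illustration, and `is_dynamic_source` is a hypothetical helper.

```python
import re


def is_dynamic_source(source_name, allowlist):
    """Return True if source_name fully matches any comma-separated pattern."""
    patterns = [p.strip() for p in allowlist.split(",") if p.strip()]
    return any(re.fullmatch(p, source_name) for p in patterns)
```

Parsing the string on every call, instead of caching the first parse, is one simple way to pick up later changes to the setting. Note that splitting on commas means patterns such as `a{1,2}` cannot be expressed in this sketch.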

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151766
Approved by: https://github.com/pianpwk
2025-04-23 06:17:16 +00:00
7c97720d16 [dynamic shapes] rewrite expand with guard_or_false (#150236)
Rewrites the expand decomposition to avoid unbacked errors, assuming the general path where `input shape == output shape or input shape == 1`.
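
The shape rule named above, written out for concrete integer sizes. This sketch only shows the per-dimension condition; it does not model unbacked symbols or `guard_or_false`, and `expand_dims` is a hypothetical helper.

```python
def expand_dims(input_shape, output_shape):
    """Return, per output dim, whether the input is broadcast along it.

    Each input dim (right-aligned against the output) must either equal the
    output dim or be 1; anything else is rejected.
    """
    if len(input_shape) > len(output_shape):
        raise ValueError("output must have at least as many dims as input")
    pad = len(output_shape) - len(input_shape)
    padded = (1,) * pad + tuple(input_shape)
    broadcast = []
    for in_dim, out_dim in zip(padded, output_shape):
        if in_dim == out_dim:
            broadcast.append(False)
        elif in_dim == 1:
            broadcast.append(True)
        else:
            raise ValueError(f"cannot expand size {in_dim} to {out_dim}")
    return broadcast
```

With symbolic sizes neither comparison may be decidable at trace time, which is the situation the rewrite is meant to handle.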

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150236
Approved by: https://github.com/laithsakka
2025-04-23 06:11:11 +00:00
097faa9217 [audio hash update] update the pinned audio hash (#151729)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151729
Approved by: https://github.com/pytorchbot, https://github.com/Skylion007
2025-04-23 06:04:32 +00:00
b247e5db33 [Inductor][CPU] Add GEMM templates for _weight_int4pack_mm_for_cpu with AMX (#150603)
**Summary**
It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU.

This PR adds AMX-based GEMM templates for `torch.ops.aten._weight_int4pack_mm_for_cpu`. It brings performance benefits on platforms where AMX is available.

**Validation results**
We have run GPT-J-6B and Llama-3-8B-Instruct on a 6th gen Xeon with 96 cores. Results show that the AMX-based microkernel outperforms AVX512-based one by >5x for prefill stage with 1024 input length.

**Test plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150603
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-04-23 05:58:55 +00:00
54f736155b [dynamic shapes] guard_or_false for _reshape_view_helper, utils._infer_size for wildcard dims (#150127)
For reshape/view: removes fast paths for 0 elements, checking dimensions to skip. Modifies the loop accumulating input elements, to raise a UserError if we run out of dimensions, graph breaking for compile and erroring out for export.
For infer_size: assumes if user passes us an unbacked, it's probably not -1
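
For reference, wildcard inference on concrete integers looks roughly like this. It is a sketch of the general `-1` rule only, not of `utils._infer_size`, and it does not model unbacked sizes.

```python
def infer_size(shape, numel):
    """Resolve at most one -1 wildcard so the product of dims equals numel."""
    wildcard = None
    known = 1
    for i, dim in enumerate(shape):
        if dim == -1:
            if wildcard is not None:
                raise ValueError("only one dimension can be inferred")
            wildcard = i
        elif dim >= 0:
            known *= dim
        else:
            raise ValueError(f"invalid dimension {dim}")
    if wildcard is None:
        if known != numel:
            raise ValueError("shape is incompatible with numel")
        return tuple(shape)
    if known == 0 or numel % known != 0:
        raise ValueError("cannot infer the wildcard dimension")
    result = list(shape)
    result[wildcard] = numel // known
    return tuple(result)
```

The `dim == -1` test is the comparison that cannot be evaluated for an unbacked size, hence the assumption described above.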

Will think about changes in https://docs.google.com/document/d/1WYx6EZwVDXtBnWyrzoecgGWdiK0V3XZKftfpWwQ5i3E/edit?tab=t.0#heading=h.22k54zym11qp in a later PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150127
Approved by: https://github.com/laithsakka
2025-04-23 05:42:30 +00:00
b37fa20771 [FlexAttention] Fix device test instantation (#151846)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151846
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng, https://github.com/mlazos
2025-04-23 05:37:25 +00:00
cc793e895e [StandaloneCompile] Autotune at compile time (#151922)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151922
Approved by: https://github.com/jamesjwu
ghstack dependencies: #151921
2025-04-23 04:32:06 +00:00
f9bdfe90ae [MegaCache] Return None on no compilation (#151921)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151921
Approved by: https://github.com/jamesjwu
2025-04-23 04:32:06 +00:00
78bbb468c6 Use /var/tmp instead of /tmp for torch cache directory on fbcode (#151466)
Summary:
We've been noticing that the cache directory has been getting cleaned underneath us, so let's use /var/tmp, which is supposed to be cleaned less frequently.

https://fb.workplace.com/groups/257735836456307/posts/883428143887070

Test Plan: unit tests

Reviewed By: masnesral

Differential Revision: D73008663

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151466
Approved by: https://github.com/masnesral
2025-04-23 03:30:51 +00:00
253059356f [Cutlass] Implement EVT example tensor creation (#150904)
This PR implements a translation layer from inductor IR to "example tensors", the expected arguments of the EVT tracer. These tensors basically store the name, shape, stride, and dtype of the tensor and allow an AST-based Python parser to generate the EVT C++.
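
A metadata-only record of this kind might look like the sketch below. `ExampleTensor`, `contiguous_strides` and `make_example` are hypothetical names for illustration; they are not the Inductor or CUTLASS classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExampleTensor:
    """Metadata-only stand-in for a tensor: no data, just its description."""

    name: str
    shape: tuple
    stride: tuple
    dtype: str


def contiguous_strides(shape):
    """Row-major strides, in elements, for the given shape."""
    strides = []
    running = 1
    for dim in reversed(shape):
        strides.append(running)
        running *= dim
    return tuple(reversed(strides))


def make_example(name, shape, dtype):
    """Build an ExampleTensor assuming a contiguous layout."""
    return ExampleTensor(name, tuple(shape), contiguous_strides(shape), dtype)
```

For instance, `make_example("acc", (4, 8), "float16")` carries stride `(8, 1)` without allocating any storage.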

updates to example tensor creation

Previously merged:
* https://github.com/pytorch/pytorch/pull/150903
* https://github.com/pytorch/pytorch/pull/150346
* https://github.com/pytorch/pytorch/pull/150345
* https://github.com/pytorch/pytorch/pull/150344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150904
Approved by: https://github.com/eellison
2025-04-23 03:26:56 +00:00
cd021d048e Fix circular imports (#151939)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151939
Approved by: https://github.com/jamesjwu
2025-04-23 02:53:32 +00:00
13339ce086 [dynamic shapes] bound_sympy for size-oblivious min/max reasoning (#151242)
Differential Revision: D72978020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151242
Approved by: https://github.com/bobrenjc93
2025-04-23 02:14:05 +00:00
74074fe8d8 [inductor] handle offset in ReinterpretView for alignment (#151859)
Fix https://github.com/pytorch/pytorch/issues/151589

It's interesting that the Q4_K dequantization example in the referenced GH issue does not crash even though Inductor passes Triton the wrong alignment information. I dug into this a bit. The main reason is that two things in Triton decide the vectorization size:
1. alignment
2. the max number of contiguous elements a thread needs to process

Here is the triton code that decides vectorization size [link](c5fed8e1ca/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/LoadStoreOpToLLVM.cpp (L147-L157)), and here is the triton code that considers contiguity for vectorization [link](c5fed8e1ca/lib/Analysis/AxisInfo.cpp (L1250-L1269))

When Inductor wrongly tells Triton that an unaligned tensor is aligned, Triton may still not vectorize (or not fully vectorize) because of the second restriction.

Check this test:
```
    @parametrize(
        "size",
        (
            128,
            1024,
            1024 * 1024,
        ),
    )
    def test_slice_view_dtype(self, size):
        offset = 1

        def f(x):
            return x[2:].view(dtype=torch.float32) + 1

        x = torch.randn((size + offset) * 2, dtype=torch.bfloat16, device=self.device)
        self.common(f, (x,), reference_in_float=False)
```

Before the fix, Inductor would tell Triton that the output tensor of aten.view.dtype is aligned even though it's not. That tensor is then passed to the Triton kernel for the aten.add. Triton may make different vectorization decisions depending on the tensor size:
1. when size = 128, Triton picks ld.global.b32 to load data from global memory
2. when size = 1024, Triton uses ld.global.v2.b32
3. when size = 1024 * 1024, Triton uses ld.global.v4.b32

So whether wrong alignment metadata causes an issue depends on whether Triton picks the vectorized instructions. That in turn depends on the Triton config (block size) chosen by Inductor and on Triton's internal logic (how it assigns elements to each thread). We'd better make sure Inductor always generates correct metadata so that such hidden issues do not turn into crashes later.
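
The interaction between the two limits can be modeled with a short sketch. This is a simplified illustration, not Triton's actual logic: the function name `vector_width_bytes`, the 16-byte cap, the base address, and the per-thread element counts are all assumptions, chosen so the results mirror the three load widths listed above.

```python
def largest_pow2_at_most(n: int) -> int:
    """Largest power of two that is <= n (n must be >= 1)."""
    return 1 << (n.bit_length() - 1)


def vector_width_bytes(claimed_alignment: int, contiguous_elems: int,
                       elem_bytes: int, max_bytes: int = 16) -> int:
    """Toy model: a load can be no wider than the claimed pointer alignment,
    the contiguous bytes one thread handles, or the hardware cap."""
    limit = min(claimed_alignment, contiguous_elems * elem_bytes, max_bytes)
    return largest_pow2_at_most(limit)


# x[2:] on a bfloat16 tensor starts 2 elements * 2 bytes = 4 bytes into the
# storage, so the view is not 16-byte aligned even if the storage base is
# (a base address of 0 is assumed here).
byte_offset = 2 * 2
is_16_byte_aligned = byte_offset % 16 == 0

# Before the fix, 16-byte alignment is still claimed. Whether a wide load is
# chosen then depends only on per-thread contiguity (assumed: 1, 2, 4 floats).
widths = [vector_width_bytes(16, elems, elem_bytes=4) for elems in (1, 2, 4)]
```

In this model only the 4-byte load is safe for a view that starts 4 bytes into an aligned buffer; the 8-byte and 16-byte loads rely on alignment the view does not have.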

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151859
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #151841
2025-04-23 01:50:49 +00:00
68a7501dab [Inductor][CPP] Fix Codegen Issue when Parallel Reduction under the vectorization (#151887)
**Summary**
Fixes [#151290](https://github.com/pytorch/pytorch/issues/151290) and [#151523](https://github.com/pytorch/pytorch/issues/151523), which are regressions introduced by [#144020](https://github.com/pytorch/pytorch/pull/144020). That PR enabled parallelization at the inner loop level.

However, a currently unsupported case arises when parallel reduction occurs under the vectorization loop level, specifically in patterns like:
```
for vec_loop_level:
    do_parallel_reduction
```
In such cases, a temporary buffer `tmp_acc_array` is allocated for tail scalar kernels, and another temporary buffer `tmp_acc_array` is also defined for parallel reduction. This results in a conflict due to overlapping temporary buffers. This PR disables the problematic case to avoid the conflict until proper support is implemented.

**Test Plan**
```
python test/inductor/test_flex_attention.py -k test_make_block_mask_cpu
python test/inductor/test_cpu_repro.py -k test_parallel_reduction_vectorization
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151887
Approved by: https://github.com/jansel
2025-04-23 00:41:14 +00:00
015b526a2a [MPSInductor] Warn-cast double as floats (#151963)
To support sqrt over dynamic shapes, i.e. make something like:
```python
torch.compile(dynamic=True)(lambda x: x * math.sqrt(x.size(0)))
```
compilable into
```metal
// Source node to ATen node mapping:
// Graph fragment:
//   %scalar_tensor_default : [num_users=1] = call_function[target=torch.ops.aten.scalar_tensor.default](args = (%arg0_1,), kwargs = {})
//   %convert_element_type_default : [num_users=1] = call_function[target=torch.ops.prims.convert_element_type.default](args = (%scalar_tensor_default, torch.float64), kwargs = {})
//   %sqrt_default : [num_users=1] = call_function[target=torch.ops.aten.sqrt.default](args = (%convert_element_type_default,), kwargs = {})
//   %convert_element_type_default_1 : [num_users=1] = call_function[target=torch.ops.prims.convert_element_type.default](args = (%sqrt_default, torch.float32), kwargs = {})
//   %mul_tensor : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg1_1, %convert_element_type_default_1), kwargs = {})
 kernel void generated_kernel(
     device float* out_ptr0,
     constant float* in_ptr0,
     constant long& ks0,
     uint xindex [[thread_position_in_grid]]
 ) {
     int x0 = xindex;
     auto tmp0 = in_ptr0[x0];
     auto tmp1 = ks0;
     auto tmp2 = static_cast<float>(tmp1);
     auto tmp3 = metal::sqrt(tmp2);
     auto tmp4 = static_cast<float>(tmp3);
     auto tmp5 = tmp0 * tmp4;
     out_ptr0[x0] = static_cast<float>(tmp5);
 }
```

TODO:
 - Figure out if this could be tweaked in fx-passes, but overhead is probably too high

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151963
Approved by: https://github.com/dcci
ghstack dependencies: #151869, #151871, #151872
2025-04-23 00:30:45 +00:00
49b7ffbb15 [MPS] Implement _print_Trunc_to_Int (#151964)
Fixes `test_device_assert_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151964
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-23 00:30:00 +00:00
72f711e200 Revert "[inductor] Change minimum number of SMs to 60 to let Ada use Triton GEMM backend (#150888)"
This reverts commit 8d81806211bc3c0ee6c2ef235017bacf1d775a85.

Reverted https://github.com/pytorch/pytorch/pull/150888 on behalf of https://github.com/henrylhtsang due to Revert because this change isn't needed ([comment](https://github.com/pytorch/pytorch/pull/150888#issuecomment-2822768377))
2025-04-23 00:26:49 +00:00
334aab0dea Updates NCCLConfig with QOS variable (#151821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151821
Approved by: https://github.com/kwen2501
2025-04-23 00:03:49 +00:00
aa61707a56 Fix extra heap allocation in Source constructor (#151800)
This was a sneaky one: the StringCordView default constructor allocates.

Differential Revision: [D73129448](https://our.internmc.facebook.com/intern/diff/D73129448/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151800
Approved by: https://github.com/malfet, https://github.com/cyyever, https://github.com/Skylion007
ghstack dependencies: #151682
2025-04-22 23:36:06 +00:00
cd576fdce5 [torch][fx] Add support for EXIR dialect overload ops in normalize_function (#143689)
Summary:
I had a minor annoyance when debugging graphs using EXIR dialect ops,
that all the function normalization went away. For functions with > 5 arguments,
some of which are just simple bools and ints, it's very helpful to have
the kwarg names attached.

Enhance `normalize_target` to handle EdgeOpOverload targets. To avoid
a circular dependency on Executorch from pytorch core, I just use a `hasattr`
check for "_op". This only happens if the target is not already a recognized
torch function.

Also, I noticed that the new `fx.Node.normalized_arguments` function
didn't forward an important kwarg to `normalize_target`, so I fixed that too.

Test Plan: Tested with FxGraphDrawer and an fx Graph containing EXIR nodes.

Differential Revision: D67545909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143689
Approved by: https://github.com/angelayi
2025-04-22 23:36:02 +00:00
4f8adde5ce Speed up OperatorEntry construction by avoiding updateDispatchTableFull_ (#151682)
The purpose of the updateDispatchTableFull_ call is, according to the comment, just to pick up fallback kernels if there are any. We can implement that directly more efficiently.

Differential Revision: [D73129447](https://our.internmc.facebook.com/intern/diff/D73129447/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151682
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/bdhirsh
2025-04-22 23:35:53 +00:00
c98340e268 [autodeps2] Replace third-party/pyyaml with third-party/pypi/pyyaml (#151668)
Summary: We should use the pypi version.

Test Plan: CI

Differential Revision: D73211869

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151668
Approved by: https://github.com/Skylion007
2025-04-22 23:27:13 +00:00
f4ac9a160d [fx] Filter stacktrace (#151029)
Filtering out the stacktrace so that the stacktrace on nodes when using fx.Tracer looks nicer. I just copied the filtering we have in [proxy_tensor.py](6720d23969/torch/fx/experimental/proxy_tensor.py (L1903-L1931)).

Previously the stacktrace looked like:
```
File "/data/users/angelayi/pytorch/moo.py", line 3964, in <module>
    run_tests()
  File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 1342, in run_tests
    unittest.main(argv=argv)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/main.py", line 101, in __init__
    self.runTests()
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/main.py", line 271, in runTests
    self.result = testRunner.run(self.test)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/runner.py", line 184, in run
    test(result)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/suite.py", line 122, in run
    test(result)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/suite.py", line 122, in run
    test(result)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/case.py", line 650, in __call__
    return self.run(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 3324, in run
    self._run_custom(
  File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 3296, in _run_custom
    super_run(result=result)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "/home/angelayi/.conda/envs/pytorch-3.10/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 3156, in wrapper
    method(*args, **kwargs)
  File "/data/users/angelayi/pytorch/moo.py", line 1495, in test_stack_trace
    gm = torch.fx.GraphModule(m, tracer.trace(m))
  File "/data/users/angelayi/pytorch/torch/fx/_symbolic_trace.py", line 837, in trace
    (self.create_arg(fn(*args)),),
  File "/data/users/angelayi/pytorch/moo.py", line 1485, in forward
    x = x * 2
  File "/data/users/angelayi/pytorch/torch/fx/proxy.py", line 716, in impl
    return tracer.create_proxy("call_function", target, args, kwargs)
  File "/data/users/angelayi/pytorch/torch/fx/proxy.py", line 248, in create_proxy
    proxy.node.stack_trace = "".join(CapturedTraceback.extract().format())
```
Now it looks like:
```
File "/data/users/angelayi/pytorch/moo.py", line 1485, in forward
    x = x * 2
```
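
A minimal sketch of this kind of frame filtering, using only the standard `traceback` module. The helper `filter_frames`, the example paths, and the path fragments are assumptions for illustration; the filter copied from `proxy_tensor.py` may apply different rules.

```python
import traceback


def filter_frames(frames, skip_substrings):
    """Keep only frames whose filename contains none of the given fragments."""
    return [
        frame for frame in frames
        if not any(fragment in frame.filename for fragment in skip_substrings)
    ]


# A hand-built stack: (filename, lineno, function name, source line).
frames = traceback.StackSummary.from_list([
    ("/env/lib/python3.10/unittest/case.py", 549, "_callTestMethod", "method()"),
    ("/repo/moo.py", 1485, "forward", "x = x * 2"),
    ("/repo/torch/fx/proxy.py", 716, "impl", "return tracer.create_proxy(...)"),
])

# Drop test-runner and tracer-internal frames, keeping user code only.
kept = filter_frames(frames, ("/unittest/", "/torch/fx/"))
filtered_trace = "".join(traceback.StackSummary.from_list(kept).format())
```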

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151029
Approved by: https://github.com/jfix71, https://github.com/zou3519, https://github.com/jingsh
2025-04-22 22:50:36 +00:00
a7ccd96bbf logging start of torch elastic workers. (#150849)
Summary:
We would like to log start of the workers. It will help with complete logging.

Test Plan:
unit tests

https://www.internalfb.com/intern/testinfra/testrun/6473924724652056

e2e tests
https://www.internalfb.com/mlhub/pipelines/runs/mast/f712311762-27449483648-TrainingApplication_V403K?job_attempt=0&version=0&tab=execution_details&env=PRODUCTION

Reviewed By: tnykiel

Differential Revision: D72297314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150849
Approved by: https://github.com/d4l3k, https://github.com/kiukchung
2025-04-22 22:35:06 +00:00
6a1b820255 [export] Enable symint inputs for AdditionalInputs and ShapesCollection (#151842)
With `AdditionalInputs`, the behavior is the same as with tensors:
```python
class M(torch.nn.Module):
    def forward(self, x, y):
        return x + y

additional_inputs = torch.export.AdditionalInputs()
additional_inputs.add((5, 5))
additional_inputs.add((3, 5))
additional_inputs.add((5, 4))
ep = torch.export.export(
    M(), (6, 7), dynamic_shapes=additional_inputs, strict=False
)
```

With `ShapesCollection`, we now need to wrap integer inputs as `_IntWrapper` so that we can have a unique identifier for each integer input.
```python
class M(torch.nn.Module):
    def forward(self, x, y):
        return x + y

from torch.export.dynamic_shapes import _IntWrapper

args = (_IntWrapper(5), _IntWrapper(5))
# Or we can do `args = pytree.tree_map_only(int, lambda a: _IntWrapper(a), orig_args)`
shapes_collection = torch.export.ShapesCollection()
shapes_collection[args[0]] = Dim.DYNAMIC
shapes_collection[args[1]] = Dim.DYNAMIC
ep = torch.export.export(
    M(), args, dynamic_shapes=shapes_collection, strict=False
)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151842
Approved by: https://github.com/pianpwk
2025-04-22 22:29:18 +00:00
43de9b75c3 Remove mention of magma-cuda in readme.md, refactor magma_conda install (#147476)
Related to: https://github.com/pytorch/pytorch/issues/138506 we migrated magma-cuda build from anaconda to aws
Last version of magma-cuda published was 12.6 https://anaconda.org/pytorch/magma-cuda126

Here is the PR that moved from anaconda to tarball: https://github.com/pytorch/pytorch/pull/140417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147476
Approved by: https://github.com/albanD
2025-04-22 22:08:49 +00:00
c0b70f94e2 [Testing] Enable test_mutations_loop_fusion_mps (#151872)
By testing it against float32 rather than double dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151872
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #151869, #151871
2025-04-22 22:00:16 +00:00
2f851ac8f8 [MPSInductor] Implement atomic_add store mode (#151871)
Which fixes `GPUTests.test_index_put2_mps`, `GPUTests.test__unsafe_masked_index_put_accumulate_mps` and a dozen scatter/gather tests that relied on atomic_add store mode

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151871
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #151869
2025-04-22 22:00:16 +00:00
3aecf2dc52 [MPS] Extend index_put to half precision floats (#151869)
By reusing `c10/metal/atomic.h`
This also fixes `GPUTests.test_index_put_fallback[12]_mps` that is unrolled by inductor, so no need for dedicated atomic_add support

TODOs:
 - Get rid of indexing kernel and compute it directly when kernel is run
 - Simulate atomic_add for int64 types as series of int32 atomic-add-and-fetch
 - Setup tolerances correctly to pass float16/bfloat16 tests (as CPU always takes sequential strategy)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151869
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-04-22 22:00:08 +00:00
b8f4dc5a9f [ROCm] opportunistic fastatomics for ReduceAdd operations for MI300 GPUs (#146264)
In this approach, we catch lanes within a wave that are doing fastatomics to the same destination address and compute the sum on the CU. This leads to a 3x improvement in scatter_add performance and a 2x improvement in index_select.
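
The idea can be modeled on the CPU with a conceptual sketch. It is not the HIP implementation: the function name `scatter_add_wave_combined` and the tiny wave size are assumptions, and a plain `+=` stands in for one atomic add.

```python
from collections import defaultdict


def scatter_add_wave_combined(dst, index, src, wave_size=64):
    """Add src[i] into dst[index[i]], pre-summing values that share a
    destination within each group of `wave_size` lanes.
    Returns the number of (simulated) atomic adds issued."""
    atomics_issued = 0
    for start in range(0, len(index), wave_size):
        partial = defaultdict(float)
        lanes = zip(index[start:start + wave_size], src[start:start + wave_size])
        for destination, value in lanes:
            partial[destination] += value      # combined locally, no atomic
        for destination, total in partial.items():
            dst[destination] += total          # stands in for one atomic add
            atomics_issued += 1
    return atomics_issued


dst = [0.0] * 4
index = [0, 0, 1, 0, 3, 3, 1, 0]
src = [1.0] * 8
issued = scatter_add_wave_combined(dst, index, src, wave_size=4)
```

With many lanes hitting few destinations, as in the reproducer where about 1.3 million updates target 100 slots, most updates collapse before any atomic is issued.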

scatter_add performance on MI300x:
dtype|Baseline (before optimizations)|opportunistic fastatomics
-------|----------------------------------|----------------------------------
f32|1.389425039|0.430447996
fp16|2.195472956|0.779729486
bf16|2.194051027|0.784599513

Using the following reproducer
```
import torch
import triton

def main():
    dtype = torch.float32
    dim = 1305301
    a = torch.rand(100, device="cuda", dtype=dtype)
    index = torch.randint(0, 100, (dim,), device="cuda")
    src = torch.rand(dim, device="cuda", dtype=dtype)

    print("=" * 20)
    print(
        triton.testing.do_bench(
            lambda: a.scatter_add(0, index, src),
            return_mode="median",
        )
    )
    print("=" * 20)

if __name__ == "__main__":
    main()
```

co-authored by: @amd-hhashemi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146264
Approved by: https://github.com/jeffdaily, https://github.com/mxz297

Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
2025-04-22 21:55:40 +00:00
e05ac9b794 Use folder tagged docker images for binary builds (#151706)
Should be the last part of https://github.com/pytorch/pytorch/pull/150558, except maybe for the s390x stuff, where I'm still not sure what's going on

For binary builds, do what we do in CI: tag each image with a hash of the .ci/docker folder to ensure that a docker image built from that commit gets used. Previously it would use imagename:arch-main, which could be a version of the image based on an older commit
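
A minimal sketch of content-based image tagging. The hashing scheme (SHA-256 over relative paths and file contents), the helper names, and the image name `example-builder` are assumptions; the actual CI may derive the folder hash another way.

```python
import hashlib
import tempfile
from pathlib import Path


def folder_hash(root: Path) -> str:
    """Hash every file under root, in sorted order, by relative path and content."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def image_tag(image_name: str, root: Path) -> str:
    """Tag that changes whenever anything in the folder changes."""
    return f"{image_name}:{folder_hash(root)[:12]}"


with tempfile.TemporaryDirectory() as tmp:
    docker_dir = Path(tmp)
    (docker_dir / "Dockerfile").write_text("FROM ubuntu:22.04\n")
    tag_before = image_tag("example-builder", docker_dir)
    tag_same = image_tag("example-builder", docker_dir)
    (docker_dir / "Dockerfile").write_text("FROM ubuntu:24.04\n")
    tag_after = image_tag("example-builder", docker_dir)
```

An unchanged folder always maps to the same tag, which is also why the first con listed below can occur: a rebuild with an identical folder hash reuses the same tag.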

After this, changing a docker image and then tagging with ciflow/binaries on the same PR should use the new docker images

Release and main builds should still pull from docker io

Cons:
* if someone rebuilds the image from main or a PR where the hash is the same (ex folder is unchanged, but retrigger docker build for some reason), the release would use that image instead of one built on the release branch
* spin wait for docker build to finish
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151706
Approved by: https://github.com/atalman
2025-04-22 21:50:10 +00:00
017a6bd593 add min/max_seqlen to non_differentiable (#151750)
Fixes #148988

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151750
Approved by: https://github.com/soulitzer
2025-04-22 21:46:02 +00:00
835413baed Revert "[Optimus][Observability] Improve tlparse logging (#151635)"
This reverts commit 06a3c3c8cdb2424d42d7926a49a18ee6852a40cb.

Reverted https://github.com/pytorch/pytorch/pull/151635 on behalf of https://github.com/clee2000 due to broke dynamo/test_structured_trace.py::StructuredTraceTest::test_ddp_graphs [GH job link](https://github.com/pytorch/pytorch/actions/runs/14600342064/job/40970324075) [HUD commit link](06a3c3c8cd), test did fail on PR but dr ci says it matches an existing failure, which it does, but also this PR breaks the test too ([comment](https://github.com/pytorch/pytorch/pull/151635#issuecomment-2822538113))
2025-04-22 21:39:23 +00:00
bc6c0bc344 Revert "Do not generate long log messaged for suppressed data dependent errors. (#151023)"
This reverts commit dfdf731579d7472a009f8edf35994b8701e79065.

Reverted https://github.com/pytorch/pytorch/pull/151023 on behalf of https://github.com/laithsakka due to breaking other PRs ([comment](https://github.com/pytorch/pytorch/pull/151023#issuecomment-2822483635))
2025-04-22 21:08:30 +00:00
459c62ee1d Revert "Do not log exception when recording is disabled or already recording (#151038)"
This reverts commit 73d95893a2b844ba8ee523e0e3915adf54017411.

Reverted https://github.com/pytorch/pytorch/pull/151038 on behalf of https://github.com/laithsakka due to breaking other PRs ([comment](https://github.com/pytorch/pytorch/pull/151023#issuecomment-2822483635))
2025-04-22 21:08:30 +00:00
aaf71a481b Revert "Log information about suppressed data dependent errors (#151041)"
This reverts commit ccd00359da3423ff7bae8ee682df10590fc844ce.

Reverted https://github.com/pytorch/pytorch/pull/151041 on behalf of https://github.com/laithsakka due to breaking other PRs ([comment](https://github.com/pytorch/pytorch/pull/151023#issuecomment-2822483635))
2025-04-22 21:08:30 +00:00
2f74cffab2 Remove reinterpret_casts with undefined behavior from stable/library.h (#151595)
There is a list of valid uses of `reinterpret_cast` (see https://en.cppreference.com/w/cpp/language/reinterpret_cast), and the use here was not on the list, hence undefined behavior. Implement what we meant using memcpy, which is well-defined.

Differential Revision: [D73200791](https://our.internmc.facebook.com/intern/diff/D73200791/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151595
Approved by: https://github.com/janeyx99
2025-04-22 20:24:47 +00:00
3380a46b44 Fix DTensorTestBase to barrier with device ids (#150896)
try to get rid of the below annoying warnings when running the unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150896
Approved by: https://github.com/fegin
2025-04-22 20:22:55 +00:00
a48ccf02f9 [Inductor] move alignment tests to a separate file (#151841)
This is pure code movement. test_torchinductor.py is already 15K lines of code. Move the alignment-related tests I added recently to a separate file. I need to add more tests of this kind.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151841
Approved by: https://github.com/jansel, https://github.com/eellison
2025-04-22 20:18:58 +00:00
596296fb0b [standalone_compile] Dynamic shape handling (#151788)
standalone_compile needs to get dynamic shape information from
somewhere. We add a new `dynamic_shapes` argument with three options:

1. from the passed-in graph (dynamic_shapes="from_graph"). This is the default.
2. from the example inputs, thereby specializing on them. (dynamic_shapes="from_example_inputs")
3. from the current tracing context (dynamic_shapes="from_tracing_context")

1 and 3 are not exactly the same. 2 can also be used for more advanced
things... (specialize on one input but not the other).

Most of this PR is tests.

Test Plan:
- a lot of new tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151788
Approved by: https://github.com/oulgen
2025-04-22 20:17:24 +00:00
7e4b89ac6c fix spammy library deinit errors when user passes an invalid TORCH_LOGS argument (#151678)
fixes https://github.com/pytorch/pytorch/issues/151055. Thanks @desertfire for the patch that fixed this.

I was a bit careful about the test - I wanted to make sure the test accurately ensures that we don't regress and our error message is not spammy when users enter an invalid `TORCH_LOGS=....` argument. But I tried to avoid using expecttests, since people occasionally add new logging artifacts and I didn't want to add too much churn by forcing this to fail CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151678
Approved by: https://github.com/desertfire, https://github.com/zou3519
2025-04-22 20:13:52 +00:00
0bb9b89fb7 Revert "[compile][compile time traces] Add more dynamo traces (#151357)"
This reverts commit 607443b16be705788ab06e9a31e4569e0f1516c3.

Reverted https://github.com/pytorch/pytorch/pull/151357 on behalf of https://github.com/wdvr due to stack in a weird state - reverting for now ([comment](https://github.com/pytorch/pytorch/pull/151357#issuecomment-2822369232))
2025-04-22 20:12:44 +00:00
d0d4e992f1 [associative_scan] Fixes for assoc_scan testcases (#149988)
This PR fixes some issues with the testcases of `associative_scan`, in particular the problem where the compile_mode is inadvertently always set to `none`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149988
Approved by: https://github.com/ydwu4
2025-04-22 20:09:12 +00:00
8ca7953d51 [cutlass backend] delay construction of cutlass presets to when called (#151875)
In hindsight, always constructing the dict is a bit silly. We should only construct it when we need it.
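
A minimal sketch of deferring construction until first use, assuming a hypothetical preset table (the names `get_presets`, `lookup`, and the preset contents are illustrative, not the cutlass backend's actual data):

```python
import functools

build_log = []  # records each time the table is actually built


@functools.cache
def get_presets() -> dict:
    """Build the preset table on first call; later calls reuse the cached dict."""
    build_log.append("built")
    return {"0": ["config_a"], "1": ["config_a", "config_b"]}


def lookup(preset: str) -> list:
    """Return the configs for a preset, or an empty list if it is unknown."""
    return get_presets().get(preset, [])
```

Importing the module costs nothing; the dict is built only when `lookup` is first called, and only once.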

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151875
Approved by: https://github.com/yangw-dev
2025-04-22 20:03:10 +00:00
6cd1741985 [ONNX] Update decomposition logic to loop over onnx registry (#151826)
Fixes #150367

This PR builds the decomposition table from the ONNX registry, which includes all registered ops, not only ATen and prim ops. This helps keep the custom ops specified in the custom_translation table from being decomposed during ONNX export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151826
Approved by: https://github.com/justinchuby
2025-04-22 19:40:52 +00:00
69ee6a9280 [Sana][HybridCache] Fix bug in detect_attr_assignment (#151824)
Summary: tree_flatten_with_map will internally call the unflatten function with a user-supplied function. But this function was not returning anything, causing the leaves to be None. This is wrong when the constructor is sensitive to this behaviour
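
The failure mode can be reproduced with a generic sketch. The `tree_map` helper below is a hypothetical stand-in, not the torch pytree API; it only shows what happens when a per-leaf callback forgets to return its leaf.

```python
def tree_map(fn, tree):
    """Apply fn to every leaf of a nested dict/list/tuple and rebuild the container."""
    if isinstance(tree, dict):
        return {key: tree_map(fn, value) for key, value in tree.items()}
    if isinstance(tree, (list, tuple)):
        return type(tree)(tree_map(fn, value) for value in tree)
    return fn(tree)


seen = []


def record_buggy(leaf):
    seen.append(leaf)  # side effect only; implicitly returns None


def record_fixed(leaf):
    seen.append(leaf)
    return leaf  # hand the leaf back so the rebuilt tree stays intact


tree = {"a": [1, 2], "b": (3,)}
buggy = tree_map(record_buggy, tree)
fixed = tree_map(record_fixed, tree)
```

Here `buggy` is rebuilt with `None` in place of every leaf, while `fixed` equals the original tree.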

Test Plan: CI

Differential Revision: D73388529

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151824
Approved by: https://github.com/bdhirsh
2025-04-22 19:39:50 +00:00
337caacd4c Use more efficient mask to index computation (#151372)
This change addresses the third time/mem "spike" observed in

https://github.com/pytorch/pytorch/issues/151351

The change seems to perform better (time/mem) for both very sparse and very dense cases. It runs faster and uses less memory, as observed on both CPU and GPU. It even avoids OOM for larger cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151372
Approved by: https://github.com/eqy
2025-04-22 19:31:12 +00:00
fbd29527d8 [MPS] Move ops modifiers to testing utils so other tests can reuse (#151781)
Test collection check:
```
python -m pytest test/test_mps.py --collect-only
```
Before:
```
6390 tests collected in 8.34s
```

After:
```
6390 tests collected in 7.71s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151781
Approved by: https://github.com/malfet
2025-04-22 19:19:52 +00:00
982062dfc4 Cache the value of torch_key in subproc (#151057)
No need to recalculate torch_key in subprocs; let's pass it from the main process.
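
A minimal sketch of the pattern, simulated in a single process. The names `expensive_key`, `get_key`, and `set_precomputed_key` are hypothetical, and the hashed bytes are a placeholder for whatever the real key covers.

```python
import hashlib

_state = {"key": None}  # per-process cache
compute_log = []        # records each real computation


def expensive_key() -> bytes:
    """Stand-in for an expensive computation such as hashing many source files."""
    compute_log.append("computed")
    return hashlib.sha256(b"contents of many files").digest()


def set_precomputed_key(key: bytes) -> None:
    """Hook a worker would call at startup with the key sent by the parent."""
    _state["key"] = key


def get_key() -> bytes:
    """Return the cached key, computing it only if nothing was provided."""
    if _state["key"] is None:
        _state["key"] = expensive_key()
    return _state["key"]


# Parent process: compute once.
parent_key = get_key()

# Worker (simulated here): starts with an empty cache, then receives the key.
_state["key"] = None
set_precomputed_key(parent_key)
worker_key = get_key()
```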

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151057
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
2025-04-22 18:54:06 +00:00
fa0f13b90b Fix doc requirements install error (#151787)
Fixes #151786

Make the version in the docs requirements consistent with the version in the [CI version file](https://github.com/pytorch/pytorch/blob/main/.ci/docker/requirements-docs.txt), which changed in #149331

### Test Result

![image](https://github.com/user-attachments/assets/f8646c03-116f-4f1c-b017-11b70995626b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151787
Approved by: https://github.com/malfet
2025-04-22 18:33:44 +00:00
4bf09562e4 [EZ/Profiler] Update Submodule (#151843)
Summary: Update to d82680bbd4

Test Plan: CI

Differential Revision: D73397323

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151843
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
2025-04-22 18:19:43 +00:00
834a017fe3 Optimize register_full_backward_hook description when all input no grad (#151785)
Fixes #100528

## Test Result

### Before

![image](https://github.com/user-attachments/assets/5dd2e1d3-3bb1-49d0-84bf-8a7a6b18fa4b)

### After

![image](https://github.com/user-attachments/assets/2e16d17b-1586-40d8-b0ef-35559fc064f4)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151785
Approved by: https://github.com/soulitzer
2025-04-22 17:57:31 +00:00
2c27597d6a Infra for handling builtin ops (min, max, math.pow) (#151348)
Reapply of https://github.com/pytorch/pytorch/pull/150003

Differential Revision: [D73050801](https://our.internmc.facebook.com/intern/diff/D73050801/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151348
Approved by: https://github.com/zhxchen17
ghstack dependencies: #151347
2025-04-22 17:20:09 +00:00
264e8fb151 More fix for aot_export_module name collision during unlifting (#151684)
Summary: Also check the module's named buffers and parameters when resolving name collision

Test Plan:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r aoti_constant_tensor_name_collision
```

Differential Revision: D73264885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151684
Approved by: https://github.com/angelayi
2025-04-22 16:59:33 +00:00
06a3c3c8cd [Optimus][Observability] Improve tlparse logging (#151635)
Summary: We improve tlparse logging for Optimus graph transformation to enable easier debugging

Test Plan:
```
TORCH_TRACE=~/my_trace_log_dir CUDA_VISIBLE_DEVICES=5 buck2 run mode/opt //aps_models/ads/ecosystem/tooling/tools/efficient_module_suite/pyper_models:pyper_model_perf_benchmark -- --flow_id 720055919 --shrink_model --mfu_profile_module "impl.shared_arch.dense_sparse_interaction" --use_synthetic_data
```

Differential Revision: D73229681

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151635
Approved by: https://github.com/Yuzhen11
2025-04-22 16:56:08 +00:00
5fc1eb85fc Add OIDC permissions to bazel workflow (#151456)
Update workflow to use OIDC authentication to access AWS resources rather than assuming the runner's default role. This is part of the multicloud effort to prepare jobs to support being run in non-AWS clouds.

The JWT ID token requires `id-token: write` in order to create the token for the job. See: https://docs.github.com/en/actions/security-for-github-actions/security-hardening-your-deployments/configuring-openid-connect-in-cloud-providers#adding-permissions-settings

Ref: pytorch-fdn/multicloud-ci-infra#3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151456
Approved by: https://github.com/malfet
2025-04-22 16:54:14 +00:00
5d316ce0d0 Add device check for inputs (#151828)
Summary: Generate device checks for inputs in AOTI. Enable with AOTI_RUNTIME_CHECK_INPUTS=1

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r test_runtime_checks_device_type_failed
```

Differential Revision: D73382824

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151828
Approved by: https://github.com/angelayi
2025-04-22 16:36:27 +00:00
3804aed32e Revert "[Inductor] Add Additional Configs for persistent+TMA version of Triton mm and addmm (#150587)"
This reverts commit 99aeee2c5f07f7fe6ec3f34aacb7db71569a60c5.

Reverted https://github.com/pytorch/pytorch/pull/150587 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally (see D73410693). To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/150587#issuecomment-2821828926))
2025-04-22 16:15:55 +00:00
4504910843 Revert "[ez] Make relaxed constraint error message more user friendly (#151407)"
This reverts commit e0f05229e9ff84aa6138df2bd51f5044bc743afb.

Reverted https://github.com/pytorch/pytorch/pull/151407 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally (see D73198095). To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts. ([comment](https://github.com/pytorch/pytorch/pull/151407#issuecomment-2821819654))
2025-04-22 16:12:42 +00:00
f072bf27a7 Revert "faster gather implementation (#151490)"
This reverts commit 541f8cd34cbccfcaf04a377f747390f83658d6ec.

Reverted https://github.com/pytorch/pytorch/pull/151490 on behalf of https://github.com/malfet due to Looks like it breaks demucs accuracy, though may be bogus, but let's try to revert, see c729f7dbee/3 ([comment](https://github.com/pytorch/pytorch/pull/151490#issuecomment-2821803788))
2025-04-22 16:09:14 +00:00
ed0d2ebaa0 Revert "Non-deterministic alert in histc_cuda for floating types only (#151701)"
This reverts commit b7a7741411585817daa81780b078fd15816f2d2d.

Reverted https://github.com/pytorch/pytorch/pull/151701 on behalf of https://github.com/ZainRizvi due to Sorry but this is causing inductor tests to fail. See here for more info: test_torch.py::TestTorchDeviceTypeCUDA::test_nondeterministic_alert_histc_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/14586002763/job/40913547718) [HUD commit link](b7a7741411) ([comment](https://github.com/pytorch/pytorch/pull/151701#issuecomment-2821800837))
2025-04-22 16:07:25 +00:00
c729f7dbee [provenance_tracking][reland] Fix UT error and re-land ExternKernel support (#151709)
Summary:
ATT.

reverted previous diff :  D72572050

Test Plan:
```
 TORCH_LOGS="+inductor, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_triton_kernel_to_post_grad_tracing_extern_kernel
```

Differential Revision: D73281217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151709
Approved by: https://github.com/jingsh
2025-04-22 15:44:56 +00:00
d778c92e16 [Metal][BE] Move atomic ops to c10/metal/atomic.h (#151868)
To be reused from indexing and the MPSInductor implementation of atomic_add stores.
Added a wrapper for `metal::atomic<int>` (to be used by a follow-up PR).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151868
Approved by: https://github.com/Skylion007
2025-04-22 14:11:29 +00:00
159e2f96e3 [dynamo][ci] Fix recently broken test (#151877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151877
Approved by: https://github.com/masnesral, https://github.com/jansel
2025-04-22 06:42:03 +00:00
3aeeb77a3a [Dynamo][Easy] Remove unreachable code (#151739)
This line is unreachable:

f6c1cf04b5/torch/_dynamo/output_graph.py (L275)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151739
Approved by: https://github.com/Skylion007
2025-04-22 06:27:00 +00:00
ccd00359da Log information about suppressed data dependent errors (#151041)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151041
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #151023, #151038
2025-04-22 06:07:57 +00:00
73d95893a2 Do not log exception when recording is disabled or already recording (#151038)
I am not sure why we log all exceptions here and re-raise them, but at least when recording is disabled this should be
transparent. In particular, logging data-dependent errors (DDEs) could be spammy.

before:
<img width="995" alt="Screenshot 2025-04-10 at 12 47 31 PM" src="https://github.com/user-attachments/assets/f90d4557-d958-4558-a917-0d687366cad1" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151038
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #151023
2025-04-22 06:07:57 +00:00
dfdf731579 Do not generate long log messaged for suppressed data dependent errors. (#151023)
TORCH_LOGS="all" python test/test_dynamic_shapes.py -k test_guard_or_true

 before:
<img width="1065" alt="Screenshot 2025-04-10 at 9 55 27 AM" src="https://github.com/user-attachments/assets/3ee20de0-2902-4eb1-8ab0-80f1b974fb78" />

after:
<img width="1124" alt="Screenshot 2025-04-10 at 9 54 35 AM" src="https://github.com/user-attachments/assets/4e7e1f0c-856c-417f-8763-bfe183e2450d" />

Note: we actually do not expect to see a log at all. This is an orthogonal issue in recording, where each error seen is logged
even when recording is not enabled. I will follow up with a PR for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151023
Approved by: https://github.com/bobrenjc93
2025-04-22 06:07:57 +00:00
a09a3f4c30 [Hierarchical compile] Ensure output nodes are sorted last (#151295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151295
Approved by: https://github.com/anijain2305
ghstack dependencies: #151293, #151294
2025-04-22 05:13:07 +00:00
283884b224 [Hierarchical Compile] Handle autocast ctx manager (#151294)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151294
Approved by: https://github.com/anijain2305
ghstack dependencies: #151293
2025-04-22 05:13:07 +00:00
4a643af992 [Hierarchical Compile] Fix small bug (#151293)
This technically would never be exposed because we never check that a node is an ancestor of itself, but it is good for it to be correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151293
Approved by: https://github.com/anijain2305
2025-04-22 05:13:07 +00:00
e76c0b159a Revert "[dynamic shapes] guard_or_false for _reshape_view_helper, utils._infer_size for wildcard dims (#150127)"
This reverts commit a02eae8142ddd8fbf068a3e17fc0dd276d92fc78.

Reverted https://github.com/pytorch/pytorch/pull/150127 on behalf of https://github.com/malfet due to Caused TestDynamoTimed.test_dynamo_timed to fail on macOS, see https://github.com/pytorch/pytorch/actions/runs/14584536979/job/40908019050 ([comment](https://github.com/pytorch/pytorch/pull/150127#issuecomment-2820081721))
2025-04-22 05:05:50 +00:00
0ff302e8e0 Revert "reroute index to fast implementation for indexing on 0th dimension (#151753)"
This reverts commit 4d78e19365c4e2189693c7a81b665d4ec2d2cf53.

Reverted https://github.com/pytorch/pytorch/pull/151753 on behalf of https://github.com/malfet due to Looks like it breaks bunch of distributed tests with DSA, see 4d78e19365 ([comment](https://github.com/pytorch/pytorch/pull/151753#issuecomment-2820078298))
2025-04-22 05:03:03 +00:00
95abc0f515 [c10d][fr] Fix another bug when we should continue when the op list is empty (#151798)
Differential Revision: D73375318

We shouldn't check the op list when it is empty. Later, once the empty op list has been popped from the queue, we check for collective matching. Added a unit test for this case, which also covers the case fixed in https://github.com/pytorch/pytorch/pull/151683.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151798
Approved by: https://github.com/d4l3k, https://github.com/wconstab, https://github.com/fegin
2025-04-22 04:43:31 +00:00
6f327128a9 [MKLDNN] Check that strides are positive (#151848)
For pooling ops. Prevents division by zero when the stride argument is invalid.
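
As a rough illustration of why a non-positive stride is dangerous: the usual pooling output-size formula divides by the stride. The sketch below is plain Python, not the MKLDNN code; the function name `pool_output_size` and the error text are made up for illustration.

```python
def pool_output_size(in_size: int, kernel: int, stride: int,
                     padding: int = 0, dilation: int = 1) -> int:
    """Standard floor-mode pooling output size along one dimension."""
    # Validate up front: a zero stride would otherwise divide by zero below.
    if stride <= 0:
        raise ValueError(f"stride should be greater than zero, but got {stride}")
    effective_kernel = dilation * (kernel - 1) + 1
    return (in_size + 2 * padding - effective_kernel) // stride + 1
```

Without the check, `stride=0` would surface as a bare `ZeroDivisionError` (or, in native code, undefined behavior) instead of a readable argument error.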

Fixes https://github.com/pytorch/pytorch/issues/149274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151848
Approved by: https://github.com/atalman
2025-04-22 04:25:47 +00:00
29811f68d2 [Inductor][FlexAttention] fix vars_and_sizes divisor error (#151634)
Triton codegen currently [sorts vars by divisor](ae6f6b8efb/torch/_inductor/codegen/simd.py (L233-L237)). When there are two vars with the same divisor, the order is undecided.

```python
nodes.sort(
   key=lambda x: V.graph.sizevars.size_hint(
       x.divisor, fallback=config.unbacked_symint_fallback
   )
)
```

The test case leads to the following nodes:
```
(Pdb) nodes[0]
IterationRangesEntry(x1, ((s37 + 127)//128), 2, (xindex//ps0), {x0: ((s37 + 127)//128), x1: 2, x2: ((s12 + 127)//128), x4: 2*(((s12 + 127)//128))*(((s37 + 127)//128)), x5: 0, x6: 2, x7: (((s12 + 127)//128))*(((s37 + 127)//128))})

(Pdb) nodes[1]
IterationRangesEntry(x0, 1, ((s37 + 127)//128), ModularIndexing(xindex, 1, ps0), {x0: ((s37 + 127)//128), x1: 2, x2: ((s12 + 127)//128), x4: 2*(((s12 + 127)//128))*(((s37 + 127)//128)), x5: 0, x6: 2, x7: (((s12 + 127)//128))*(((s37 + 127)//128))})

(Pdb) nodes[2]
IterationRangesEntry(x2, 2*(((s37 + 127)//128)), ((s12 + 127)//128), (xindex//(2*(((s37 + 127)//128)))), {x0: ((s37 + 127)//128), x1: 2, x2: ((s12 + 127)//128), x4: 2*(((s12 + 127)//128))*(((s37 + 127)//128)), x5: 0, x6: 2, x7: (((s12 + 127)//128))*(((s37 + 127)//128))})

(Pdb) V.graph.sizevars.statically_known_equals(nodes[0].length, 2)
True
(Pdb) V.graph.sizevars.statically_known_equals(nodes[1].length, 1)
True
(Pdb) V.graph.sizevars.statically_known_equals(nodes[2].length, 1)
True

(Pdb) V.graph.sizevars.statically_known_equals(nodes[0].divisor, 1)
True
(Pdb) V.graph.sizevars.statically_known_equals(nodes[1].divisor, 1)
True
(Pdb) V.graph.sizevars.statically_known_equals(nodes[2].divisor, 2)
True
```

Since x1 and x0 both have divisor 1, the relative order is random across runs.
In some runs, we have order [x1, x0, x2] with divisors as [1,1,2] and lengths as [2,1,1]. After x1, we have [divisor = divisor * node.length](ae6f6b8efb/torch/_inductor/codegen/simd.py (L246)) = 1 * 2 = 2. Then, when processing x0, we have node.divisor=1, divisor=2, and [FloorDiv(node.divisor, divisor)](ae6f6b8efb/torch/_inductor/codegen/simd.py (L251)) = 0, which indicates an iteration length of 0 and leads to errors later.

The fix is to sort by both divisor and length_is_one. So for two nodes with the same divisor, we process the node with length=1 first.
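
The effect of the tie-break can be sketched with plain integers. This is a simplified model, not the code in `simd.py`: `Node`, `filler_lengths`, `old_key`, and `new_key` are hypothetical names, and the loop only approximates the divisor bookkeeping described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    divisor: int
    length: int


def filler_lengths(nodes):
    """Walk nodes in order and collect the filler lengths that a simplified
    version of the codegen loop would compute between them."""
    divisor = 1
    fillers = []
    for node in nodes:
        if node.divisor != divisor:
            # Simplified stand-in for FloorDiv(node.divisor, divisor).
            fillers.append(node.divisor // divisor)
            divisor = node.divisor
        divisor = divisor * node.length
    return fillers


def old_key(node):
    # Divisor only: ties between x0 and x1 are left in arbitrary order.
    return node.divisor


def new_key(node):
    # Among equal divisors, length == 1 sorts first (False < True).
    return (node.divisor, node.length != 1)


# One of the problematic input orders. Python's sort is stable, so the
# divisor-only key keeps x1 ahead of x0.
nodes = [Node("x1", 1, 2), Node("x0", 1, 1), Node("x2", 2, 1)]
old_order = sorted(nodes, key=old_key)
new_order = sorted(nodes, key=new_key)
```

In this model the old order produces a filler of length 0 when x0 is reached, while the new order needs no fillers at all.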

Fixes #149789

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151634
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-04-22 04:24:56 +00:00
529f698ad4 [logging] Put "everything" WaitCounters in dynamo_timed (#151757)
Summary: The main motivation is to capture the cudagraphs overhead in a WaitCounter. We'll combine that with Triton autotuning, and therefore rename to "compile_runtime_overheads". Since we have a couple WaitCounters where we want to capture all runtime and compile overheads, let's put the accounting in dynamo_timed so we'll automatically capture any toplevel timed regions that get added in the future. Also, dynamo_timed already has to figure out if we're timing a runtime vs. compile-time event, so we can reuse some of that logic.

Test Plan:
Ran an internal model with `TORCHINDUCTOR_BENCHMARK_FUSION=1` (to get benchmarking at compile time in addition to runtime).

Overall compile time from various sources matches up:
* tlparse: https://fburl.com/9fgsstkr. Eyeballing, total time should be 32 ranks x 2175 = ~69.6k s
* ods: https://fburl.com/canvas/r4clhnb7. Right on.
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/ax71aqox. Right on.
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/shcjd9ql. Right on.

And the runtime overhead:
* ods: https://fburl.com/canvas/nvgjb282
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/f2dtv0qh

If we compare that to a run of the same model without the changes in this stack, results can mismatch by a lot:
* tlparse: https://fburl.com/cchxwd1s. Eyeballing, total time should be 32 ranks x 2300s = ~73.5k s
* ods: https://fburl.com/canvas/x1i3wvf4. It's kinda close
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/l7sgxdxd. Waaay too high.
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/jb4s9z1u. This is the only one that's actually correct.

The discrepancy is even worse if we focus on the runtime events:
* ods: https://fburl.com/canvas/a4o9f7ou
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/95izaes1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151757
Approved by: https://github.com/ppanchalia
ghstack dependencies: #151749
2025-04-22 03:29:13 +00:00
edba20b853 [logging] Fix duration logging for dynamo_compile (#151749)
Summary: There are a few issues I'm solving:
1. It's too hard to measure total pt2 overhead using the dynamo_compile table because users need to know the columns representing all the top-level events (dynamo_cumulative_compile_time_us, etc.). Instead, let's populate the existing duration_us field for all top-level events. The complication is that runtime events in particular (Triton autotuning, cudagraphify) can be collapsed into a single row, with gaps in between, so we can't simply use `end_time - start_time` in all cases. Instead, we'll sum durations for all outer events when updating the compile-time or runtime metrics context. Introduce a 'depth' counter in TLS to track the nesting of CompilationMetrics events.
2. The existing implementation relies on callers of dynamo_timed to specify whether the event is a runtime or compile-time event. That doesn't work because some methods can be called in both situations, e.g., `CachingAutotuner.benchmark_all_configs`. For example, `TORCHINDUCTOR_BENCHMARK_FUSION=1` enables benchmarking during compile-time. Instead, we can figure out automatically whether we're measuring a compile-time or runtime event and log accordingly.
3. If `log_compilation_events` were to throw an exception, we'd fail to clear the aggregated counters for runtime logs and they could be attributed to the wrong compile ID. I didn't actually find evidence of this in practice, but I added exception handling for extra safety.

Test Plan:
Ran internal models and compared dynamo_compile to pt2_compile_events:
`TORCHINDUCTOR_BENCHMARK_FUSION=0`
* tlparse: https://fburl.com/itciwnxc
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/yvkif5vb
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/segijet7

`TORCHINDUCTOR_BENCHMARK_FUSION=1`
* tlparse: https://fburl.com/jgurcvkw
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/uum91ceb
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/x4xnisez

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151749
Approved by: https://github.com/Skylion007
2025-04-22 03:29:13 +00:00
b7a7741411 Non-deterministic alert in histc_cuda for floating types only (#151701)
The note about atomic add only applies for floating point. The
implementation is deterministic for integer data types.
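
A small plain-Python illustration of why accumulation order matters for floating point but not for integers (this only demonstrates non-associativity; it is not the CUDA kernel, and `sums_over_all_orders` is a name made up here):

```python
from itertools import permutations


def sums_over_all_orders(values):
    """Return the set of results obtained by summing in every possible order."""
    results = set()
    for order in permutations(values):
        total = order[0]
        for v in order[1:]:
            total = total + v
        results.add(total)
    return results


float_results = sums_over_all_orders([0.1, 0.2, 0.3])
int_results = sums_over_all_orders([1, 2, 3])
```

Atomic adds from different threads land in an unspecified order, so a floating-point histogram can differ between runs, while an integer one cannot.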

fixes: #151610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151701
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2025-04-22 03:24:36 +00:00
14e3ffb1ff Deprecate host allocator legacy APIs (#151437)
# Motivation
This PR aims to deprecate the legacy host allocator APIs and recommend that users switch to the unified `getHostAllocator(device_type)` API, for example:
```cpp
at::getHostAllocator(device_type)->allocate(...);
at::getHostAllocator(device_type)->empty_cache();
at::getHostAllocator(device_type)->record_event(...);
at::getHostAllocator(device_type)->get_stats();
at::getHostAllocator(device_type)->reset_accumulated_stats();
at::getHostAllocator(device_type)->reset_peak_stats();
```

# Additional Context
TODO:
- [ ] Move is_pinned from `AcceleratorHookInterface` to `HostAllocator`
- [ ] Deprecate `getPinnedMemoryAllocator` inside `AcceleratorHookInterface` and recommend using `getHostAllocator` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151437
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #151403, #151431
2025-04-22 03:13:24 +00:00
a4fdae5c84 Lift guard checking logic to AOTAutogradCache (#151563)
This somewhat complicated PR does a few things:
- It separates out a lot of the guard checking logic into its own class, GuardedCache[T]
- It adds a new `check_guard_hit` lambda to FXGraphCache._lookup_graph, which allows callers to define their own guard checking logic
- It then uses these two combined parts to lift guard checking to AOTAutogradCache. This means that AOTAutogradCache stores its own guard expressions and evaluates them.
- FXGraphCache's guard checking logic is completely unchanged, just refactored. As part of the work, I'm able to extend a bit of the logging functionality of AOTAutogradCache into FXGraphCache, so that you can know if FXGraphCache missed due to a guard failure or a full cache miss.

# Why do this?
Lifting guards to AOTAutogradCache has a few benefits:
- First, it fixes a long standing bug in guard checking logic. Backward passes can have different symint inputs than forward passes depending on forward output, if AOTAutograd chooses to store symints for the backward. These symint inputs have the same underlying symbols as the forward, but on AOTAutogradCache hit, we don't have access to the hints backing these exact symints (we only have hints for the symints on the forward function). By lifting guard checking logic to AOTAutogradCache, we no longer need to check the backward guards, as they'll be included in the AOTAutogradCache guard expression. **I've added a unit test that failed before my diff, and now passes, as an example of this**
- Secondly, this is the first step necessary to bundle CompiledFxGraph into AOTAutogradCache. Doing so will simplify our cache logic significantly, and also make precompile logic simpler, as precompiles will only need to store AOTAutogradCacheEntrys, without needing to match them up with inductor FXGraphCache entries.
- Finally, adding guard checking logic to AOTAutogradCache may allow us in the future to handle more complicated cases like a single forward with multiple backwards, as guard checks are now storable on the cache entry itself.

# Guard checking logic of AOTAutogradCache
When AOTAutogradCache evaluates guard expressions, it no longer needs to evaluate the forward/backward guards in the FXGraphCacheEntry (since the AOTAutogradCache guard expressions will encompass them). Because of this, we still need a way for AOTAutogradCache to distinguish between multiple FXGraphCache local entries. To do so, AOTAutogradCache stores the guard string from FXGraphCache, which it uses as a second "cache key". It doesn't need to **evaluate** these guards, it just needs to find the cache entry from FXGraphCache that had the same guards as when it was stored.
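
A toy sketch of the pluggable guard check described above, assuming a much simpler cache than the real one: `ToyGuardedCache`, `evaluates_true`, and `matches_exactly` are hypothetical names, and guards are modeled as plain Python expression strings over integer hints.

```python
from typing import Callable, Dict, Generic, List, Optional, Tuple, TypeVar

T = TypeVar("T")


class ToyGuardedCache(Generic[T]):
    """Several entries may share one key; the caller decides which guard hits."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[Tuple[str, T]]] = {}

    def store(self, key: str, guard_expr: str, value: T) -> None:
        self._entries.setdefault(key, []).append((guard_expr, value))

    def lookup(self, key: str,
               check_guard_hit: Callable[[str], bool]) -> Optional[T]:
        for guard_expr, value in self._entries.get(key, []):
            if check_guard_hit(guard_expr):
                return value
        return None


def evaluates_true(hints: Dict[str, int]) -> Callable[[str], bool]:
    # Toy evaluation of a guard string against integer hints.
    return lambda guard_expr: bool(
        eval(guard_expr, {"__builtins__": {}}, dict(hints))
    )


def matches_exactly(stored_guard: str) -> Callable[[str], bool]:
    # Treat the guard string as a second cache key: compare, never evaluate.
    return lambda guard_expr: guard_expr == stored_guard


cache: ToyGuardedCache[str] = ToyGuardedCache()
cache.store("graph_key", "s0 <= 4", "small_graph")
cache.store("graph_key", "s0 > 4", "large_graph")
```

In this model, an inner cache called on its own would use something like `evaluates_true`, while an outer cache that has already evaluated its own guards would pass `matches_exactly` with the guard string it stored.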

After this, I will work on putting the FXGraphCache entries directly into AOTAutogradCache. If I can put CompiledFxGraphs in the cache directly, I no longer need this complicated `check_guard_hit` overriding logic.

## Test Plan
Added a new unit test. There are comprehensive guard checking unit tests in `test_aot_autograd_cache` already, and those pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151563
Approved by: https://github.com/oulgen
2025-04-22 03:01:08 +00:00
40cf49d460 Revert "[Intel GPU] Allow XPU backend in Depthwise_conv2d&3d operators (#149114)"
This reverts commit 08831f30bbe745cd9f0c07d1868583a68f613514.

Reverted https://github.com/pytorch/pytorch/pull/149114 on behalf of https://github.com/guangyey due to CI is broken ([comment](https://github.com/pytorch/pytorch/pull/149114#issuecomment-2819890341))
2025-04-22 02:22:42 +00:00
a02eae8142 [dynamic shapes] guard_or_false for _reshape_view_helper, utils._infer_size for wildcard dims (#150127)
For reshape/view: removes fast paths for 0 elements, checking dimensions to skip. Modifies the loop accumulating input elements, to raise a UserError if we run out of dimensions, graph breaking for compile and erroring out for export.
For infer_size: assumes that if the user passes an unbacked symbol, it is probably not -1.
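
For context, wildcard inference with concrete integers looks roughly like the sketch below. It is a plain-Python approximation, not `utils._infer_size`; `toy_infer_size` is a hypothetical name, and the unbacked-symbol case (where `d == -1` cannot be decided and is assumed false) is not modeled.

```python
from math import prod
from typing import Sequence, Tuple


def toy_infer_size(shape: Sequence[int], numel: int) -> Tuple[int, ...]:
    """Replace a single -1 in shape so that the product equals numel."""
    wildcard = [i for i, d in enumerate(shape) if d == -1]
    if len(wildcard) > 1:
        raise ValueError("only one dimension can be inferred")
    known = prod(d for d in shape if d != -1)
    if not wildcard:
        if known != numel:
            raise ValueError(f"shape {tuple(shape)} is invalid for {numel} elements")
        return tuple(shape)
    if known == 0 or numel % known != 0:
        raise ValueError(f"shape {tuple(shape)} is invalid for {numel} elements")
    out = list(shape)
    out[wildcard[0]] = numel // known
    return tuple(out)
```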

Will think about changes in https://docs.google.com/document/d/1WYx6EZwVDXtBnWyrzoecgGWdiK0V3XZKftfpWwQ5i3E/edit?tab=t.0#heading=h.22k54zym11qp in a later PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150127
Approved by: https://github.com/laithsakka
2025-04-22 01:14:15 +00:00
80a3877b3d [easy] Fix test_dynamo_timed (#151816)
Summary: The structured logging counter is a global that might have been affected by earlier tests. Clear it explicitly.
Fixes #148093

Test Plan: `pytest test/dynamo/test_utils.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151816
Approved by: https://github.com/ppanchalia
2025-04-22 00:12:31 +00:00
b3b1616560 Add explict type info in the try-catch for dynamo logging (#151733)
Differential Revision: D73295871

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151733
Approved by: https://github.com/hl475
2025-04-21 23:29:10 +00:00
a35e73b91f [c10] add #pragma once to leftright (#151710)
Summary: I am getting duplicate definitions when including this header in a binary that already includes the dispatcher.

Test Plan: CI

Differential Revision: D73237748

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151710
Approved by: https://github.com/georgiaphillips
2025-04-21 23:18:49 +00:00
99aeee2c5f [Inductor] Add Additional Configs for persistent+TMA version of Triton mm and addmm (#150587)
Summary:
This PR introduces additional autotuning configurations for the persistent+TMA version of Triton `mm` and `addmm` operations. The new configurations are as follows:
* `(128, 128, 64, 5, 8)`
* `(256, 128, 64, 4, 8)`
* `(128, 128, 64, 5, 4)`

These configurations were selected based on exhaustive autotuning performed on commonly used shapes from an internal foundational model.

While these new configs are generally more performant across the board, we see notable gains in a few specific cases:
* In scenarios where `n >> m, k`, the configurations `(128, 128, 64, 5, 8)` and `(256, 128, 64, 4, 8)` tend to produce an additional 5-10% speedup over the aten baseline compared to the original configurations.
* Similarly, the configuration `(128, 128, 64, 5, 4)` yields approximately an 8% improvement in scenarios where k >> m, n.

These enhancements are expected to provide performance benefits across diverse use cases, particularly when compared to the original set of configurations.

Test Plan:
contbuild & OSS CI

Reviewers: paulzhan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150587
Approved by: https://github.com/PaulZhang12, https://github.com/drisspg, https://github.com/eellison
2025-04-21 23:18:33 +00:00
4d78e19365 reroute index to fast implementation for indexing on 0th dimension (#151753)
Per title, improve x[index] cuda perf for the common case of indexing along the first dim, using vectorized gather kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151753
Approved by: https://github.com/eqy
2025-04-21 23:15:30 +00:00
01f1cc44cb Rename register_fake_profile to unsafe_generate_fake_kernels (#151797)
Fixes https://docs.google.com/document/d/1BZsuUR1zJ-52Y7wP4yWX8beB4dwYbgdu5o1qKam_iWg/edit?disco=AAABiJdX1XU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151797
Approved by: https://github.com/zou3519
2025-04-21 23:08:15 +00:00
efdcc981d0 Back out "Do not propagate real tensor in extern kernel" (#151813)
Summary:
D73002775 breaks aot_compile for many draft exported models on PT2I dashboard. Revert.

Example error msg:

```
OrderedSet([]) >= OrderedSet([u1185, u1186, u1187]) (inductor >= fx)
fx node is: %embedding_bag_byte_prepack : [num_users=4] = call_function[target=torch.ops.quantized.embedding_bag_byte_prepack.default](args = (%view_10,), kwargs = {})
new operations are:
```

Differential Revision: D73381032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151813
Approved by: https://github.com/angelayi, https://github.com/zou3519
2025-04-21 22:54:03 +00:00
79a9447f0e FlexAttention add decorator for large test cases (#151459)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151459
Approved by: https://github.com/Skylion007
2025-04-21 22:53:13 +00:00
6ea2e6a2d2 Do not do proper const fold during tensorify_python_scalars (#151494)
After chatting with Bob: the goal of this pass is to const fold the floats that were tensorified by calling
guard_scalar(val) on them and then replacing their usages with their values.
Hence we do not need to do this for nodes with no float symbols.

We do not want to do proper const folding because we need to preserve statements that deferred
runtime asserts depend on (see the added test).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151494
Approved by: https://github.com/bobrenjc93
2025-04-21 22:39:50 +00:00
cd1317f92f [export] suggest dynamic re-export in input constraints hook (#151624)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151624
Approved by: https://github.com/angelayi
2025-04-21 22:29:46 +00:00
c312d8c501 [Dynamo] Clean up old torch function flag (#149711)
This is tracked via `SymbolicTorchFunctionState` now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149711
Approved by: https://github.com/StrongerXi, https://github.com/anijain2305
2025-04-21 21:33:58 +00:00
25a11850e9 [symmem] Add some code comments to rendezvous code (#151716)
While reading and learning the rendezvous code, I just want to add some comments to explain the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151716
Approved by: https://github.com/kwen2501
2025-04-21 20:45:39 +00:00
352019bf9e [BE]: Better cleanup optimized code from #151474 (#151794)
This change addresses the first/second time/mem "spike" observed. It improves on #151474 by removing unnecessary stride calculations and unused arguments to the helper function.

https://github.com/pytorch/pytorch/issues/151351

Fixes https://github.com/pytorch/pytorch/issues/151351
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151794
Approved by: https://github.com/albanD, https://github.com/eqy
2025-04-21 20:32:11 +00:00
1f0d764b65 stage 2 of depreate silent fallback of tuning gemm (#148622)
context: https://github.com/pytorch/pytorch/issues/147479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148622
Approved by: https://github.com/eellison
ghstack dependencies: #151506
2025-04-21 20:14:34 +00:00
02cecd1018 [inductor][test] Skip triton tests for MPS as well, also change reason for skipping SM89 to not IS_BIG_GPU (#151506)
Differential Revision:
[D73162091](https://our.internmc.facebook.com/intern/diff/D73162091/)

Combining / improving https://github.com/pytorch/pytorch/pull/150485 and https://github.com/pytorch/pytorch/pull/150343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151506
Approved by: https://github.com/ColinPeppler
2025-04-21 20:14:34 +00:00
191b0237a6 Added to docs for out_dtype arg in torch gemms (#151704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151704
Approved by: https://github.com/bdhirsh
2025-04-21 20:09:17 +00:00
1a6effc5d8 [torch] Expose PCI info from CUDA device (#151672)
Summary:
PR #125083 added CUDA device UUID info, but due to the Meta-internal [version of ROCm the code was excluded](https://github.com/pytorch/pytorch/pull/125083?fbclid=IwY2xjawJvLnNleHRuA2FlbQIxMQABHlY55crrkTqWBWTsr2HVfuqnZ3R1GHR3o9Kf1o3h3uvyawEmCEdhdT48iY1P_aem_8tfrGrWE9SxFYasGfH8kCQ#issuecomment-2103315320).

This change ensures the Meta-internal code is built and PCI info is available.

Test Plan: pass CI

Differential Revision: D73253426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151672
Approved by: https://github.com/Skylion007
2025-04-21 19:55:19 +00:00
2fb1326483 Add dates to pages (#151602)
re: #150873
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151602
Approved by: https://github.com/albanD
2025-04-21 19:53:55 +00:00
b7c7000728 Ensure runners have the required prefix (#151815)
Clone changes from https://github.com/pytorch/pytorch/pull/151696/ since that PR wouldn't merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151815
Approved by: https://github.com/seemethere
2025-04-21 19:09:17 +00:00
9680016bcf [MergeBot] Update PullRequestResolved Regex (#151814)
By copying an updated one from cff091f3f3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151814
Approved by: https://github.com/izaitsevfb, https://github.com/albanD
2025-04-21 19:02:05 +00:00
d79144da52 [BE] Move aarch64 docker build to larger node (#151808)
They happen once a week or so, not sure why it needs to be on the slowest machine possible

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151808
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2025-04-21 18:54:31 +00:00
fd04c79878 Revert "[aot autograd][logging] Profile large missing gaps in compile time tracing (#151256)"
This reverts commit 8e373592c8be3e28a5f5a774fc1d517aa3dbe8b4.

Reverted https://github.com/pytorch/pytorch/pull/151256 on behalf of https://github.com/Camyll due to breaking internal tests, cannot import ([comment](https://github.com/pytorch/pytorch/pull/151256#issuecomment-2819244186))
2025-04-21 18:49:23 +00:00
f37e138bc4 [MPS] Enable log1p and sigmoid for int64 (#151791)
It works on MacOS-15, but likely will need a skip for MacOS-13

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151791
Approved by: https://github.com/Skylion007
ghstack dependencies: #151790
2025-04-21 18:30:04 +00:00
e2b1c06319 [cutlass] Define GELU_taylor<float> only if CUTLASS version is <= 380 (#151702)
Summary:
#buildmore

df8a550d39/include/cutlass/epilogue/thread/activation.h (L610)
was added in v3.9 (not tagged yet)

Test Plan:
mostly ci.

Logic seems same.

Reviewed By: drisspg

Differential Revision: D72615240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151702
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-04-21 18:23:46 +00:00
0f8613bf5c Introduce unsafe way to mark functions as cacheable (#151603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151603
Approved by: https://github.com/jamesjwu
ghstack dependencies: #151768, #151609
2025-04-21 17:37:38 +00:00
67c2869a38 Unpack the output code in the standalone_compile (#151609)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151609
Approved by: https://github.com/zou3519
ghstack dependencies: #151768
2025-04-21 17:37:38 +00:00
287998b87f Run standalone compile tests on cpu/gpu (#151768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151768
Approved by: https://github.com/zou3519
2025-04-21 17:37:29 +00:00
cea43f721a [Testing] Unskip expm1 log1p for MPS (#151790)
But don't test them for unsupported dtypes (which is float64 for MPS)
- Skip int64 for log1p for now (next PR will fix that)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151790
Approved by: https://github.com/Skylion007
2025-04-21 17:18:47 +00:00
9374064483 Revert "[Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404)"
This reverts commit 783be8f93248ca3af24b968bdf84188f5a3257d1.

Reverted https://github.com/pytorch/pytorch/pull/151404 on behalf of https://github.com/malfet due to suspected of breaking linux builds and breaks internal tests as well ([comment](https://github.com/pytorch/pytorch/pull/151404#issuecomment-2819041756))
2025-04-21 17:11:53 +00:00
33808f0ebd Revert "[Easy] The event_id of torch.cuda.Event and torch.xpu.Event always is 0 (#151226)"
This reverts commit 8e5fefedf4af3f31ccd05290c1b21eedf6a4ad1b.

Reverted https://github.com/pytorch/pytorch/pull/151226 on behalf of https://github.com/malfet due to Reverting to unblock revert of https://github.com/pytorch/pytorch/pull/151404 ([comment](https://github.com/pytorch/pytorch/pull/151226#issuecomment-2819030735))
2025-04-21 17:07:49 +00:00
515a0f606b [ez] fix typo in comment (#151755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151755
Approved by: https://github.com/Skylion007
2025-04-21 14:52:39 +00:00
2eacdb91c3 Add OIDC permissions to xpu workflow (#151455)
The reusable workflow requires OIDC authentication to work, which is configured via its only caller, xpu.yml. Setting it here too clarifies that it is required. This setting also flags jobs that call this workflow without the required permissions, as a reminder that they need to be set.

JWT ID token requires `id-token: write` permissions as documented here https://docs.github.com/en/actions/security-for-github-actions/security-hardening-your-deployments/configuring-openid-connect-in-cloud-providers#adding-permissions-settings

Ref: pytorch-fdn/multicloud-ci-infra#3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151455
Approved by: https://github.com/chuanqi129, https://github.com/atalman
2025-04-21 14:39:40 +00:00
bf28d1cafc Expose bicubic mode for torch::nn::functional::grid_sample in LibTorch (#150817)
When bicubic interpolation was added to grid_sampler in #44780, `GridSampleFuncOptions` was not updated to allow a user to use bicubic mode in LibTorch, even though the function could handle it. This PR fixes the parity such that LibTorch's  `torch::nn::functional::grid_sample` behaves the same as PyTorch's `torch.nn.functional.grid_sample`.

Existing users can directly use `torch::grid_sampler` but must know what int to pass for the interpolation (2 for bicubic) and padding mode parameters, which is not ideal.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150817
Approved by: https://github.com/Skylion007
2025-04-21 08:55:27 +00:00
2a9afdae81 [Benchmarking] Add sam and stable_diffusion to MPS benchmarked models (#151748)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151748
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #151747
2025-04-21 05:51:46 +00:00
f7ddc5125e [Easy] Fix the compilation warning of BlasKernel. (#151736)
As the title states.

Before the change:
```C++
[2/21] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/BlasKernel.cpp.o
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:346:6: warning: ‘void at::native::blas_impl::gemv_fast_path(const char*, const int*, const int*, const scalar_t*, const scalar_t*, const int*, const scalar_t*, const int*, const scalar_t*, scalar_t*, const int*) [with scalar_t = c10::Half]’ defined but not used [-Wunused-function]
  346 | void gemv_fast_path<at::Half>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:329:6: warning: ‘bool at::native::blas_impl::gemv_use_fast_path(char, int64_t, int64_t, scalar_t, int64_t, int64_t, scalar_t, int64_t) [with scalar_t = c10::Half]’ defined but not used [-Wunused-function]
  329 | bool gemv_use_fast_path<at::Half>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:301:6: warning: ‘void at::native::blas_impl::gemv_fast_path(const char*, const int*, const int*, const scalar_t*, const scalar_t*, const int*, const scalar_t*, const int*, const scalar_t*, scalar_t*, const int*) [with scalar_t = c10::BFloat16]’ defined but not used [-Wunused-function]
  301 | void gemv_fast_path<at::BFloat16>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:273:6: warning: ‘bool at::native::blas_impl::gemv_use_fast_path(char, int64_t, int64_t, scalar_t, int64_t, int64_t, scalar_t, int64_t) [with scalar_t = c10::BFloat16]’ defined but not used [-Wunused-function]
  273 | bool gemv_use_fast_path<at::BFloat16>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151736
Approved by: https://github.com/shink, https://github.com/Skylion007
2025-04-21 03:31:46 +00:00
8eb21dffa9 consolidate ATen/test/dispatch_key_set_test.cpp with rest of DispatchKeySet tests (#151697)
Doesn't seem to be a reason to have two test files for this.

Differential Revision: [D73274020](https://our.internmc.facebook.com/intern/diff/D73274020/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151697
Approved by: https://github.com/Skylion007
ghstack dependencies: #151626, #151627, #151628, #151629, #151630
2025-04-21 02:58:12 +00:00
9c2ac2b876 [pytorch][triton] Enable warp spec for FlexAttention kernel (#150470)
Summary:
Given inductor support for warp-specialization for `TritonTemplateKernel`, this change adds:
- num_consumer_groups
- num_buffers_warp_spec

to the flexattention template generated by inductor in `torch.compile`.

NOTE: Currently default config doesn't enable warp-spec and needs explicit args for num_consumer_groups, num_buffers_warp_spec in the kernel options to enable.

Test Plan:
### Functional Testing
```Py
import torch
from torch.nn.attention.flex_attention import flex_attention
from triton.testing import do_bench
make_tensor = lambda: torch.rand(8, 16, 8192, 128, device="cuda", dtype=torch.bfloat16)
q, k, v = make_tensor(), make_tensor(), make_tensor()
flex_compiled = torch.compile(flex_attention, fullgraph=True)
print(do_bench(lambda: flex_compiled(q, k, v, kernel_options={"num_warps": 4, "num_consumer_groups": 2,
                "num_buffers_warp_spec": 3,})))
```
- (best config) without WS: 11.06
- with WS: 9.35

Differential Revision: D70501880

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150470
Approved by: https://github.com/drisspg
2025-04-21 02:00:55 +00:00
fc2dd6d408 [Inductor] Update should_decompose_mm condition for CPU (#151730)
Summary:
Similar to what we did previously in D70033166

Previously, for cpu we decompose addmm if
```
check_device(mat1, mat2, device="cpu")
        and statically_known_true(mat1.shape[0] == 1)
        and statically_known_true(mat2.shape[0] <= 64)
        and statically_known_true(mat2.shape[1] <= 512)
```
We have a new case where `mat1.shape[0] = 80`, and benchmarks show that decomposing is beneficial, so update the condition to
```
check_device(mat1, mat2, device="cpu")
        and statically_known_true(mat1.shape[0] == 1)
        and statically_known_true(mat2.shape[0] <= 128)
        and statically_known_true(mat2.shape[1] <= 512)
```
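The effect of raising the bound can be sketched without torch. In the sketch below, the function name `should_decompose_addmm_cpu`, the `max_mat2_rows` parameter, and the use of plain integer shape tuples in place of `statically_known_true` on symbolic sizes are all assumptions for illustration, and the 80-row `mat2` is simply a shape chosen to fall between the old and new bounds.

```python
def should_decompose_addmm_cpu(mat1_shape, mat2_shape, max_mat2_rows=128):
    """Hypothetical stand-in for the CPU addmm decomposition check.

    Shapes are plain tuples of ints; the real check evaluates each
    comparison with statically_known_true on possibly symbolic sizes.
    """
    return (
        mat1_shape[0] == 1
        and mat2_shape[0] <= max_mat2_rows
        and mat2_shape[1] <= 512
    )


# Illustrative (1, 80) @ (80, 512) product.
old_result = should_decompose_addmm_cpu((1, 80), (80, 512), max_mat2_rows=64)
new_result = should_decompose_addmm_cpu((1, 80), (80, 512), max_mat2_rows=128)
print(old_result, new_result)
```

Under these assumptions the shape is rejected with the old bound of 64 and accepted with the new bound of 128.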

Differential Revision: D73292985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151730
Approved by: https://github.com/kflu, https://github.com/houseroad
2025-04-21 01:56:47 +00:00
470132c6a1 [MPS] Add support for hermite_polynomial_he (inductor/eager). (#151754)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151754
Approved by: https://github.com/malfet, https://github.com/jansel
2025-04-20 17:44:40 +00:00
c3a7278278 Use more efficient row/col computation (#151474)
This change addresses the first/second time/mem "spike" observed in

https://github.com/pytorch/pytorch/issues/151351

Fixes #151351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151474
Approved by: https://github.com/eqy, https://github.com/amjames, https://github.com/Skylion007
2025-04-20 16:02:19 +00:00
6b45b6e6c9 run lintrunner for Export d68846308 (#151725)
fixes broken lint tests in https://github.com/pytorch/pytorch/pull/151481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151725
Approved by: https://github.com/exclamaforte, https://github.com/Skylion007

Co-authored-by: Gabriel Ferns <gabeferns@meta.com>
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-20 14:58:17 +00:00
a40e876b08 Support fp8 dtypes in assert_close (#150002)
Fixes #135998

Adds support for fp8. These are compared bitwise, without atol and rtol. The implementation uses the same comparison functions, just with atol and rtol forced to zero. The error message is different from the default case; it only tells the user the first mismatch. This is to avoid triggering the error from #135998.
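As a rough illustration of an exact comparison that reports only the first mismatch, here is a torch-free sketch. It treats each byte as the raw bit pattern of one fp8 element; the helper names `first_bitwise_mismatch` and `assert_bitwise_close` are hypothetical and are not part of `torch.testing`.

```python
def first_bitwise_mismatch(actual: bytes, expected: bytes):
    """Return the index of the first differing byte, or None if all match.

    Each byte stands in for the raw bit pattern of one fp8 element.
    """
    if len(actual) != len(expected):
        raise ValueError("inputs must have the same number of elements")
    for index, (a, e) in enumerate(zip(actual, expected)):
        if a != e:  # exact match required, i.e. no atol/rtol slack
            return index
    return None


def assert_bitwise_close(actual: bytes, expected: bytes) -> None:
    """Raise AssertionError describing only the first mismatch."""
    index = first_bitwise_mismatch(actual, expected)
    if index is not None:
        raise AssertionError(
            f"first mismatch at index {index}: "
            f"{actual[index]:#04x} != {expected[index]:#04x}"
        )
```

Stopping at the first mismatch keeps the error path cheap and avoids computing summary statistics over all elements.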

Test Plan:
New unit test covers new code paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150002
Approved by: https://github.com/cyyever, https://github.com/zou3519
2025-04-20 01:24:21 +00:00
48761e9737 Revert "[Easy] Fix the function signature of torch.Event (#151221)"
This reverts commit 92baeecbdd3fb717880485e529df4efb02627c9d.

Reverted https://github.com/pytorch/pytorch/pull/151221 on behalf of https://github.com/malfet due to This broke rocm tests, see 92baeecbdd (40818271233-box) ([comment](https://github.com/pytorch/pytorch/pull/151221#issuecomment-2816883409))
2025-04-19 22:06:24 +00:00
c4482565cc Revert "[Easy][torch.Event] Fix and improve the docs of torch.Event (#151411)"
This reverts commit 1e1d0a4be63b354f762ee21bdccec03c1e5b371c.

Reverted https://github.com/pytorch/pytorch/pull/151411 on behalf of https://github.com/malfet due to This broke rocm tests, see 92baeecbdd (40818271233-box) ([comment](https://github.com/pytorch/pytorch/pull/151221#issuecomment-2816883409))
2025-04-19 22:06:24 +00:00
9b74ea2490 [Benchmarking] Run MPS benchmarks for [b]float16 (#151747)
And implicitly pass `--float32` when collecting results for "notset" option. Speedups for some models are much higher for float16 dtype, but it's important to track accuracy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151747
Approved by: https://github.com/Skylion007
2025-04-19 16:40:08 +00:00
ed511cd537 [Testing] Make test_add_complex3 run on different devices (#151732)
By constructing tensor on that device, because it does not call `self.common` but rather executes test directly.

Otherwise `test_add_complex3_mps` will test CPU inductor, rather than MPS one

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151732
Approved by: https://github.com/dcci
2025-04-19 14:29:13 +00:00
483e61bfec [BE][Easy]: Simplify reversed call in graph matcher (#151674)
Another list call on reversed that is no longer necessary, since ItemViews support reversed directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151674
Approved by: https://github.com/albanD
2025-04-19 14:14:31 +00:00
68f748a992 Revert "[Testing] Make test_add_complex3 run on different devices (#151732)"
This reverts commit 414ce713fb329b20f93002fa4ffd6bb23bc3b93b.

Reverted https://github.com/pytorch/pytorch/pull/151732 on behalf of https://github.com/malfet due to It breaks MacOS-13 ([comment](https://github.com/pytorch/pytorch/pull/151732#issuecomment-2816690571))
2025-04-19 12:35:41 +00:00
1e1d0a4be6 [Easy][torch.Event] Fix and improve the docs of torch.Event (#151411)
**Changes:**
- add detailed function or class signature
- fix the wrong display of torch.Event.wait and torch.Event.record
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151411
Approved by: https://github.com/albanD
ghstack dependencies: #151226, #151221
2025-04-19 12:21:02 +00:00
92baeecbdd [Easy] Fix the function signature of torch.Event (#151221)
As the title stated.

There is a difference between the declaration and the implementation.
Declaration:
d5a19e4525/torch/_C/__init__.pyi.in (L157-L162)

Implementation:
d5a19e4525/torch/csrc/Event.cpp (L30-L32)

**Question**: Which one should we choose?
- Change enable_timing to False to be consistent with torch.cuda.Event
- Change enable_timing to True to avoid BC-break
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151221
Approved by: https://github.com/albanD
ghstack dependencies: #151226
2025-04-19 11:56:37 +00:00
8e5fefedf4 [Easy] The event_id of torch.cuda.Event and torch.xpu.Event always is 0 (#151226)
Although torch.cuda.Event and torch.xpu.Event have cuda_event and sycl_event fields respectively, the event_id exposed from the base class torch.Event is always 0, which can confuse users.

The memory of torch.Event is not useful to torch.cuda.Event and torch.xpu.Event, but we still need to inherit from torch.Event because CPython will check it.

Repro with cuda:
```
>>> import torch
>>> event = torch.cuda.Event()
>>> event.cuda_event
0
>>> event.event_id
0
>>> event.record()
>>> event.cuda_event
127982096
>>> event.event_id
0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151226
Approved by: https://github.com/albanD
2025-04-19 10:42:00 +00:00
92d0c40c49 Revert "Cache the value of torch_key in subproc (#151057)"
This reverts commit 5f5805a6ac44179520291b2aa6e18d286dc93669.

Reverted https://github.com/pytorch/pytorch/pull/151057 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/151057#issuecomment-2816614510))
2025-04-19 08:48:12 +00:00
f6c1cf04b5 [ROCm][TunableOp] Support submatrices in offline tuning (#151138)
This PR adds support for submatrices in offline tuning for:
- GEMM
- GEMM and bias
- ScaledGEMM
- Batch Strided GEMM

New UTs to cover submatrices. Submatrices for the strided batch API are not part of this PR and will be done separately.

There is also a bug fix for offline tuning for full matrix for GEMM and bias in the `NT` case. Offline and online UTs were updated to cover this corner case.

To improve code readability, swapped definition of transA and transB.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151138
Approved by: https://github.com/jeffdaily
2025-04-19 04:14:27 +00:00
2673ea4131 Add api to enable/disable NaN detector per-PG (#151723)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151723
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
2025-04-19 03:55:25 +00:00
414ce713fb [Testing] Make test_add_complex3 run on different devices (#151732)
By constructing tensor on that device, because it does not call `self.common` but rather executes test directly.

Otherwise `test_add_complex3_mps` will test CPU inductor, rather than MPS one

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151732
Approved by: https://github.com/dcci
2025-04-19 03:14:46 +00:00
6261db7719 Revert "inductor.config.descriptive_names = False is not actually supported (#145523) (#146051) (#151481)"
This reverts commit cfc4d74b0c9a0d21debbebb41e1dfa4dd2acf2a0.

Reverted https://github.com/pytorch/pytorch/pull/151481 on behalf of https://github.com/malfet due to It indeed breaks lint, and its follow-up PR contains its own issues ([comment](https://github.com/pytorch/pytorch/pull/151481#issuecomment-2816490764))
2025-04-19 03:12:56 +00:00
843e4d11ba [Benchmarking] Enable HF_GPT2 benchmarking on Metal (#151721)
By building wheel with USE_DISTRIBUTED=1

Otherwise, an attempt to run
```
python3 benchmarks/dynamo/torchbench.py --performance --only hf_T5 --backend inductor --inference --devices mps
```
will fail with
```
  File "/Users/nshulga/Library/Python/3.10/lib/python/site-packages/transformers/modeling_utils.py", line 40, in <module>
    import torch.distributed.tensor
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/__init__.py", line 4, in <module>
    import torch.distributed.tensor._ops  # force import all built-in dtensor ops
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/_ops/__init__.py", line 2, in <module>
    from ._conv_ops import *  # noqa: F403
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/_ops/_conv_ops.py", line 5, in <module>
    from torch.distributed.tensor._dtensor_spec import DTensorSpec, TensorMeta
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/_dtensor_spec.py", line 6, in <module>
    from torch.distributed.tensor.placement_types import (
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/placement_types.py", line 8, in <module>
    import torch.distributed._functional_collectives as funcol
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/_functional_collectives.py", line 9, in <module>
    import torch.distributed.distributed_c10d as c10d
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/distributed_c10d.py", line 23, in <module>
    from torch._C._distributed_c10d import (
ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a package
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151721
Approved by: https://github.com/wdvr, https://github.com/dcci, https://github.com/huydhn
2025-04-19 02:57:03 +00:00
cfc4d74b0c inductor.config.descriptive_names = False is not actually supported (#145523) (#146051) (#151481)
Summary:

This config is not supported (it throws an error when set), and doesn't really make sense imo.

Approved by: https://github.com/eellison

Test Plan: contbuild & OSS CI, see edf266e9bb

Reviewed By: masnesral

Differential Revision: D68846308

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151481
Approved by: https://github.com/masnesral
2025-04-19 01:13:35 +00:00
adf5f38eae Don't specialize min/max (#151347)
address https://github.com/pytorch/pytorch/issues/149635
Differential Revision: [D73041489](https://our.internmc.facebook.com/intern/diff/D73041489/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151347
Approved by: https://github.com/bobrenjc93
2025-04-19 00:11:15 +00:00
359e1d517c [Profiler] Remove Decref From Python Context (#151625)
Summary: When doing on-demand profiling with stack, the decref causes a segfault. I tried checking the refcount and the object itself and they both look fine, but it still segfaults every time. Let's remove it for now and revisit.

This will induce a small memory leak, but it should be small enough not to have any significant impact on the jobs being run.

Test Plan:
Removed decref and got clean traces
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/0/1744933624/localhost/libkineto_activities_2936811.json.gz&bucket=gpu_traces

Differential Revision: D73225468

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151625
Approved by: https://github.com/davidberard98
2025-04-18 23:55:19 +00:00
e48189cf03 Don't eagerly create AliasInfo in parseAliasDeclaration (#151630)
No need to create an AliasInfo...unless we need it.

Differential Revision: [D73129452](https://our.internmc.facebook.com/intern/diff/D73129452/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151630
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #151626, #151627, #151628, #151629
2025-04-18 22:51:37 +00:00
cac8d35503 Use fmt::format for debug strings in Library init (#151629)
Observed several ms taken during `import torch` by c10::str here.

Differential Revision: [D73129453](https://our.internmc.facebook.com/intern/diff/D73129453/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151629
Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/albanD, https://github.com/malfet
ghstack dependencies: #151626, #151627, #151628
2025-04-18 22:51:37 +00:00
313ceb4da3 Reserve vector in StringCordView ctor (#151628)
Clear missing reserve (we should expect that pieces are not empty).

Differential Revision: [D73129445](https://our.internmc.facebook.com/intern/diff/D73129445/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151628
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #151626, #151627
2025-04-18 22:51:29 +00:00
704a504e8a Reserve vectors in FunctionSchema::cloneWithRealTypes (#151627)
1) reserving is much better than not reserving
2) std::transform for a 1-line-body loop is generally not considered to be an improvement (and doesn't seem to get boiled away by clang under -Oz)

Differential Revision: [D73013363](https://our.internmc.facebook.com/intern/diff/D73013363/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151627
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #151626
2025-04-18 22:51:23 +00:00
fc7d493908 Overload Library::def rather than templating it (#151626)
It ends up being templated over a bunch of reference-to-array-of-characters types with different lengths, such as `char const (&) [88]`, which is an annoyance when profiling and possibly a source of code bloat.

Differential Revision: [D73129450](https://our.internmc.facebook.com/intern/diff/D73129450/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151626
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-04-18 22:51:16 +00:00
97d97aef24 Revert "[dynamic shapes] guard_or_false for _reshape_view_helper, utils._infer_size for wildcard dims (#150127)"
This reverts commit 1dd2033c0a1de460ee2bad8d64c36a0344886071.

Reverted https://github.com/pytorch/pytorch/pull/150127 on behalf of https://github.com/clee2000 due to maybe caused export test to fail? export/test_draft_export.py::TestDraftExport::test_masked_linear [GH job link](https://github.com/pytorch/pytorch/actions/runs/14538768138/job/40794985504) [HUD commit link](1dd2033c0a), bad TD ([comment](https://github.com/pytorch/pytorch/pull/150127#issuecomment-2816232086))
2025-04-18 21:38:47 +00:00
bd77c3e054 [easy] Update test/dynamo/test_structured_trace.py (#151606)
Summary: test/dynamo/test_structured_trace.py is out of date because of some new fields. (I guess the test is disabled?). Bring it up to date.

Test Plan: `python test/dynamo/test_structured_trace.py`

Fixes #149671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151606
Approved by: https://github.com/Skylion007
ghstack dependencies: #151599
2025-04-18 21:33:13 +00:00
56d318bfac [ONNX][Eazy] Update onnx program doc formatting and improve robustness (#151623)
- Update docstring list formatting
- Use a try finally block to keep the model unmodified if save() fails.
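The try/finally pattern mentioned in the list above can be sketched as follows. The `Program` class, its `model` attribute, and the `write` callback are hypothetical stand-ins, not the ONNX program API; the sketch only shows how a `finally` clause restores state when the write step raises.

```python
class Program:
    """Hypothetical stand-in for an exported program holding a model."""

    def __init__(self, model):
        self.model = model

    def save(self, write):
        original = self.model
        try:
            # Temporarily swap in a transformed copy for serialization.
            self.model = dict(original, prepared_for_save=True)
            write(self.model)  # may raise
        finally:
            # Runs on success and on failure, so the model is always restored.
            self.model = original
```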

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151623
Approved by: https://github.com/titaiwangms
2025-04-18 21:31:31 +00:00
02dd096e51 [invoke_subgraph][fake tensor] Add finalizer on subgraph instead of the functionalize ctx wrapper (#151633)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151633
Approved by: https://github.com/zou3519
ghstack dependencies: #151330, #151256, #151357, #151477
2025-04-18 21:23:21 +00:00
b74be52454 [CUDA][NVTX] Move nvtx3 code from cmake/public/cuda.cmake to cmake/Dependencies.cmake (#151583)
Fixes [#147220]

Context: In the CUDA NVTX world, there are NVTX v2 and NVTX v3. As announced in the CUDA release notes, e.g. [CUDA 12.8 Update 1](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#deprecated-or-dropped-operating-systems): "NVTX v2 is deprecated. To migrate to NVTX v3. Change your code from: `#include <nvtoolsext.h>` to `#include "nvtx3/nvtoolsext.h"`. This header is included in the toolkit."
On the PyTorch side, TORCH_CUDA_USE_NVTX3 compile time macro is used and it is set to true when (most of the time) nvtx3 is found. nvtx3 is found in two cases: 1) USE_SYSTEM_NVTX=0 (default), torch build process would automatically look for the nvtx3 in pytorch/third_party/nvtx. This is the most common and default case. 2) when USE_SYSTEM_NVTX=1 is used, nvtx3 is found from the installed CUDA toolkit (e.g. CUDA 12.8 and even some earlier cuda versions).
As described in #147220, the reason it can find pytorch/third_party/nvtx is because it used
6f035d8462/cmake/public/cuda.cmake (L176)
note the "PROJECT_SOURCE_DIR" usage in [pytorch/cmake/public/cuda.cmake](6f035d8462/cmake/public/cuda.cmake (L176))

Before this PR:
PyTorch build would succeed in finding nvtx3 due to the above described process, everything is good. But downstream projects like torchvision *can* fail, and would by default fail because the following are happening:
1) USE_SYSTEM_NVTX=0 is used (and most likely it is this case because it is the default)
2) NVTX v2 can no longer be found (e.g. future CUDA versions because deprecation would eventually become removal)
3) TorchVision cannot find NVTX3 either because torchvision was invoking [pytorch/cmake/public/cuda.cmake] but the PROJECT_SOURCE_DIR is no longer the pytorch source but torchvision source!
4) One workaround is to "USE_SYSTEM_NVTX=1" but users have to explicitly set this and do the plumbing work

After this PR:
PyTorch can still find nvtx3 because the part of the code that finds nvtx3 is just moved to a new place. The CI logs are showing it being able to find nvtx3. e.g. [this job](https://productionresultssa14.blob.core.windows.net/actions-results/47f8efaa-0afe-4e1f-bc94-0a82629941cb/workflow-job-run-dc8201b1-845b-5da1-a6ea-d3360ce1b508/logs/job/job-logs.txt?rsct=text%2Fplain&se=2025-04-18T20%3A38%3A05Z&sig=yMd6egC%2Banl3lR%2BudXFX18bfUH189z0DTGLtscHQJwY%3D&ske=2025-04-19T06%3A21%3A45Z&skoid=ca7593d4-ee42-46cd-af88-8b886a2f84eb&sks=b&skt=2025-04-18T18%3A21%3A45Z&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skv=2025-01-05&sp=r&spr=https&sr=b&st=2025-04-18T20%3A28%3A00Z&sv=2025-01-05), which reads "`Found nvtx3: C:/actions-runner/_work/pytorch/pytorch/pytorch/third_party/NVTX/c/include`"
For torchvision, it still invokes [pytorch/cmake/public/cuda.cmake] but it no longer tries to find nvtx3, as torchvision is not using nvtx3 (if it uses nvtx3 in the future, it can set USE_SYSTEM_NVTX=1 by default). So it would avoid the error reported in [#147220]

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151583
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/malfet
2025-04-18 21:18:09 +00:00
6e7b6e8d57 [c10d][fr] Fix a bug when first rank is not zero in the script (#151683)
Summary: While further testing the script, we found that we shouldn't always assume rank 0 is the first rank, so we need to check all entries and see whether there is a P2P op for this coalesced group.
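The idea of scanning every rank's entries instead of indexing rank 0 can be sketched as below. The `entries_by_rank` mapping, the op names, and the function `coalesced_group_has_p2p` are hypothetical and do not reflect the flight recorder script's actual data layout; whether the real check uses "any" or a stricter rule is also an assumption here.

```python
def coalesced_group_has_p2p(entries_by_rank):
    """Return True if any rank recorded a P2P op for the coalesced group.

    entries_by_rank is a hypothetical mapping {rank: [op_name, ...]}.
    """
    p2p_ops = {"send", "recv"}
    return any(
        op in p2p_ops
        for ops in entries_by_rank.values()
        for op in ops
    )


# The lowest participating rank here is 2, so a lookup of rank 0 would fail.
entries = {2: ["send"], 3: ["recv"]}
print(coalesced_group_has_p2p(entries))
```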

Test Plan: Directly test with corner case.

Differential Revision: D73266257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151683
Approved by: https://github.com/fegin
2025-04-18 20:55:06 +00:00
a6e46faff4 Use reusable binary docker build action for manywheel (#151489)
This is part of splitting up https://github.com/pytorch/pytorch/pull/150558 into smaller chunks, please see that for more context

Similar to https://github.com/pytorch/pytorch/pull/151483 but for manywheel

Changed the job name

s390x doesn't have access to aws ecr so it doesn't use the action.  manylinuxs390x-builder ecr repo doesn't exist in docker hub so idk why the image name is that

Testing:
Can't really test since PRs don't have the credentials to push to docker io, which is the image used for everything, including PRs right now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151489
Approved by: https://github.com/seemethere
2025-04-18 20:38:33 +00:00
b0f26e81a5 Use reusable binary docker build action for libtorch (#151488)
This is part of splitting up https://github.com/pytorch/pytorch/pull/150558 into smaller chunks, please see that for more context

Similar to https://github.com/pytorch/pytorch/pull/151483 but for libtorch

Changed the job name

Testing:
Can't really test since PRs don't have the credentials to push to docker io, which is the image used for everything, including PRs right now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151488
Approved by: https://github.com/atalman
2025-04-18 20:37:38 +00:00
88b0553c58 [AMD] Remove fbcode limit for uuid (#151652)
Summary: We're now w/ later rocm version so ok to add uuid back.

Test Plan: sandcastle

Differential Revision: D73240086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151652
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/houseroad
2025-04-18 20:37:09 +00:00
7ffa9000ed Replace perf-nightly-macos with inductor-perf-nightly-macos (#151698)
The name was updated by https://github.com/pytorch/pytorch/pull/151155.  The benchmark results weren't updated on the dashboard otherwise.

For PT2 compiler perf benchmark, we are still relying on this old workflow.  To get rid of this, we need to update PT2 benchmark dashboard to use the new benchmark database (cc @yangw-dev)

The results are there on the new database:

```
SELECT
    *
FROM
    oss_ci_benchmark_v3
WHERE
    workflow_id = 14510035576
```

but not on the old database:

```
SELECT
    *
FROM
    inductor_torch_dynamo_perf_stats
WHERE
    workflow_id = 14510035576
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151698
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-04-18 20:31:36 +00:00
1b267a58a1 Revert "[export] allow partially specifying keys for dynamic shapes dict spec (#151597)"
This reverts commit c8240e3492e4813e822d7265eb3afb7f1168db39.

Reverted https://github.com/pytorch/pytorch/pull/151597 on behalf of https://github.com/clee2000 due to broke some export test export/test_converter.py::TestConverter::test_aten_len [GH job link](https://github.com/pytorch/pytorch/actions/runs/14538615968/job/40792673415) [HUD commit link](c8240e3492), bad TD ([comment](https://github.com/pytorch/pytorch/pull/151597#issuecomment-2816127271))
2025-04-18 20:17:44 +00:00
f20a266512 [easy] Update test/dynamo/test_utils.py (#151599)
Summary: test/dynamo/test_utils.py is out of date because of some new dynamo_timed fields. (I guess the test is disabled?). Bring it up to date

Test Plan: `python test/dynamo/test_utils.py`

Fixes #148093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151599
Approved by: https://github.com/Skylion007
2025-04-18 18:49:24 +00:00
e434a9152e Revert "[inductor][test] Skip triton tests for MPS as well, also change reason for skipping SM89 to not IS_BIG_GPU (#151506)"
This reverts commit 6246c7d62ca2f091838d5c707e3d932994c5e35a.

Reverted https://github.com/pytorch/pytorch/pull/151506 on behalf of https://github.com/henrylhtsang due to seems to be breaking some rocm mi300 run ([comment](https://github.com/pytorch/pytorch/pull/151506#issuecomment-2815999009))
2025-04-18 18:40:17 +00:00
cccfc146fe [BE][Easy]: Simplify ModuleList reversed method (#151673)
Removes unnecessary list calls now that we are in Python 3.9 and KeyViews implement reversed directly.
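A small sketch of the simplification, using a plain dict in place of the module container (the names `modules`, `with_list`, and `direct` are illustrative only):

```python
modules = {"first": "conv", "second": "relu", "third": "pool"}

# Older pattern: materialize the keys into a list before reversing.
with_list = list(reversed(list(modules.keys())))

# Direct pattern: the keys view is reversible on its own.
direct = list(reversed(modules.keys()))

print(direct)
```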
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151673
Approved by: https://github.com/albanD
2025-04-18 18:39:32 +00:00
b7807759de Revert "stage 2 of depreate silent fallback of tuning gemm (#148622)"
This reverts commit 181b3883e71b9771e8a3cdaf43d627f68e9f0fa6.

Reverted https://github.com/pytorch/pytorch/pull/148622 on behalf of https://github.com/henrylhtsang due to seems to be breaking some rocm mi300 run ([comment](https://github.com/pytorch/pytorch/pull/148622#issuecomment-2815995105))
2025-04-18 18:37:09 +00:00
b73606dcc5 Add jk for force_disable_caches (#151621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151621
Approved by: https://github.com/jamesjwu
2025-04-18 18:19:40 +00:00
9ccdeae7db Fix uint view copy (#151598)
Fix for https://github.com/pytorch/pytorch/issues/151156. We have some logic to undo our upcast prior to dtype bitcast. This pr cleans up that logic using dtypes in codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151598
Approved by: https://github.com/zou3519
ghstack dependencies: #151562
2025-04-18 18:13:39 +00:00
28974a1ec3 Revert "[Easy] Fix the compilation warning of BlasKernel. (#151302)"
This reverts commit 32c79da789af84312a0db2de19211a7c57196ba7.

Reverted https://github.com/pytorch/pytorch/pull/151302 on behalf of https://github.com/malfet due to Breaks builds without OpenMP, see https://github.com/pytorch/pytorch/issues/151680 ([comment](https://github.com/pytorch/pytorch/pull/151302#issuecomment-2815954855))
2025-04-18 18:10:45 +00:00
115a0c6413 add privateuse1 device type to pre forward hook of fsdp (#149487)
add privateuse1 device type to pre forward hook of fsdp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149487
Approved by: https://github.com/FFFrog, https://github.com/cyyever, https://github.com/shink, https://github.com/albanD
2025-04-18 17:50:23 +00:00
1a48382a4c [Easy] Optimize container.py typing (#151653)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151653
Approved by: https://github.com/albanD
2025-04-18 17:33:43 +00:00
931bd05560 Do not propagate real tensor in extern kernel (#151377)
Summary: See internal Diff for more details.

In ExternKernel, the FakeTensors do not have associated real tensors, because they are just created from ir.Node's shape and stride.

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_data_dependent_ex

buck2 run mode/dev-nosan  fbcode//caffe2/test/inductor:aot_inductor_arrayref_cpu -- -r data_dependent_extern_kernel_op
```

Differential Revision: D73002775

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151377
Approved by: https://github.com/angelayi
2025-04-18 17:28:13 +00:00
181b3883e7 stage 2 of depreate silent fallback of tuning gemm (#148622)
context: https://github.com/pytorch/pytorch/issues/147479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148622
Approved by: https://github.com/eellison
ghstack dependencies: #151506
2025-04-18 17:26:16 +00:00
6246c7d62c [inductor][test] Skip triton tests for MPS as well, also change reason for skipping SM89 to not IS_BIG_GPU (#151506)
Differential Revision:
[D73162091](https://our.internmc.facebook.com/intern/diff/D73162091/)

Combining / improving https://github.com/pytorch/pytorch/pull/150485 and https://github.com/pytorch/pytorch/pull/150343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151506
Approved by: https://github.com/ColinPeppler
2025-04-18 17:26:16 +00:00
1dd2033c0a [dynamic shapes] guard_or_false for _reshape_view_helper, utils._infer_size for wildcard dims (#150127)
For reshape/view: removes fast paths for 0 elements, checking dimensions to skip. Modifies the loop accumulating input elements, to raise a UserError if we run out of dimensions, graph breaking for compile and erroring out for export.
For infer_size: assumes if user passes us an unbacked, it's probably not -1
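For reference, resolving a `-1` wildcard with concrete integer sizes can be sketched as follows. This is a simplified, torch-free illustration with a hypothetical signature; it does not model unbacked symbolic sizes, where the `dim == -1` test cannot be evaluated and, per the description above, the size is assumed not to be `-1`.

```python
def infer_size(shape, numel):
    """Resolve a single -1 wildcard in shape so the product equals numel."""
    known = 1
    wildcard = None
    for i, dim in enumerate(shape):
        if dim == -1:
            if wildcard is not None:
                raise ValueError("only one dimension can be inferred")
            wildcard = i
        elif dim >= 0:
            known *= dim
        else:
            raise ValueError(f"invalid shape dimension {dim}")
    if wildcard is None:
        if known != numel:
            raise ValueError(f"shape {tuple(shape)} is invalid for input of size {numel}")
        return tuple(shape)
    # The wildcard must absorb an exact integer number of elements.
    if known == 0 or numel % known != 0:
        raise ValueError(f"shape {tuple(shape)} is invalid for input of size {numel}")
    resolved = list(shape)
    resolved[wildcard] = numel // known
    return tuple(resolved)
```

For example, `infer_size((2, -1), 6)` resolves the wildcard to 3, while `infer_size((4, -1), 6)` raises because 6 is not divisible by 4.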

Will think about changes in https://docs.google.com/document/d/1WYx6EZwVDXtBnWyrzoecgGWdiK0V3XZKftfpWwQ5i3E/edit?tab=t.0#heading=h.22k54zym11qp in a later PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150127
Approved by: https://github.com/laithsakka
2025-04-18 17:05:11 +00:00
c8240e3492 [export] allow partially specifying keys for dynamic shapes dict spec (#151597)
Fixes #148564

Should help with exporting HF-style models, so users don't have to specify 100 Nones

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151597
Approved by: https://github.com/angelayi
2025-04-18 16:53:01 +00:00
9eaaca2ece Turn off symm_mem when cuda version is <12.3 (#151203)
Summary: Symmetric memory appears to support only CUDA 12.3+. For versions below 12.3 we have the definition but no implementation, so it seems better to disable the definition as well.

Test Plan: CI

Reviewed By: jianyuh, houseroad, ngimel, jiawenliu64

Differential Revision: D72936993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151203
Approved by: https://github.com/ngimel, https://github.com/houseroad
2025-04-18 16:37:12 +00:00
783be8f932 [Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404)
As the title stated

**Changes:**
- Add **record**, **query** and **enable_timing** check
- Add related tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151404
Approved by: https://github.com/albanD
2025-04-18 15:26:13 +00:00
29317f8585 [standalone_compile] Some misc fixes (#151502)
This PR fixes two things.

The first problem is that in the vLLM style standalone_compile is
called from within a custom torch.compile backend. If there already is a
FakeTensorMode (which there is), we shouldn't create a new
FakeTensorMode with the same shape_env, instead we should just reuse the
same FakeTensorMode.

The second thing is that compile_fx can mutate the passed in gm, so we
deepcopy (since standalone_compile should be standalone)
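
The second fix follows a common defensive-copy pattern. A minimal sketch, using a plain dict as a stand-in for the graph module and a hypothetical `mutating_compile` in place of `compile_fx`:

```python
import copy


def mutating_compile(graph):
    """Stand-in for a compile step that modifies its input in place."""
    graph["nodes"].append("inserted_by_compiler")
    return len(graph["nodes"])


def standalone(graph):
    """Compile a private deep copy so the caller's object is left untouched."""
    return mutating_compile(copy.deepcopy(graph))


original = {"nodes": ["placeholder", "add", "output"]}
node_count = standalone(original)
```

After the call, `original` still has its three nodes, while the compiled copy saw four.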

Test Plan:
- new test
- updated old tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151502
Approved by: https://github.com/oulgen
ghstack dependencies: #151501, #151551
2025-04-18 12:34:13 +00:00
58310a0043 [standalone_compile] support multiple returns (#151551)
We were only returning the first one. There's an edge case on what to do
if the original function returns a single Tensor. capture(f) returns a
function that returns a tuple of one Tensor in this case and we were
originally converting this back to one single Tensor. I think it's fine
to return a tuple of one Tensor (that is what the graph passed to
standalone_compile asked for!) but we can revisit.

Test Plan:
- modified one test to use multiple outputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151551
Approved by: https://github.com/Skylion007, https://github.com/oulgen
ghstack dependencies: #151501
2025-04-18 12:34:13 +00:00
ac715e96b4 [standalone_compile] Don't check if path is directory if it doesn't exist (#151501)
os.path.isdir(path) will return False if the path doesn't exist.
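
A small standard-library sketch of why this matters: a bare `not os.path.isdir(path)` check also fires for paths that have not been created yet. The `rejects_path` guard and the file names below are hypothetical, and the guard is only one plausible reading of the fix.

```python
import os
import tempfile


def rejects_path(path):
    """Hypothetical guard: reject only paths that exist and are not directories."""
    return os.path.exists(path) and not os.path.isdir(path)


with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "not_created_yet")  # placeholder name
    file_path = os.path.join(tmp, "a_file")
    with open(file_path, "w") as f:
        f.write("x")

    missing_is_dir = os.path.isdir(missing)             # no exception for a missing path
    naive_rejects_missing = not os.path.isdir(missing)  # the naive check misfires here
    guard_rejects_missing = rejects_path(missing)
    guard_rejects_file = rejects_path(file_path)
    guard_rejects_dir = rejects_path(tmp)
```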

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151501
Approved by: https://github.com/Skylion007, https://github.com/oulgen
2025-04-18 12:34:13 +00:00
14293c2377 [MPS] Allow isin for mixed types (#151600)
To follow the pattern set by the CPU and CUDA impls: define common_dtype and optionally cast `elements` and `test_elements` to the common dtype if needed

- Add regression test, though skip it on MacOS-13, as `isin` seems to produce garbage there even for same dtypes
```
>>> import torch
>>> x=torch.arange(4.0, device='mps')
>>> y=torch.arange(1.0, 3.0, device='mps')
>>> x, y, torch.isin(x, y), torch.isin(y, x)
(tensor([0., 1., 2., 3.], device='mps:0'), tensor([1., 2.], device='mps:0'), tensor([False,  True, False, False], device='mps:0'), tensor([False, False], device='mps:0'))
>>> torch.__version__
'2.6.0'
```
- Cleanup code a bit

Fixes https://github.com/pytorch/pytorch/issues/151443
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151600
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/kulinseth
2025-04-18 12:30:32 +00:00
675f69f40f collect_env: gracefully handle no pip (#151607)
If pip is not installed:

### Before

```console
> python3 torch/utils/collect_env.py
Collecting environment information...
Traceback (most recent call last):
  File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 694, in <module>
    main()
    ~~~~^^
  File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 677, in main
    output = get_pretty_env_info()
  File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 672, in get_pretty_env_info
    return pretty_str(get_env_info())
                      ~~~~~~~~~~~~^^
  File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 497, in get_env_info
    pip_version, pip_list_output = get_pip_packages(run_lambda)
                                   ~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 450, in get_pip_packages
    for line in out.splitlines()
                ^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'splitlines'
```

### After

```console
> python3 torch/utils/collect_env.py
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: macOS 15.4 (arm64)
GCC version: Could not collect
Clang version: 20.1.0
CMake version: version 3.31.6
Libc version: N/A

Python version: 3.13.2 (main, Apr  8 2025, 15:27:33) [Clang 17.0.0 (clang-1700.0.13.3)] (64-bit runtime)
Python platform: macOS-15.4-arm64-arm-64bit-Mach-O
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Apple M2 Pro

Versions of relevant libraries:
[pip3] Could not collect
[conda] Could not collect
```
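
The traceback in the "Before" output comes from calling `.splitlines()` on `None`. A minimal sketch of the kind of guard that avoids it, assuming a hypothetical helper `parse_pip_list` (this is not the actual `collect_env.py` code):

```python
from typing import List, Optional


def parse_pip_list(out: Optional[str]) -> Optional[List[str]]:
    """Split `pip list` output into lines, tolerating a missing pip.

    `out` is None when the pip command could not be run at all.
    """
    if out is None:
        # Caller can report "Could not collect" instead of crashing.
        return None
    return [line for line in out.splitlines() if line.strip()]


no_pip = parse_pip_list(None)
with_pip = parse_pip_list("numpy==2.0.0\n\ntorch==2.6.0\n")
```
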
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151607
Approved by: https://github.com/malfet
2025-04-18 12:28:58 +00:00
776aa68221 Update torch-xpu-ops commit pin (#150827)
Update the torch-xpu-ops commit to [b51dd3ef4f4d0f6b44c59e61431c5d29354dcaf6](b51dd3ef4f), including:
- Update commit pin to xpu-ops main branch
- Fixes batch_norm numeric error by adding additional boundary check
- Enable two operators: fft & jagged_to_padded_dense
- XCCL relevant changes:
1. Cache `cclStream` to improve performance.
2. Add support for complex datatypes in `allgather` and `broadcast`.
3. Support `coalescing` operations and `batch_isend_irecv`.
4. Introduce additional logging; use `export TORCH_CPP_LOG_LEVEL=INFO`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150827
Approved by: https://github.com/EikanWang

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-04-18 10:12:59 +00:00
0376bbf5b3 [XPU] skip a subprocess UT for Windows (#150999)
This case creates a subprocess inside a subprocess. On Windows the function cannot be loaded in this scenario, so the test has to be skipped:
```
File "C:\ProgramData\miniforge3\envs\lfq\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\ProgramData\miniforge3\envs\lfq\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'run_model' on <module '__main__' (built-in)>
Traceback (most recent call last):
  File "<string>", line 25, in <module>
  File "<string>", line 16, in test_multi_process
AssertionError
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150999
Approved by: https://github.com/guangyey, https://github.com/EikanWang

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-04-18 08:55:47 +00:00
541f8cd34c faster gather implementation (#151490)
So far it's only for `gather`, but we'll move index_select and index to this implementation too. Torchtitan and fbgemm have noticed that gather/index_select perf is bad; this PR brings the core implementation on par with those customized implementations. Added benefits: all dtypes are supported, and the fast path is a bit less strict about tensor dimensions/contiguity because we pick it after TensorIterator has collapsed the dimensions.

Biggest part of this PR is not even the kernel (it's dumb, just vectorized loads are enough), but moving utilities for vectorized loads and stores from SymmetricMemory to be generally accessible in MemoryAccess.cuh.
Additional tests are coming to make sure this implementation doesn't break anything

`gather` is equivalent to x[indices] for 1d indices via
```
def fn_gather(x, indices):
    return torch.gather(x, dim=0, index=indices.unsqueeze(1).expand(-1, x.shape[1]))

def fn_index(x, indices):
    return x[indices]
```
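
The equivalence can be checked by hand with nested lists. The sketch below is a pure-Python model of `gather` along dim 0 for 2-D inputs (the helper `gather_dim0` and the sample values are illustrative assumptions, not PyTorch code):

```python
def gather_dim0(x, index):
    """Pure-Python model of gather along dim 0: out[i][j] = x[index[i][j]][j]."""
    return [
        [x[index[i][j]][j] for j in range(len(index[i]))]
        for i in range(len(index))
    ]


x = [[10, 11], [20, 21], [30, 31]]
indices = [2, 0]

# Mirrors indices.unsqueeze(1).expand(-1, x.shape[1]): repeat each index across the row.
expanded = [[i] * len(x[0]) for i in indices]

via_gather = gather_dim0(x, expanded)
via_index = [x[i] for i in indices]  # mirrors x[indices]
```

Both `via_gather` and `via_index` select whole rows 2 and 0 of `x`.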

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151490
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-04-18 07:48:31 +00:00
eb1f85a2a0 Support C++ statically_known_true (#151346)
Differential Revision: [D73040543](https://our.internmc.facebook.com/intern/diff/D73040543/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151346
Approved by: https://github.com/laithsakka
2025-04-18 06:42:12 +00:00
8895c290f4 [Easy] enable PYFMT for torch/quantization/eager (#150761)
All modifications are done through tools, the detailed commands are as follows:

```bash
lintrunner -a --take "PYFMT" --all-files
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150761
Approved by: https://github.com/jerryzh168
2025-04-18 05:53:33 +00:00
91b090c912 [executorch hash update] update the pinned executorch hash (#151632)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151632
Approved by: https://github.com/pytorchbot
2025-04-18 05:07:28 +00:00
6649ed9deb [ez] fix code owners typo (#151499)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151499
Approved by: https://github.com/laithsakka
2025-04-18 04:24:16 +00:00
bedefa46a9 Document non-pytorch CUDA memory allocation and how to query it (#150880)
This PR documents the fact that PyTorch does not have visibility into how every CUDA memory allocation happened - it only knows about allocations that went through the PyTorch CUDA allocator.

It also adds a code snippet showing how to use pynvml to query current GPU memory usage.

## Preview
Added a note at the top of "Understanding CUDA Memory Usage" doc:
<img width="732" alt="image" src="https://github.com/user-attachments/assets/69e28d2a-841a-4b1b-b886-e96fb5d76582" />

which links to a section below:
<img width="733" alt="image" src="https://github.com/user-attachments/assets/cab4f252-9ac2-4fc6-a45d-fdb958fc7dbc" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150880
Approved by: https://github.com/kwen2501, https://github.com/ngimel
2025-04-18 03:48:54 +00:00
7d282da449 Add automatic categorization for release notes: inductor (aoti) (#151569)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151569
Approved by: https://github.com/desertfire
ghstack dependencies: #151453
2025-04-18 03:39:06 +00:00
2426258789 [doc fix] fix torch export docs for preserve_module_call_signature (#151140)
The preserve_module_call_signature explanation is missing in the __init__.py. Copying that from _trace.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151140
Approved by: https://github.com/angelayi
2025-04-18 02:55:35 +00:00
33cfe30ee1 Add HostAllocator as the unified parent class (#151431)
# Motivation
This PR introduces a unified parent class `HostAllocator` with the following goals:
1. Enable backend-specific host allocator registration, including support for out-of-tree backends.
2. Provide a unified and extensible API surface for host memory management across all backends, especially accelerators.

The new interface includes:
- `at::getHostAllocator()->allocate`
- `at::getHostAllocator()->empty_cache`
- `at::getHostAllocator()->record_event`
- `at::getHostAllocator()->get_stats`
- `at::getHostAllocator()->reset_accumulated_stats`
- `at::getHostAllocator()->reset_peak_stats`

# Additional Context
We plan to deprecate legacy APIs such as `at::cuda::CachingHostAllocator_emptyCache` and recommend users migrate to the new backend-specific API, for example:
```cpp
at::getHostAllocator(at::kCUDA)->empty_cache();
```
This refactor will help standardize host memory management across devices and simplify backend integration in the future.
Another key improvement I plan to make is moving the `is_pinned` functionality into the `HostAllocator` class, which enables centralized pinned memory verification through calls like `at::getHostAllocator(at::kCUDA)->is_pinned(ptr)`.
Benefits include:
 - Consistent host memory handling across all device backends
 - Decouple pinned memory functionality from `AcceleratorHooksInterface` in a more modular way
 - Clearer separation between device memory allocation and pinned host memory management

This architecture makes the system more maintainable and extensible for future device support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151431
Approved by: https://github.com/albanD
ghstack dependencies: #151403
2025-04-18 02:44:17 +00:00
1cc5a8452b [Openreg][PrivateUse1] Fix releasing tensor issue when using pin_memory (#151091)
As the title stated.

Related PR: https://github.com/pytorch/pytorch/pull/147066

Co-authored-by: Zhenbin Lin <lin-zhenbin@qq.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151091
Approved by: https://github.com/albanD
ghstack dependencies: #151007
2025-04-18 02:40:07 +00:00
3528488061 [Openreg][PrivateUse1] Enable CI for openreg (#151007)
Changes:
- move test_openreg.py from test/cpp_extensions/open_registration_extension/ to test/
- update README.md for openreg
- enable CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151007
Approved by: https://github.com/albanD
2025-04-18 02:40:07 +00:00
09e8ff92cc refresh benchmark results (#151622)
updating due to <1.5% increases in https://github.com/pytorch/pytorch/pull/151469
not all benchmarks were updated

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151622
Approved by: https://github.com/oulgen
2025-04-18 02:39:13 +00:00
98c892749b c10d/Store: add nonblocking mode to queue_pop (#151485)
This adds a non-blocking mode to queue_pop, which lets workers poll for ready work without blocking the main loop. This is useful when you want to keep a GPU at maximum utilization while items arrive on the queue only periodically.

We also expose a `torch.distributed.QueueEmptyError` so users can catch the error and handle it accordingly.
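
The polling pattern looks roughly like the sketch below. To stay self-contained it uses the standard library's `queue.Queue` as an analogy, with `get_nowait()` and `queue.Empty` standing in for a non-blocking `queue_pop` and `torch.distributed.QueueEmptyError`; it does not call the c10d Store API, and `poll_once` is a hypothetical helper.

```python
import queue


def poll_once(work_queue):
    """Return the next item if one is ready, otherwise None, without blocking."""
    try:
        return work_queue.get_nowait()
    except queue.Empty:
        # Nothing queued: the caller can carry on with other work and retry later.
        return None


q = queue.Queue()
first = poll_once(q)   # queue is still empty
q.put("job-0")
second = poll_once(q)  # an item is now available
```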

Test plan:

```
pytest test/distributed/test_store.py -k queue -v -s -x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151485
Approved by: https://github.com/fduwjj, https://github.com/tianfengfrank
2025-04-18 02:14:50 +00:00
3ed5f1fb77 [CUDA][cuBLAS] Aten GEMM overload for FP32 output from FP16/BF16 inputs (#150812)
Enable FP32 output from FP16/BF16 GEMMs in aten with cuBLAS. Accumulation for these GEMMs is generally already done in FP32. Adds the functionality to the following aten operators:
* mm
* bmm
* addmm
* baddmm

Follow up of customer issue: https://github.com/pytorch/pytorch/issues/146241#issuecomment-2781889390

Differential Revision: [D73126191](https://our.internmc.facebook.com/intern/diff/D73126191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150812
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-04-18 01:53:26 +00:00
a6182903cd Update PyTorchStreamReader API to take cpu allocator override (#150439)
Summary: Add allocator param in getRecord

Test Plan:
newly added UT
```
buck test caffe2/caffe2/serialize:inline_container_test
```

Differential Revision: D72252585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150439
Approved by: https://github.com/albanD
2025-04-18 01:53:14 +00:00
b434322075 Fix has_free_symbols (#151492)
used to fail for
        self.assertFalse(has_free_symbols(sympy.S.true))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151492
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #151170, #151171
2025-04-18 01:19:01 +00:00
c2a202169d Fix implicit state dict modification (#151436)
Summary: Previously we were modifying ep.state_dict while running decomp, which we shouldn't

Test Plan: CI

Fixes: https://github.com/pytorch/pytorch/issues/151366

Differential Revision: D73102315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151436
Approved by: https://github.com/angelayi
2025-04-18 00:58:55 +00:00
34266836d5 [Inductor] Suppress cuda init error for CPU only Inductor (#151528)
**Summary**
After https://github.com/pytorch/pytorch/pull/151255, invoking `torch.compile` on a non-CUDA device prints the following error:
`E0416 23:39:55.953000 418833 torch/_inductor/codegen/cuda/cuda_env.py:22] Error getting cuda arch: Torch not compiled with CUDA enabled.`
This PR updates the code to initialize `PRESETS` only when CUDA is available, preventing this error message from being printed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151528
Approved by: https://github.com/jansel, https://github.com/henrylhtsang
2025-04-18 00:55:01 +00:00
9e235c549c [C10D] avoid computing global_rank when group_rank is used (#151373)
collective APIs accept either group or global rank for src/dst rank.

We provide a helper `_canonicalize_group_rank` which converts from maybe
group or maybe global to one particular format (defined by the kwarg
return_global: bool=False).

In this PR we stop performing the mapping lookup that converts group to
global or global to group in the case that the caller wants us to return
the same value that was passed in.  The PR should be functionally
equivalent, except in cases where the mapping itself would raise an
exception but the mapping was not necessary in the first place.

This has come up in cases where people create new process groups outside
of 'init_process_group' APIs and group-specific ranks may not have a
valid mapping to the 'global' rank.
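
A simplified sketch of the short-circuit described above. The real helper takes a process group; here a plain dict `group_to_global` stands in for the rank mapping, and the function name and error handling are assumptions for illustration only.

```python
def canonicalize_rank(group_to_global, global_rank=None, group_rank=None,
                      return_global=False):
    """Return the rank in the requested format, consulting the mapping only if needed."""
    if (global_rank is None) == (group_rank is None):
        raise ValueError("pass exactly one of global_rank or group_rank")

    if return_global:
        if global_rank is not None:
            return global_rank  # already global: no lookup
        return group_to_global[group_rank]

    if group_rank is not None:
        return group_rank  # already group-local: no lookup

    # global -> group requires searching the mapping.
    for g_rank, glob in group_to_global.items():
        if glob == global_rank:
            return g_rank
    raise KeyError(f"global rank {global_rank} is not in this group")


# With no valid mapping at all, pass-through calls still succeed.
same_group = canonicalize_rank({}, group_rank=1)
same_global = canonicalize_rank({}, global_rank=5, return_global=True)

# With a mapping, conversions behave as before.
mapping = {0: 4, 1: 6}
to_global = canonicalize_rank(mapping, group_rank=1, return_global=True)
to_group = canonicalize_rank(mapping, global_rank=4)
```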

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151373
Approved by: https://github.com/xunnanxu, https://github.com/d4l3k
2025-04-17 23:53:50 +00:00
d8bafd23ab [DDP] add one option to allow skipping all reduce unused parameters (#151503)
Summary: Add an option to skip all-reduce for unused parameters. This can improve training throughput significantly when the model has a large number of unused parameters.

Test Plan: unit tests, CI

Differential Revision: D72282069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151503
Approved by: https://github.com/mrshenli
2025-04-17 23:30:19 +00:00
6d46b530fc Remove libdevice ops in inductor (#151562)
Now that we track dtypes during codegen, we can delete all these extra ops that worked around the problem by doing dispatch at lowering time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151562
Approved by: https://github.com/isuruf, https://github.com/jansel
2025-04-17 22:18:00 +00:00
bdb34f55a0 [fake tensor cache] Support index with non bool/int8 indices (#151477)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151477
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
ghstack dependencies: #151330, #151256, #151357
2025-04-17 21:51:08 +00:00
0129c3a4e1 Use reusable binary docker build action for almalinux, clean up script (#151483)
This is part of splitting up https://github.com/pytorch/pytorch/pull/150558 into smaller chunks, please see that for more context

Use the binary docker build action from https://github.com/pytorch/pytorch/pull/151471

Change the workflow trigger to be all of .ci/docker so it will make a new image + tag whenever it changes.

build script:
* change to be independent of the CUDA_VERSION env var, since all the info should be in the imagename:tag
* remove docker push parts since that will happen during the workflow
* clean up a bit
* make the build script more like the CI build script (use a temp image name)

I don't think this image is actually used anywhere

Also push the docker image to imagename:tag. I removed this in the PR that created the reusable workflow because I thought it was not in the original scripts, but it actually is there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151483
Approved by: https://github.com/ZainRizvi
2025-04-17 21:32:56 +00:00
652fa451a4 [dynamo] support fb internal bytecode EAGER_IMPORT_NAME (#151362)
Differential Revision: [D73127097](https://our.internmc.facebook.com/intern/diff/D73127097)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151362
Approved by: https://github.com/oulgen
2025-04-17 21:19:45 +00:00
d5dda82586 [export] Integrate meta kernel generation with draft-export (#150809)
If a custom operator does not contain a fake impl, currently draft-export will use the real-tensor propagation to get an output for the operator and continue tracing. However if we retrace the exported model using `ep.run_decompositions`, or `export`, or run the exported program with fake tensors, we'll still fail because there's no fake impl.

With this PR, after draft-export we will generate an operator profile for each operator call that we encounter, and store this on the report attached to the exported program `ep._report.op_profiles`. Users can then use `torch._library.fake_profile.register_fake_profile` to temporarily generate and register a fake impl based on these operator profiles. This way future fake tensor retracing will work.

The workflow would look something like:
```python
class M(torch.nn.Module):
    def forward(self, a, b):
        res = torch.ops.mylib.foo8(a, b)  # no fake impl
        return res

ep = export(M(), (torch.ones(3, 4), torch.ones(3, 4)))  # this fails bc no fake impl
ep = draft_export(M(), (torch.ones(3, 4), torch.ones(3, 4)))

ep.run_decompositions()  # this fails bc no fake impl
# this registers fake impls based on the profiles
with torch._library.fake_profile.register_fake_profile(ep._report.op_profiles):
    decomp = ep.run_decompositions()  # this works

new_inp = (
    torch.ones(2, 3, 4),
    torch.ones(2, 3, 4),
)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150809
Approved by: https://github.com/zou3519
2025-04-17 20:52:31 +00:00
4f62dccbda [Cutlass] Implement Epilogue Argument emitter (#150903)
This implements epilogue visitor tree argument generation (example type [here](3fe62887d8/include/cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp (L332))).

Details:
The codegen task here is to implement a function which can generate a tree of C++ structs and properly extract the correct properties from Inductor buffers and write them to the correct locations in the generated struct. To implement this with the minimum amount of code, I generate the cutlass DAGIR (the EVT internal representation) which specifically has a pass, [pass_argument_type.py](5e497243f7/python/cutlass/backend/evt/passes/pass_argument_type.py (L4)) which generates a nested tree of custom argument types for each node in the DAGIR. This nested tree of constructors is then passed kwargs to fill in the proper values, where the node's name is used to differentiate between different values in the kwarg dictionary. This however is non-customizable; the nested tree of EVT args is a nested tree of ctypes which looks for *actual values* so that this object can be passed directly to the cutlass-python C++ runner. Inductor on the other hand needs to fill this struct with string C++ expressions representing the values (or extracting the values from kernel launcher args). So `_render_argument_type` implements this: it iterates over the tree of types created by pass_argument_type.py and generates a string representing the nested structs, filling in C++ expressions representing the different fields.

Long term plan:
Long term, I will ask NVIDIA to provide an overridable [visitor_factory](5e497243f7/python/cutlass/backend/evt/passes/pass_argument_type.py (L82)) which could allow us to override the behavior of pass_argument_type.py to generate the string we would like during DAGIR generation.

Previously merged:
* #150346
* #150345
* #150344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150903
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-04-17 20:30:21 +00:00
8e0f9fbccf [c10] helpers for runtime c10::alias re-use (#151361)
Summary: we need these to check whether the input tensor was re-sized/strided between executions when choosing to alias

Test Plan: CI

Reviewed By: henryoier

Differential Revision: D73061676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151361
Approved by: https://github.com/SherlockNoMad
2025-04-17 20:27:17 +00:00
da580123a0 [BE][Easy]: Dedupe a TypeAlias in PrimsCommon (#151565)
Replaces a duplicate TypeAlias with a reference to the global constant for them
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151565
Approved by: https://github.com/albanD
2025-04-17 19:59:41 +00:00
c4688af254 Fix lint
Introduced by fb6ac2f16132f7953711ce6924bc2ee4a033228c
2025-04-17 12:48:52 -07:00
473a38b562 [DCP] Add logging for _stateful_to_state_dict(), stage_state_dict(), and synchronize_staging() (#151320)
Summary: As titled.

Test Plan: CI

Differential Revision: D73040700

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151320
Approved by: https://github.com/saumishr
2025-04-17 12:48:39 -07:00
c5b10ff119 [BE][Easy]: Normalize Dim typing in torch distributed (#151566)
Improve typing using prims_common dtypes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151566
Approved by: https://github.com/albanD
2025-04-17 19:30:09 +00:00
2ed2cb5805 add generalized pareto distribution (GPD) (#135968)
Add the GPD as a distribution class

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135968
Approved by: https://github.com/albanD

Co-authored-by: Alexander März <statmixedmlgit@gmail.com>
2025-04-17 18:51:02 +00:00
7e2081fa93 Optimize interpolate saturate description (#151304)
Fixes #108225

## Test Result

### Before

![image](https://github.com/user-attachments/assets/bdbf8a5c-d5a4-44a5-b81e-2cbb5b8bfd02)

### After

![image](https://github.com/user-attachments/assets/1c21a27d-1700-4661-9988-dbb1cdc81fa2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151304
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-04-17 18:34:29 +00:00
055e59e709 [bazel] Build flatbuffers within bazel (#151364)
This is similar to how we handle protobufs and it makes it more convenient for bazel users to handle their version of flatbuffers. (Flatbuffers is very picky about the generated code matching the runtime). Instead of using the checked in generated code, we generate it on the fly.

This is relevant to https://github.com/pytorch/pytorch/issues/112903, because having the version of flatbuffers tied to pytorch will make pytorch difficult to use as an external workspace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151364
Approved by: https://github.com/malfet
2025-04-17 18:33:51 +00:00
3a6b3c8e0e Combine windows x64 and arm64 yaml template files (#149850)
While introducing Windows-Arm64 nightly workflows, we created a separate template file for win-arm64. This PR combines x64&arm64 and deletes the win-arm64 one.
Fixes #148776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149850
Approved by: https://github.com/ozanMSFT, https://github.com/malfet
2025-04-17 17:58:55 +00:00
1ce7969e81 Revert "[Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404)"
This reverts commit 90c5b86cd8fcbbe6ee7c46ad17a05767f884794b.

Reverted https://github.com/pytorch/pytorch/pull/151404 on behalf of https://github.com/clee2000 due to broke a cpp extension test? test_cpp_extensions_stream_and_event.py::TestCppExtensionStreamAndEvent::test_stream_event [GH job link](https://github.com/pytorch/pytorch/actions/runs/14519277500/job/40736981315) [HUD commit link](90c5b86cd8), bad TD ([comment](https://github.com/pytorch/pytorch/pull/151404#issuecomment-2813649667))
2025-04-17 17:45:41 +00:00
ae6f6b8efb [Inductor] Remove singleton tiling splits when prefer_nd_tiling=True (#151508)
# Issue
Users who want block pointers are likely to use the config settings `{"triton.use_block_ptr": True, "triton.prefer_nd_tiling": True, "triton.max_tiles": 3}`. Among other things, these settings allow us to generate 3D block pointers for broadcasts. However, broadcasts which don't truly require 3D often end up introducing a superfluous tiling dimension of size 1.

For example, given this function with elementwise multiplication:
```
def foo(x, y, z):
    a = x * y
    b = 128.0
    c = a * b
    d = a * z
    e = x * z
    return a, c, d, e

inps = [
    torch.randn((8, 11, 128), device=self.device),
    torch.randn((128,), device=self.device),
    torch.randn((8, 11, 128), device=self.device),
]

torch.compile(foo)(*inps)
```

We get the following Triton kernels:
```
@triton.jit
def triton_poi_fused_mul_0(in_ptr0, in_ptr1, out_ptr0, znumel, ynumel, xnumel, ZBLOCK : tl.constexpr, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    znumel = 88
    ynumel = 1
    xnumel = 128
    zoffset = tl.program_id(2) * ZBLOCK
    zindex = zoffset + tl.arange(0, ZBLOCK)[:, None, None]
    zmask = zindex < znumel
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :, None]
    ymask = tl.full([ZBLOCK, YBLOCK, XBLOCK], True, tl.int1)
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[None, None, :]
    xmask = xindex < xnumel
    x1 = xindex
    z0 = zindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[88, 128], strides=[128, 1], block_shape=[ZBLOCK, XBLOCK], order=[1, 0], offsets=[zoffset, xoffset]), boundary_check=[0, 1], eviction_policy='evict_last')[:, None, :]
    tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[128], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0], eviction_policy='evict_last')[None, None, :]
    tmp2 = tmp0 * tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[88, 128], strides=[128, 1], block_shape=[ZBLOCK, XBLOCK], order=[1, 0], offsets=[zoffset, xoffset]), tl.reshape(tl.broadcast_to(tmp2, [ZBLOCK, YBLOCK, XBLOCK]), [ZBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')

@triton.jit
def triton_poi_fused_mul_1(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, xnumel, XBLOCK : tl.constexpr):
    xnumel = 11264
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[11264], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
    tmp3 = tl.load(tl.make_block_ptr(in_ptr1, shape=[11264], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
    tmp5 = tl.load(tl.make_block_ptr(in_ptr2, shape=[11264], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
    tmp1 = 128.0
    tmp2 = tmp0 * tmp1
    tmp4 = tmp0 * tmp3
    tmp6 = tmp5 * tmp3
    tl.store(tl.make_block_ptr(out_ptr0, shape=[11264], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
    tl.store(tl.make_block_ptr(out_ptr1, shape=[11264], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp4, [XBLOCK]).to(tl.float32), boundary_check=[0])
    tl.store(tl.make_block_ptr(out_ptr2, shape=[11264], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp6, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```

Note that one kernel has `ynumel=1`. The extra dimension results in more expensive address calculations, and also seems to prevent fusion.
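
To see that the size-1 dimension carries no information, consider a simplified model in which a tiling is just a list of sizes. This is not Inductor's actual data structure, and `drop_singleton_splits` is a hypothetical helper; the point is only that removing singleton entries leaves the element count unchanged.

```python
import math


def drop_singleton_splits(splits):
    """Remove size-1 entries from a tiling; keep one entry if everything was size 1."""
    kept = [s for s in splits if s != 1]
    return kept or [1]


three_d = [88, 1, 128]  # znumel, ynumel, xnumel from the first kernel above
two_d = drop_singleton_splits(three_d)

# Same number of elements, one fewer index to compute per access.
same_numel = math.prod(three_d) == math.prod(two_d)
```

The fallback to `[1]` is a design choice of this sketch so that a fully singleton tiling does not become empty.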

# Fix

To fix this, this PR filters out any splits of size 1 from the `prefer_nd_tiling` algorithm. This results in the following fused kernel, with 2D tiling:

```
@triton.jit
def triton_poi_fused_mul_0(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, out_ptr3, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    ynumel = 88
    xnumel = 128
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[:, None]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[None, :]
    xmask = xindex < xnumel
    x1 = xindex
    y0 = yindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[88, 128], strides=[128, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), boundary_check=[0, 1], eviction_policy='evict_last')
    tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[128], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0], eviction_policy='evict_last')[None, :]
    tmp5 = tl.load(tl.make_block_ptr(in_ptr2, shape=[88, 128], strides=[128, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), boundary_check=[0, 1], eviction_policy='evict_last')
    tmp2 = tmp0 * tmp1
    tmp3 = 128.0
    tmp4 = tmp2 * tmp3
    tmp6 = tmp2 * tmp5
    tmp7 = tmp0 * tmp5
    tl.store(tl.make_block_ptr(out_ptr0, shape=[88, 128], strides=[128, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), tl.broadcast_to(tmp2, [YBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])
    tl.store(tl.make_block_ptr(out_ptr1, shape=[88, 128], strides=[128, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), tl.broadcast_to(tmp4, [YBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])
    tl.store(tl.make_block_ptr(out_ptr2, shape=[88, 128], strides=[128, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), tl.broadcast_to(tmp6, [YBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])
    tl.store(tl.make_block_ptr(out_ptr3, shape=[88, 128], strides=[128, 1], block_shape=[YBLOCK, XBLOCK], order=[1, 0], offsets=[yoffset, xoffset]), tl.broadcast_to(tmp7, [YBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
```
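
The filtering step described under "Fix" can be sketched as follows. This is a minimal illustration, not the actual `prefer_nd_tiling` code: the function name `filter_unit_splits` and the fallback to `[1]` when every split has size 1 are assumptions, and the `[88, 1, 128]` input is only modeled on the kernel shapes shown above.

```python
def filter_unit_splits(splits):
    """Drop size-1 splits so they do not become extra tiling dimensions.

    Hypothetical helper: `splits` is a list of per-dimension sizes.
    """
    kept = [size for size in splits if size != 1]
    # Assumption: keep one dimension if everything was size 1,
    # so the caller still has something to tile over.
    return kept or [1]


# A 3D candidate tiling with a degenerate middle dimension collapses to 2D.
tiling_3d = [88, 1, 128]
tiling_2d = filter_unit_splits(tiling_3d)
```

With the unit dimension gone, the kernel only needs `YBLOCK` and `XBLOCK`, which matches the 2D kernel above.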

# Test plan
Added the test case above to CI. Checked that a single kernel is generated with 2D tiling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151508
Approved by: https://github.com/jansel
2025-04-17 17:37:45 +00:00
b4550541ea [ROCm] upgrade nightly wheels to rocm6.4 (#151355)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151355
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-04-17 17:29:07 +00:00
ef64beb232 Include post grad gm and fx runnable in cache artifacts for tlparse (#151469)
Fixed #151462

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151469
Approved by: https://github.com/bdhirsh
2025-04-17 17:14:13 +00:00
ee3366dbb2 [MegaCache] Encode key in base64 (#151472)
I have noticed that there are some errors like
```
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 169302: invalid start byte
```

I haven't been able to reproduce this locally yet, but this change should fix the encoding issues.
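
A minimal sketch of why base64 helps here, assuming the failure comes from decoding arbitrary binary key bytes as UTF-8. The `raw_key` value is made up for illustration and this is not the MegaCache code itself.

```python
import base64

# Hypothetical binary cache key; 0x95 is not a valid UTF-8 start byte.
raw_key = b"artifact\x95key"

try:
    raw_key.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Base64 maps arbitrary bytes to ASCII text, so the key survives a
# round trip through any text-based format.
encoded = base64.b64encode(raw_key).decode("ascii")
restored = base64.b64decode(encoded)
```
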
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151472
Approved by: https://github.com/masnesral
2025-04-17 17:12:22 +00:00
8404c09b15 [MegaCache] Rename the PGO artifact when used between different jobs (#151482)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151482
Approved by: https://github.com/bobrenjc93, https://github.com/jamesjwu
2025-04-17 17:09:29 +00:00
fe90a5c140 [Easy] Optimize clip_grad param description (#151532)
Fix missing optional description in `clip_grad_norm_` and `clip_grad_value_`

## Test Result

### Before

![image](https://github.com/user-attachments/assets/3393dd4b-a730-4dd4-8304-9b895ac669d4)

![image](https://github.com/user-attachments/assets/220c4738-a728-474b-b06d-b5be7660d150)

### After

![image](https://github.com/user-attachments/assets/5637bb68-3b6d-49a3-8ee1-3af636950aa0)

![image](https://github.com/user-attachments/assets/c0f1d966-a9ba-4fac-a874-9d4955f6e0d6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151532
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-04-17 16:47:38 +00:00
c3a18f6126 [AOTInductor] Add states for constant folding process (#151273)
Summary:
We add states in the constant folding process for AOTInductor.
Basically, there are 3 states:
(1) None: no constants are loaded and the model is uninitialized.
(2) Initialized: constants are loaded, but not yet folded.
(3) Folded: the model is fully ready with folded constants.

Note that even if constant folding is not enabled, we still only run
when the state is FOLDED. This is okay because without constant folding,
the transition from INITIALIZED to FOLDED is just a pass-through.
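
The three states can be pictured as a small state machine. The sketch below is only illustrative: `ConstantState`, `ModelSketch`, and their methods are hypothetical names, not the AOTInductor runtime API.

```python
from enum import Enum, auto


class ConstantState(Enum):  # hypothetical name
    NONE = auto()         # no constants loaded
    INITIALIZED = auto()  # constants loaded, not yet folded
    FOLDED = auto()       # ready to run


class ModelSketch:  # hypothetical stand-in for the model container
    def __init__(self, folding_enabled):
        self.folding_enabled = folding_enabled
        self.state = ConstantState.NONE

    def load_constants(self):
        self.state = ConstantState.INITIALIZED

    def fold_constants(self):
        if self.state is not ConstantState.INITIALIZED:
            raise RuntimeError("constants must be loaded before folding")
        # With folding disabled this is a pass-through: no folding work
        # happens, but the state still advances to FOLDED.
        self.state = ConstantState.FOLDED

    def run(self):
        if self.state is not ConstantState.FOLDED:
            raise RuntimeError("model can only run in the FOLDED state")
        return "ok"


model = ModelSketch(folding_enabled=False)
model.load_constants()
model.fold_constants()
result = model.run()
```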

Test Plan:
python test/inductor/test_aot_inductor.py -k test_constant_folding_with_update

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D73002538](https://our.internmc.facebook.com/intern/diff/D73002538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151273
Approved by: https://github.com/jingsh, https://github.com/desertfire
2025-04-17 16:41:38 +00:00
4843ce7611 [BE] Remove outdated script to check namespace BC (#151453)
Now that we have bc_lint in CI, this script is no longer needed (nor has it ever been conclusive). I've already updated the Runbook to not need this script.

Suppressing bc_lint as this script is not shipped as a part of torch--it is not user facing! For context, this script is (rarely) used by the release notes manager to ensure BC across releases. It has been broken since at least 2.6.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151453
Approved by: https://github.com/albanD, https://github.com/jbschlosser
2025-04-17 15:43:53 +00:00
90c5b86cd8 [Easy] Add more check for elapsedTime of torch.xxx.Event and torch.Event (#151404)
As the title stated

**Changes:**
- Add **record**, **query** and **enable_timing** check
- Add related tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151404
Approved by: https://github.com/albanD
2025-04-17 15:30:12 +00:00
7f528751cc [Inductor] fix torch._inductor.exc.InductorError: KeyError (#151424)
Fixes #151423, which is a regression after #150845

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151424
Approved by: https://github.com/eellison
2025-04-17 15:07:43 +00:00
bb11122e12 Update docker image names for s390x (#151426)
Disable switching tag for s390x docker images

Keep it that way unless they are published.
There's no way to determine in advance
which docker image names are needed
for building s390x binaries otherwise.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151426
Approved by: https://github.com/malfet, https://github.com/seemethere
2025-04-17 12:47:23 +00:00
fa6e842527 [MPS] Make fused rms_norm traceable (#150661)
Which is a regression, introduced by https://github.com/pytorch/pytorch/issues/150629#issue-2970312779 which I should have reviewed more thoroughly.

- Defined `_fused_rms_norm`, added MPS-only implementation for it and dispatch from `rms_norm_symint`,  which is registered as `CompositeImplicitAutograd`, i.e. it is not supposed to do any computations over Tensor, only dispatch to other ops
- Register `_fused_rms_norm` as a fallback in `torch/_inductor/lowering.py`
- Added unit test to avoid those regressions in the future

TODO:
- Get rid of this op, change `rms_norm_symint` definition to `CompositeExplicitAutograd` and implement backward function in `tools/autograd/derivatives.yaml`
- Benchmark compiler and re-enable decomp as follows when compiled code is faster
```python
@register_decomposition(aten._rms_norm_fused)
def rms_norm_fused(
    self: torch.Tensor, ndim: int, weight: torch.Tensor, eps: float
) -> torch.Tensor:
    dtr = [self.dim() - i - 1 for i in range(ndim)]
    return self * weight * (self.pow(2).mean(dtr, keepdim=True).add(eps).rsqrt())
```
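
For reference, the arithmetic in the decomposition above can be checked without tensors. This is a pure-Python sketch for a single 1-D vector; `rms_norm_1d` is a hypothetical helper, not part of PyTorch.

```python
import math


def rms_norm_1d(x, weight, eps):
    """x * weight * rsqrt(mean(x**2) + eps) for a 1-D list of floats."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * w * scale for v, w in zip(x, weight)]


out = rms_norm_1d([3.0, 4.0], [1.0, 1.0], eps=0.0)
# With unit weights and eps=0, the output has a root mean square of 1.
out_rms = math.sqrt(sum(v * v for v in out) / len(out))
```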

Fixes https://github.com/pytorch/pytorch/issues/150629

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150661
Approved by: https://github.com/manuelcandales, https://github.com/jansel
2025-04-17 11:32:00 +00:00
41b82611ee Revert "[Reopen] [Intel GPU] Set higher tolerance for some models only on XPU Device (#144756)"
This reverts commit 300e0ee13c08ef77e88f32204a2e0925c17ce216.

Reverted https://github.com/pytorch/pytorch/pull/144756 on behalf of https://github.com/malfet due to Broke rocm torch bench runs with  TypeError: unsupported operand type(s) for |: 'set' and 'list' ([comment](https://github.com/pytorch/pytorch/pull/144756#issuecomment-2812525970))
2025-04-17 11:09:01 +00:00
e4fe67f623 Revert "[MPS] Make fused rms_norm traceable (#150661)"
This reverts commit 682f09ec51526aefe6b504ac8081944baa866556.

Reverted https://github.com/pytorch/pytorch/pull/150661 on behalf of https://github.com/malfet due to Has decomp started to fail again ([comment](https://github.com/pytorch/pytorch/pull/150661#issuecomment-2812520408))
2025-04-17 11:06:05 +00:00
32c79da789 [Easy] Fix the compilation warning of BlasKernel. (#151302)
As the title stated.

Change Before:
```C++
[2/21] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/BlasKernel.cpp.o
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:346:6: warning: ‘void at::native::blas_impl::gemv_fast_path(const char*, const int*, const int*, const scalar_t*, const scalar_t*, const int*, const scalar_t*, const int*, const scalar_t*, scalar_t*, const int*) [with scalar_t = c10::Half]’ defined but not used [-Wunused-function]
  346 | void gemv_fast_path<at::Half>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:329:6: warning: ‘bool at::native::blas_impl::gemv_use_fast_path(char, int64_t, int64_t, scalar_t, int64_t, int64_t, scalar_t, int64_t) [with scalar_t = c10::Half]’ defined but not used [-Wunused-function]
  329 | bool gemv_use_fast_path<at::Half>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:301:6: warning: ‘void at::native::blas_impl::gemv_fast_path(const char*, const int*, const int*, const scalar_t*, const scalar_t*, const int*, const scalar_t*, const int*, const scalar_t*, scalar_t*, const int*) [with scalar_t = c10::BFloat16]’ defined but not used [-Wunused-function]
  301 | void gemv_fast_path<at::BFloat16>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/BlasKernel.cpp:273:6: warning: ‘bool at::native::blas_impl::gemv_use_fast_path(char, int64_t, int64_t, scalar_t, int64_t, int64_t, scalar_t, int64_t) [with scalar_t = c10::BFloat16]’ defined but not used [-Wunused-function]
  273 | bool gemv_use_fast_path<at::BFloat16>(
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151302
Approved by: https://github.com/malfet, https://github.com/aditew01
ghstack dependencies: #151427
2025-04-17 10:50:22 +00:00
f29fe78cf2 [Dynamo] Implement sourceless named tuple support (#151266)
Fixes https://github.com/pytorch/pytorch/issues/140903

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151266
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi, https://github.com/anijain2305
2025-04-17 08:43:03 +00:00
49c91b4be9 [Easy][Building] Fix the warning of int4mm.cu when building (#151427)
As the title stated.

**Changes Before:**

```C++
[999/1526] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/int4mm.cu.o
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/cuda/int4mm.cu(142): warning #177-D: variable "at::native::kWarpSize" was declared but never referenced
  constexpr int32_t kWarpSize = 32;
                    ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151427
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-04-17 08:21:32 +00:00
a05cc9f494 Remove Clear Cache Time from do_bench_using_profiling (#150696)
Summary: In most instances, this action would take ~33% of the total run time, which means that our benchmark would previously differ from the end results by a lot.

Test Plan:
We can compare the benchmark results for
```
CUDA_VISIBLE_DEVICES=4,5 buck run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100a //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-snapshot-id=672308665_0 --lower-backend=AOT_INDUCTOR --node-replacement-dict="{'torch.nn.Linear':{'(autotune)': 'fp8_float_model_dynamic_quantization_rowwise'}}" --trace-aot-inductor-module=True --disable-acc-tracer=False --batch-size=1024
```
before and after the diff, and notice that on average, the benchmark results decrease by ~0.1ms per iteration, which is more closely aligned with the lowered modules.

Differential Revision: D72469845

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150696
Approved by: https://github.com/frank-wei
2025-04-17 07:25:41 +00:00
e0f05229e9 [ez] Make relaxed constraint error message more user friendly (#151407)
Fixes #151356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151407
Approved by: https://github.com/Skylion007
2025-04-17 06:43:10 +00:00
10a54ffe5a [inductor] Reduce runtime of CPU OpInfo tests (#151435)
`has_triton()` returns True if Triton is present on the system and supports _any_ backend we care about. In this case, that means we _always_ check gradients, even though the intended behavior is to skip gradients when testing on CPU.

Fixes a bug from #146911.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151435
Approved by: https://github.com/masnesral
2025-04-17 05:25:14 +00:00
b7d9f44602 [executorch hash update] update the pinned executorch hash (#151493)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151493
Approved by: https://github.com/pytorchbot
2025-04-17 05:14:12 +00:00
682f09ec51 [MPS] Make fused rms_norm traceable (#150661)
Which is a regression, introduced by https://github.com/pytorch/pytorch/issues/150629#issue-2970312779 which I should have reviewed more thoroughly.

- Defined `_fused_rms_norm`, added MPS-only implementation for it and dispatch from `rms_norm_symint`,  which is registered as `CompositeImplicitAutograd`, i.e. it is not supposed to do any computations over Tensor, only dispatch to other ops
- Register `_fused_rms_norm` as a fallback in `torch/_inductor/lowering.py`
- Added unit test to avoid those regressions in the future

TODO:
- Get rid of this op, change `rms_norm_symint` definition to `CompositeExplicitAutograd` and implement backward function in `tools/autograd/derivatives.yaml`
- Benchmark compiler and re-enable decomp as follows when compiled code is faster
```python
@register_decomposition(aten._rms_norm_fused)
def rms_norm_fused(
    self: torch.Tensor, ndim: int, weight: torch.Tensor, eps: float
) -> torch.Tensor:
    dtr = [self.dim() - i - 1 for i in range(ndim)]
    return self * weight * (self.pow(2).mean(dtr, keepdim=True).add(eps).rsqrt())
```

Fixes https://github.com/pytorch/pytorch/issues/150629

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150661
Approved by: https://github.com/manuelcandales, https://github.com/jansel
2025-04-17 04:15:24 +00:00
17ea9d1478 Revert "[DCP] Add logging for _stateful_to_state_dict(), stage_state_dict(), and synchronize_staging() (#151320)"
This reverts commit fb6ac2f16132f7953711ce6924bc2ee4a033228c.

Reverted https://github.com/pytorch/pytorch/pull/151320 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/151320#issuecomment-2811669325))
2025-04-17 03:57:03 +00:00
a94483329c [MPS] Start benchmarking compile results (#151155)
To track pass rate and speedup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151155
Approved by: https://github.com/dcci
2025-04-17 02:45:39 +00:00
f5851efed9 Fix torch.autograd.backward inputs validation (#150975)
- Fixes #150883
- Fixes #70504

This is my first PR to pytorch, so please tell me if I'm forgetting anything.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150975
Approved by: https://github.com/soulitzer
2025-04-17 02:11:13 +00:00
6f9ffaa991 [c10d][fr] Fix script for uneven reduce scatter and update test cases (#151475)
Somehow the type string for reduce scatter is "REDUCE_SCATTER" not "REDUCESCATTER". This PR fixed it and added more test cases.

Differential Revision: [D73141245](https://our.internmc.facebook.com/intern/diff/D73141245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151475
Approved by: https://github.com/fegin
2025-04-17 02:11:08 +00:00
cd1db55817 Fix tensor_constant name collision in aot_export_module (#151123)
Summary:
When we have an exported program that looks like this:

```
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, b__tensor_constant0: "f32[1]", ... c_lifted_tensor_0: "i64[925]", …. , tupleized_input_0_0: "f32[10, 2139]",

            clone: "i64[925]" = torch.ops.aten.clone.default(c_lifted_tensor_0);  c_lifted_tensor_0 = None

            index_select: "f32[10, 925]" = torch.ops.aten.index_select.default(tupleized_input_0_0, 1, clone);  clone = None
```

The graph after `aot_export_module` could have a name collision. Notice that the `_tensor_constant0` arg of `clone` is different from the `_tensor_constant0` in the input module.

```
def forward(self):
        arg9_1: "f32[10, 2139]"

        _tensor_constant0: "f32[1]" = self._tensor_constant0 # this should be int64, conflicted with the original _tensor_constant0, had a clone on this constant before lifting

        index: "f32[10, 925]" = torch.ops.aten.index.Tensor(arg9_1, [None, _tensor_constant0]);  _tensor_constant0 = None
```

This caused the `tensors used as indices must binary, int...` aoti error on PT2I dashboard because later we used `clone` as index.

We had this error because we created a new `_tensor_constant0` [here](https://github.com/pytorch/pytorch/blob/main/torch/fx/_symbolic_trace.py#L403-L412), and the new `_tensor_constant0` overrides the original `_tensor_constant0` on the input Module in `_unlift_graph`. The `arg` for `clone` is created at `create_proxy` in `proxy.py`.

To fix this, we do a graph pass before we unlift the graph inputs to avoid the name collision.
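
The renaming idea behind such a pass can be sketched in plain Python. This is an assumption-laden illustration, not the actual pass: `dedupe_names` and the numeric suffix scheme are hypothetical.

```python
def dedupe_names(existing, new_names):
    """Map each new name to one that does not clash with `existing`.

    Hypothetical helper: `existing` holds attribute names already on the
    input module, `new_names` holds names the tracer wants to create.
    """
    taken = set(existing)
    renamed = {}
    for name in new_names:
        candidate, suffix = name, 0
        while candidate in taken:
            suffix += 1
            candidate = f"{name}_{suffix}"
        taken.add(candidate)
        renamed[name] = candidate
    return renamed


mapping = dedupe_names(
    existing={"_tensor_constant0"},
    new_names=["_tensor_constant0", "_tensor_constant1"],
)
```

Only the clashing name is changed, so the original `_tensor_constant0` on the input module is never overwritten.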

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile_constant_folding

buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r aoti_constant_tensor_name_collision
```

Differential Revision: D72761937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151123
Approved by: https://github.com/tugsbayasgalan, https://github.com/jingsh
2025-04-17 01:52:21 +00:00
bf92c9883b Refine host caching allocator (#151403)
# Motivation
This stack of PRs aims to generalize and improve PyTorch host allocator code.

This PR introduces a `DeleterFnPtr` template parameter to `CachingHostAllocatorInterface` to resolve circular dependency issues. This change allows for better code reuse and simplifies the implementation of host allocators.

# Additional Context
TODO:
- [ ] Unify host allocator related API
- [ ] Deprecate those device-specific legacy API
- [ ] Move `is_pinned` to host allocator

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151403
Approved by: https://github.com/gujinghui, https://github.com/albanD
2025-04-17 01:50:47 +00:00
fb6ac2f161 [DCP] Add logging for _stateful_to_state_dict(), stage_state_dict(), and synchronize_staging() (#151320)
Summary: As titled.

Test Plan: CI

Differential Revision: D73040700

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151320
Approved by: https://github.com/saumishr
2025-04-17 01:08:32 +00:00
300e0ee13c [Reopen] [Intel GPU] Set higher tolerance for some models only on XPU Device (#144756)
Reopen the previous stale closed PR https://github.com/pytorch/pytorch/pull/134192

We need to increase the tolerance slightly to ensure that certain models pass accuracy check on the XPU device.
This pull request preserves the original tolerance threshold for the CUDA device and introduces a new key higher_fp16_bf16_xpu, which only impacts the XPU device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144756
Approved by: https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/desertfire
2025-04-17 00:26:55 +00:00
2fd26925c4 improve noop elimination for view (#151095)
This PR improves noop elimination.

### View Noop

```python
>>> torch.Size([1,2,3]) == [1,2,3]
False
>>> torch.Size([1,2,3]) == (1,2,3)
True
```
So we add `tuple(size)` in `view_noop`.
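
The same comparison behavior can be reproduced without PyTorch, since `torch.Size` is a tuple subclass. In this sketch `Size` is a stand-in class and `view_is_noop` is a hypothetical helper, not the Inductor `view_noop` function.

```python
class Size(tuple):
    """Stand-in for torch.Size, which subclasses tuple."""


def view_is_noop(current_size, requested_size):
    # Normalizing both sides to tuple makes a list argument compare equal.
    return tuple(current_size) == tuple(requested_size)


naive = Size([1, 2, 3]) == [1, 2, 3]             # tuple vs. list comparison
fixed = view_is_noop(Size([1, 2, 3]), [1, 2, 3])  # compared as tuples
```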

Example:
```python
import torch

@torch.compile()
def f(x):
    batch_size = x.shape[0]
    x = x.transpose(1, 2) # (batch_size, 2, 3)
    x = x.reshape(batch_size, 2, 3) # noop
    return x

x = torch.randn((2,3,2))
f(x)

x = torch.randn((4,3,2))
f(x)
```

Before:
![image](https://github.com/user-attachments/assets/be488881-6c99-43a9-b088-fa481f675775)

After:
![image](https://github.com/user-attachments/assets/6d93be3d-128b-44d4-ad6a-d3d18e272329)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151095
Approved by: https://github.com/eellison
2025-04-16 23:55:32 +00:00
9a2624c712 Fix keepdim param optional description (#151197)
Fixes #151104

Fix the optional description of `dim` and `keepdim`, except for `torch.quantile`, which was already fixed in #146485

## Test Result

### Before

![image](https://github.com/user-attachments/assets/69f1824d-3d15-407e-8c92-f25a22e16914)

### After

![image](https://github.com/user-attachments/assets/e5aac674-ab8f-4988-a5f1-7400c36bdc99)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151197
Approved by: https://github.com/mikaylagawarecki
2025-04-16 23:15:30 +00:00
9e6ad274dc Action for building docker binary builds (#151471)
This is part of splitting up https://github.com/pytorch/pytorch/pull/150558 into smaller chunks, please see that for more context

Uses calculate docker image with the new custom tag prefix, so the naming convention of the docker images is slightly different for images built on PR

based off of a582f04608/.github/workflows/build-manywheel-images.yml (L101)

Also moves the push of the docker images from inside the build scripts to inside the workflow

Currently not used anywhere, but the binary docker builds are very similar so I'm going to change them to use this instead

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151471
Approved by: https://github.com/malfet, https://github.com/seemethere, https://github.com/ZainRizvi
2025-04-16 23:01:35 +00:00
cd7bc60e11 Migrate to new theme (#149331)
- Migrate pytorch docs, cpp docs and functorch docs to the pytorch_sphinx_theme2
- Migrate index.rst to markdown and restructure to use high-level horizontal bar sections Python API, Developer Notes
- Added python-api.md which becomes the main container for the API docs. This file will be used to add all api references in the toctree. It would be great to have lint for this file: https://github.com/pytorch/pytorch/issues/150718
- Enabled mermaid sphinx extension and opengraph sphinx extension

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149331
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/albanD
2025-04-16 21:35:19 +00:00
1ffaa00ad7 [MPS] Migrate bitwise_not to unary operator (#151460)
That kills two birds with one stone:
 - Makes implementations more standardized (and faster for strided inputs/outputs)
 - Fixes a bug in strided in-place `bitwise_not`

I.e. before this change
```python
import torch
x=torch.arange(32, device="mps")
x[::2].bitwise_not_()
print(x)
```
produced
```
tensor([ -1,  -2,  -3,  -4,  -5,  -6,  -7,  -8,  -9, -10, -11, -12, -13, -14,
        -15, -16,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,
         28,  29,  30,  31], device='mps:0')
```
after, it generates reasonable output
```
tensor([ -1,   1,  -3,   3,  -5,   5,  -7,   7,  -9,   9, -11,  11, -13,  13,
        -15,  15, -17,  17, -19,  19, -21,  21, -23,  23, -25,  25, -27,  27,
        -29,  29, -31,  31], device='mps:0')
```
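
The expected result can be cross-checked in plain Python, assuming the strided in-place call should negate only the even-indexed elements. This uses an ordinary list rather than an MPS tensor.

```python
values = list(range(32))

# Equivalent of x[::2].bitwise_not_(): flip every second element in place.
for i in range(0, len(values), 2):
    values[i] = ~values[i]
```

The odd-indexed elements stay untouched, which matches the corrected output above.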
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151460
Approved by: https://github.com/dcci, https://github.com/qqaatw, https://github.com/Skylion007
2025-04-16 21:34:45 +00:00
f252f9df5e Revert "[Openreg][PrivateUse1] Enable CI for openreg (#151007)"
This reverts commit abbca37fe882541e0259b43dd314a324180550ed.

Reverted https://github.com/pytorch/pytorch/pull/151007 on behalf of https://github.com/clee2000 due to At least test_record_event needs to also be skipped on dynamo too, its failing and then somehow causing a hang? https://github.com/pytorch/pytorch/actions/runs/14487625709/job/40637535027#step:25:73 ([comment](https://github.com/pytorch/pytorch/pull/151007#issuecomment-2810789483))
2025-04-16 21:05:17 +00:00
e0535e823f Revert "[Openreg][PrivateUse1] Fix releasing tensor issue when using pin_memory (#151091)"
This reverts commit e229ce34c4ab8cd4e2800227615be32fb362b1e6.

Reverted https://github.com/pytorch/pytorch/pull/151091 on behalf of https://github.com/clee2000 due to At least test_record_event needs to also be skipped on dynamo too, its failing and then somehow causing a hang? https://github.com/pytorch/pytorch/actions/runs/14487625709/job/40637535027#step:25:73 ([comment](https://github.com/pytorch/pytorch/pull/151007#issuecomment-2810789483))
2025-04-16 21:05:17 +00:00
5b5399bfcd [graph partition] reorder to reduce #partitions for simple dependencies (#150814)
This PR reduces #graph partitions by reordering nodes when the `should_partition` nodes have simple dependencies. Specifically, for `should_partition` nodes:
    a. If a node has no dependency or only depends on graph inputs: move to the front. Use case is when we move symints to cuda tensor for PaddedTensorSubclass
    b. If the only user of a node is OutputNode: move it to the end.
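
The two reordering rules can be sketched as follows. This is a simplified model, not the scheduler code: `Node`, `reorder`, the `users` mapping, and the `"OUTPUT"` marker are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:  # hypothetical, simplified scheduler node
    name: str
    deps: list = field(default_factory=list)  # names this node reads
    should_partition: bool = False


def reorder(nodes, graph_inputs, users):
    """Apply rules (a) and (b); `users` maps node name -> list of user names."""
    front, middle, back = [], [], []
    for node in nodes:
        if node.should_partition and all(d in graph_inputs for d in node.deps):
            front.append(node)   # rule (a): no deps, or only graph inputs
        elif node.should_partition and users.get(node.name) == ["OUTPUT"]:
            back.append(node)    # rule (b): only consumed by the output
        else:
            middle.append(node)
    return front + middle + back


nodes = [
    Node("add", deps=["x"]),
    Node("mask", deps=["s0"], should_partition=True),
    Node("fill", deps=["add", "mask"]),
]
users = {"add": ["fill"], "mask": ["fill"], "fill": ["OUTPUT"]}
order = [n.name for n in reorder(nodes, {"x", "s0"}, users)]
```

Moving `mask` to the front leaves `add` and `fill` adjacent, so they no longer need to be split into separate partitions around it.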

#### Example

The following example shows a padded tensor subclass use case where we copy symint to a cuda tensor (aka mask) in the middle of function. Reordering still generates 1 cudagraph by moving the mask to the front.

```python
import torch

torch._inductor.config.graph_partition = True

# Two reasons for this:
# 1. We want to reuse the same mask for many masked_fill calls
# 2. Prevent inductor from fusing this op into other ops (e.g. masked_fill)
#    so we can still reorder in scheduler
@torch.library.custom_op("mylib::create_mask", mutates_args=(), tags=(torch._C.Tag.cudagraph_unsafe,))
def create_mask(padded_size: int, original_size: int, device: torch.device) -> torch.Tensor:
    mask = torch.zeros((padded_size,), dtype=torch.bool, device=device)
    mask[original_size:] = True
    return mask

@create_mask.register_fake
def _(padded_size, original_size, device):
    return torch.empty((padded_size,), dtype=torch.bool, device=device)

def f(padded_tensor, original_tensor, weight):
    original_size = original_tensor.size()[0]
    padded_size = padded_tensor.size()[0]

    # elementwise op, so we don't care about the padding value
    padded_tensor = padded_tensor + 1
    padded_tensor = torch.nn.functional.relu(padded_tensor)

    # dot product requires padding with 0
    dot_res = padded_tensor.dot(weight)
    padded_tensor += dot_res

    # min requires padding with inf, so we create mask now
    mask = create_mask(padded_size, original_size, padded_tensor.device)
    min_res = torch.min(
        torch.ops.aten.masked_fill(padded_tensor, mask, float("inf"))
    )

    # max requires padding with -inf. we can reuse previous mask
    max_res = torch.max(
        torch.ops.aten.masked_fill(padded_tensor, mask, -float("inf"))
    )

    return min_res+max_res+padded_tensor

compiled_f = torch.compile(f, mode="reduce-overhead")

def run(padded_size, original_size):
    padded_tensor = torch.randn(padded_size, device="cuda")
    padded_tensor[original_size:] = 0
    original_tensor = torch.randn(original_size, device="meta")

    weight = torch.randn(padded_size, device="cuda")
    eager_out = f(padded_tensor, original_tensor, weight)
    compiled_out = compiled_f(padded_tensor, original_tensor, weight)
    assert torch.allclose(eager_out[0], compiled_out[0])
    assert torch.allclose(eager_out[1], compiled_out[1])

# new cudagraph
run(8, 4)

# new cudagraph due to recompile
run(8, 6)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150814
Approved by: https://github.com/eellison
2025-04-16 20:49:20 +00:00
a582f04608 Revert "[ez] Make relaxed constraint error message more user friendly (#151407)"
This reverts commit bc934f57d7c14b07e7497eb72a90d893270bc662.

Reverted https://github.com/pytorch/pytorch/pull/151407 on behalf of https://github.com/izaitsevfb due to breaks export tests ([comment](https://github.com/pytorch/pytorch/pull/151407#issuecomment-2810716135))
2025-04-16 20:40:22 +00:00
607443b16b [compile][compile time traces] Add more dynamo traces (#151357)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151357
Approved by: https://github.com/williamwen42
ghstack dependencies: #151330, #151256
2025-04-16 20:37:08 +00:00
8e373592c8 [aot autograd][logging] Profile large missing gaps in compile time tracing (#151256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151256
Approved by: https://github.com/bdhirsh, https://github.com/masnesral
ghstack dependencies: #151330
2025-04-16 20:37:08 +00:00
c58b3f6be3 [invoke_subgraph][inductor] Run pre and post grad passes on invoke_subgraph (#151330)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151330
Approved by: https://github.com/eellison, https://github.com/zou3519
2025-04-16 20:37:01 +00:00
4c4a5df73b Allow to run flex_attention on HPU (#148656)
HPU specific implementation details are to be located in out-of-tree HPU library.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148656
Approved by: https://github.com/drisspg
2025-04-16 19:49:15 +00:00
9400f53903 [Inductor] Broadcast to range tree shape before block pointer store (#151399)
# Feature

This fixes a bug related to block pointer stores. Since Triton's block pointer stores don't support implicit broadcasting, in certain cases we need to generate a `reshape->broadcast->reshape` pattern to ensure that the tensor being stored has the same shape as the block pointer. This happens when the block indexing expression involves strides of 0 or dimensions of 1, both of which we eliminate from the block pointer.

The existing logic missed an important edge case.  We may need a broadcast prior to the first `reshape` of this pattern, in case the tensor comes from a load with implicit broadcasting. For example, if the range trees have shape `[YBLOCK, XBLOCK]`, but the load has a shape `[1, XBLOCK]`, we need to broadcast this to `[YBLOCK, XBLOCK]` prior to storing. See the example kernel below, which comes from `expand` -> `clone` with 3D tiling. The load has an implicit broadcast, and the store has a reshape. Thus, we need to insert an explicit broadcast between them.

```
@triton.jit
def triton_poi_fused_clone_0(in_ptr0, out_ptr0, znumel, ynumel, xnumel, ZBLOCK : tl.constexpr, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    znumel = 32
    ynumel = 1
    xnumel = 32
    zoffset = tl.program_id(2) * ZBLOCK
    zindex = zoffset + tl.arange(0, ZBLOCK)[:, None, None]
    zmask = zindex < znumel
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :, None]
    ymask = tl.full([ZBLOCK, YBLOCK, XBLOCK], True, tl.int1)
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[None, None, :]
    xmask = xindex < xnumel
    x1 = xindex
    z0 = zindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[32], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0], eviction_policy='evict_last')[None, None, :]
    tl.store(tl.make_block_ptr(out_ptr0, shape=[32, 32], strides=[32, 1], block_shape=[ZBLOCK, XBLOCK], order=[1, 0], offsets=[zoffset, xoffset]), tl.reshape(tl.broadcast_to(tmp0, [ZBLOCK, YBLOCK, XBLOCK]), [ZBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])
```

The tricky part is that we don't want to emit redundant broadcasts in the store. This PR reworks the logic a bit to make sure we don't emit a second broadcast unless it actually changes the shape.
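As a rough illustration of the "broadcast only when it changes the shape" rule, the sketch below builds the store expression as a string. The helper names `fmt` and `emit_store_value` and their arguments are hypothetical, shapes are lists of symbolic dimension names, and this is a simplification rather than Inductor's actual codegen.

```python
def fmt(shape):
    """Render a symbolic shape such as ["ZBLOCK", "XBLOCK"] in list syntax."""
    return "[" + ", ".join(shape) + "]"


def emit_store_value(value, value_shape, range_tree_shape, block_shape):
    """Hypothetical helper: wrap `value` so its shape matches the block pointer."""
    expr, shape = value, list(value_shape)
    # Broadcast to the range tree shape only if that actually changes the shape.
    if shape != list(range_tree_shape):
        expr = f"tl.broadcast_to({expr}, {fmt(range_tree_shape)})"
        shape = list(range_tree_shape)
    # Reshape only if the block pointer dropped dimensions from the range trees.
    if shape != list(block_shape):
        expr = f"tl.reshape({expr}, {fmt(block_shape)})"
    return expr


full = ["ZBLOCK", "YBLOCK", "XBLOCK"]
# Load with implicit broadcasting: one broadcast, then one reshape.
broadcasted = emit_store_value("tmp0", ["1", "1", "XBLOCK"], full, ["ZBLOCK", "XBLOCK"])
# Value already has the range tree shape: no broadcast is emitted.
plain = emit_store_value("tmp0", full, full, ["ZBLOCK", "XBLOCK"])
print(broadcasted)
print(plain)
```

The first expression has the same shape as the value passed to `tl.store` in the kernel above (minus the dtype cast); the second shows that no redundant broadcast appears when the shapes already agree.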

# Test plan

Added a CI test for this case, which would fail on trunk. Checked that only one broadcast was emitted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151399
Approved by: https://github.com/jansel, https://github.com/eellison
2025-04-16 19:03:40 +00:00
eqy
17bf59340c [cuSPARSE][B200] Bump tolerances for test_sparse_csr matvec (#148721)
Small tolerance bump for blackwell (appears to use same kernel as prev. arches)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148721
Approved by: https://github.com/nWEIdia, https://github.com/ngimel
2025-04-16 18:44:18 +00:00
1f29190b59 [dynamo] unimplemented -> unimplemented_v2 in variables/builtin.py (#151145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151145
Approved by: https://github.com/Skylion007, https://github.com/StrongerXi, https://github.com/jansel, https://github.com/zou3519
2025-04-16 17:16:05 +00:00
bc934f57d7 [ez] Make relaxed constraint error message more user friendly (#151407)
Fixes #151356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151407
Approved by: https://github.com/Skylion007
2025-04-16 17:00:06 +00:00
cedcdda0ed Add ccode for CeilToInt and IntTrueDiv (#151375)
Summary: As titled

Test Plan: Test in D73052653 -- shape calculator generates successfully

Differential Revision: D73073845

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151375
Approved by: https://github.com/kalpit-meta-1, https://github.com/Skylion007
2025-04-16 16:47:55 +00:00
6a3a6d22dc Revert "[dynamo] context manager/decorator for dynamo config patching during tracing (#150586)"
This reverts commit 40ce4fb24a536d175348df876f61956d4945778e.

Reverted https://github.com/pytorch/pytorch/pull/150586 on behalf of https://github.com/clee2000 due to broke some inductor tests? inductor/test_fuzzer.py::TestConfigFuzzer::test_config_fuzzer_dynamo_bisect [GH job link](https://github.com/pytorch/pytorch/actions/runs/14486513628/job/40635178179) [HUD commit link](40ce4fb24a), bad TD ([comment](https://github.com/pytorch/pytorch/pull/150586#issuecomment-2810064322))
2025-04-16 16:13:47 +00:00
0c77af3576 [MPSInductor] Add pow, log2 and FloorToInt ops (#151449)
That enables `test_pow_by_natural_log2_dynamic_shapes_mps`

Not sure why log2 printer function suffix is `OpaqueUnaryFn_log2`, rather than just `log2`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151449
Approved by: https://github.com/jansel
2025-04-16 15:56:21 +00:00
e229ce34c4 [Openreg][PrivateUse1] Fix releasing tensor issue when using pin_memory (#151091)
As the title stated.

Related PR: https://github.com/pytorch/pytorch/pull/147066

Co-authored-by: Zhenbin Lin <lin-zhenbin@qq.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151091
Approved by: https://github.com/albanD
ghstack dependencies: #151005, #151007
2025-04-16 13:12:17 +00:00
c7400d0026 [inductor][comms] skip reorder_for_locality for wait nodes (#150074)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150074
Approved by: https://github.com/eellison, https://github.com/bdhirsh
ghstack dependencies: #150258
2025-04-16 10:18:33 +00:00
159d8a14a6 [inductor][comms] fix node_summary for composite scheduler nodes (#150258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150258
Approved by: https://github.com/yf225
2025-04-16 10:18:33 +00:00
41c97a72a1 [export] Add draft-export to error msg (#151065)
Given an exception in torch.export, I want to try/catch it to add the message "hey try out draft-export!". Currently I only add this message for errors that draft-export is known to fix, like DataDependentErrors, ConstraintViolationErrors, and no fake impl.

Originally the error message looks like:
```
  File "/data/users/angelayi/pytorch/torch/_library/custom_ops.py", line 626, in fake_impl
    raise RuntimeError(
RuntimeError: There was no fake impl registered for <CustomOpDef(mylib::foo2)>. This is necessary for torch.compile/export/fx tracing to work. Please use `foo2_impl.register_fake` to add an fake impl.
```

Now, the error message looks something like:
```
  File "/data/users/angelayi/pytorch/torch/_library/custom_ops.py", line 626, in fake_impl
    raise RuntimeError(
RuntimeError: There was no fake impl registered for <CustomOpDef(mylib::foo2)>. This is necessary for torch.compile/export/fx tracing to work. Please use `foo2_impl.register_fake` to add an fake impl.

The error above occurred when calling torch.export.export. If you would like to view some more information about this error, and get a list of all other errors that may occur in your export call, you can rerun your program with the `DRAFT_EXPORT=1` envvar, or replace your `export()` call with `draft_export()`.
```

In Python versions >= 3.11, we can use `exception.add_note` to add to the error message. For earlier versions, I used a workaround that modifies `e.args`.
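A minimal sketch of that version split, assuming a hypothetical helper named `add_hint` and a shortened hint string (neither is taken from the PR):

```python
import sys


def add_hint(exc, hint):
    """Attach `hint` to `exc` (hypothetical helper, not the code from the PR)."""
    if sys.version_info >= (3, 11):
        # Notes are printed after the message in the traceback.
        exc.add_note(hint)
    elif exc.args and isinstance(exc.args[0], str):
        # Older Pythons: append the hint to the first message argument.
        exc.args = (f"{exc.args[0]}\n\n{hint}", *exc.args[1:])
    else:
        exc.args = (*exc.args, hint)
    return exc


hint = "Try rerunning with draft_export()."
try:
    raise RuntimeError("There was no fake impl registered")
except RuntimeError as e:
    annotated = add_hint(e, hint)
```

With `add_note` the original `args` stay untouched, which is why it is preferable where available; the fallback changes what `str(e)` returns.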

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151065
Approved by: https://github.com/pianpwk
ghstack dependencies: #151051
2025-04-16 08:56:02 +00:00
84e633e09d [export] Make draft-export predispatch=True by default (#151051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151051
Approved by: https://github.com/pianpwk
2025-04-16 08:56:02 +00:00
a5c61668d7 fix ambiguous error message (#150086)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150086
Approved by: https://github.com/anijain2305
2025-04-16 08:48:05 +00:00
0a489f924d Fix: missing () in generated runtime assert c++ code (#151171)
Addresses one of the issues in https://github.com/pytorch/pytorch/issues/151127.

The generated code used to be
not a==5 or b==5

It should be
not (a==5 or b==5)
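The commit concerns generated C++, but the message writes the expressions in a Python-like form, so the sketch below uses plain Python to show why the parentheses matter. The values of `a` and `b` are arbitrary choices for illustration.

```python
a, b = 1, 5

# `not` binds tighter than `or`, so this parses as (not a == 5) or (b == 5).
without_parens = not a == 5 or b == 5
# The intended assertion negates the whole disjunction.
with_parens = not (a == 5 or b == 5)

print(without_parens, with_parens)
```

The two expressions disagree for this input, so a runtime assert generated without the parentheses checks a different condition than the one intended.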

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151171
Approved by: https://github.com/aorenste, https://github.com/eellison
ghstack dependencies: #151170
2025-04-16 08:10:17 +00:00
55595e0c85 Fix Issues in deferring runtime assertions. (#151170)
This PR fixes two bugs:
1) Update `self.bound_unbacked_symbols` before emitting runtime asserts, so that runtime asserts depending on the current node are included.

2) In the pass that removes unused graph inputs, do not remove symbols that are used by runtime assertions.

Address some of the issues in https://github.com/pytorch/pytorch/issues/151127

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151170
Approved by: https://github.com/bobrenjc93, https://github.com/eellison
2025-04-16 08:10:17 +00:00
abbca37fe8 [Openreg][PrivateUse1] Enable CI for openreg (#151007)
Changes:
- move test_openreg.py from test/cpp_extensions/open_registration_extension/ to test/
- update README.md for openreg
- enable CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151007
Approved by: https://github.com/albanD
ghstack dependencies: #151005
2025-04-16 07:55:51 +00:00
a9dbbe1aee [OpenReg][PrivateUse1] Refactoring the csrc files of pytorch_openreg (#151005)
As the title stated.

**Changes:**
- Remove unnecessary header file
- Remove unnecessary registry logic around PrivateUse1HooksRegistry, such as TORCH_DECLARE_REGISTRY, C10_DEFINE_REGISTRY, etc.
- Use a static + global variable for initialization instead of call_once

**Next Step:**
Enable test_openreg.py in CI/CD to guard the quality of PrivateUse1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151005
Approved by: https://github.com/albanD
2025-04-16 07:55:50 +00:00
40ce4fb24a [dynamo] context manager/decorator for dynamo config patching during tracing (#150586)
Implement traceable config patching for Dynamo: enables restricted patching of Dynamo config where user can use a context manager/decorator to change tracing behavior for parts of the code.

The new `dont_skip_tracing` decorator/context manager for ignoring most trace rules is easily implemented with this more generic traceable config patching feature.

Implementation:
- Create a new specialized context manager class representing a wrapper around torch._dynamo.config.patch
- Dynamo doesn't trace into the context manager but updates config at compile time
- Correctness is based on our correctness for handling supported context managers
- Implementation is inspired by how `GradModeVariable` is implemented.
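For readers unfamiliar with objects that work both as a context manager and as a decorator, here is a plain-Python sketch of the pattern. The `Config` class and `patch` helper are hypothetical stand-ins; Dynamo's version is handled at compile time during tracing and is not implemented this way.

```python
import contextlib


class Config:
    """Hypothetical stand-in for a module-level config object."""

    verbose = False


class patch(contextlib.ContextDecorator):
    """Temporarily override Config attributes; usable with `with` or as `@patch(...)`."""

    def __init__(self, **changes):
        self.changes = changes
        self.saved = {}

    def __enter__(self):
        for key, value in self.changes.items():
            self.saved[key] = getattr(Config, key)
            setattr(Config, key, value)
        return self

    def __exit__(self, *exc_info):
        # Restore the previous values even if the body raised.
        for key, value in self.saved.items():
            setattr(Config, key, value)
        self.saved.clear()
        return False


@patch(verbose=True)
def decorated_call():
    return Config.verbose


with patch(verbose=True):
    inside = Config.verbose
after = Config.verbose
decorated = decorated_call()
```

Both forms scope the override to a region of code and restore the previous value afterwards.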

Previous attempts: https://github.com/pytorch/pytorch/pull/148736 (decorator-only global approach) and https://github.com/pytorch/pytorch/pull/149439 (decorator-only traceback approach)

See https://docs.google.com/document/d/1vWNwKL_jpg-PLopifcaSa338wks3GqSVF4GHRguybGg/edit?tab=t.0 for more details on implementation - including previous approaches.

NOTE: this PR fixes a bug where skipped code objects were not tracked by convert_frame.py, leading to cases where code objects would be automatically skipped even after `torch._dynamo.reset()`. This exposed some latent dynamo-wrapped test failures in CI that previously passed in CI but not locally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150586
Approved by: https://github.com/jansel, https://github.com/zou3519, https://github.com/anijain2305
2025-04-16 06:49:58 +00:00
daf2ccf023 [custom ops] Fix destroy function (#151299)
Summary:
D72906445 seemed to cause a SIGABRT when running the test in the test plan. The change I narrowed it down to was where in fake_impls the [`deregister_fake_kernel` no longer calls `lib.destroy`](https://github.com/pytorch/pytorch/pull/150806/files#diff-7fd3f4222276c63b91f3a895530bb5efe137fd23165b48f25afcf3c06a5d2a8fL65-L69).

Calling `lib.destroy` in that handle results in a maximum recursion error where someone calls library.destroy which calls the handle which calls back to library.destroy.

So I compared the implementation of this _del_library and lib.destroy and it seemed like the main thing different was deleting `self.m`. So adding that fixed my issue!

Side note, I feel like we can combine `_del_library` and `library._destroy`? But I won't do it in this diff to make sure we don't break too many things 😅

Test Plan:
`buck test 'fbcode//mode/opt' fbcode//aiplatform/gmpp/bulk_eval/reader/service/tests:reader_service_handler_tests -- --exact 'aiplatform/gmpp/bulk_eval/reader/service/tests:reader_service_handler_tests - aiplatform.gmpp.bulk_eval.reader.service.tests.reader_service_handler_tests.ReaderServiceHandlerTests: test_add_preproc_output_into_queue'`
https://www.internalfb.com/intern/testinfra/testrun/10977524170296078

Differential Revision: D73017613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151299
Approved by: https://github.com/zou3519
2025-04-16 06:18:09 +00:00
585d03fa39 Record how many parameters we're parsing within dynamo (#148508)
This allows us to track how many parameters we have in compilations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148508
Approved by: https://github.com/jansel, https://github.com/anijain2305

Co-authored-by: Sam Larsen <slarsen@meta.com>
2025-04-16 06:15:11 +00:00
b4cee2bf57 [executorch hash update] update the pinned executorch hash (#151280)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151280
Approved by: https://github.com/pytorchbot
2025-04-16 05:39:06 +00:00
107121dfad [AOTInductor] Add interface for user managed buffer in package api. (#151325)
Summary:
https://github.com/pytorch/pytorch/pull/151141
We add interface for user managed buffer in the package api.

Test Plan:
Included in commit.

Reviewed By: henrylhtsang

Differential Revision: D72985440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151325
Approved by: https://github.com/angelayi
2025-04-16 04:25:40 +00:00
82200e33b5 Make torch._chunk_cat support non-contiguous inputs (#151263)
Currently, `torch._chunk_cat` only supports contiguous inputs (because `_pad_chunk()` uses `.view()`, which supports only contiguous tensors). This doesn't work for internal models, where input tensors can be non-contiguous:

- size=[8192, 16416], stride=[16448, 1]  # stride[0] is larger than size[1]
- size=[1152, 384], stride=[1, 1152]  # column-major tensor

In this PR, we relax the contiguity assumption on input tensors by switching from `.view()` to `.reshape()`. Since `.reshape()` will try to use `.view()` under the hood whenever possible, this should not cause regressions for existing use cases.
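To see why the two layouts listed above count as non-contiguous, the sketch below compares their strides with the row-major strides implied by the sizes. The helper names are hypothetical, and the check is simplified (it ignores size-1 dimensions and empty tensors, which real contiguity checks treat specially).

```python
def row_major_strides(size):
    """Strides (in elements) of a densely packed row-major tensor of this size."""
    strides, running = [], 1
    for dim in reversed(size):
        strides.append(running)
        running *= dim
    return list(reversed(strides))


def is_row_major_contiguous(size, stride):
    """Simplified check: do the strides match the dense row-major layout?"""
    return list(stride) == row_major_strides(size)


padded_rows = is_row_major_contiguous([8192, 16416], [16448, 1])   # rows have padding
column_major = is_row_major_contiguous([1152, 384], [1, 1152])     # transposed layout
dense = is_row_major_contiguous([1152, 384], [384, 1])
print(padded_rows, column_major, dense)
```

Only the last layout matches the dense row-major strides; the first two are the cases from the commit message.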

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151263
Approved by: https://github.com/BoyuanFeng
2025-04-16 04:18:46 +00:00
30101aa450 [c10d][fr] Add counters for FR dump and reduce its timeout to finish dump before watchdog timeout (#151329)
After https://github.com/pytorch/pytorch/pull/150652, we still see some ranks missing dumps. On closer inspection, the FR dump timed out on its first attempt:
watchdog thread: notify FR dump -> wait for 1 min -> throw watchdog timeout -> notify elastic to kill process
FR dump thread: received FR dump signal -> timed out after 1 min on first attempt -> started 2nd attempt -> got killed.

So we want to make the FR dump timeout shorter. In practice, the logs show that the dump finishes within one second. Even at a very slow speed like 200K/s, a typical FR dump (1MB at most) takes around 5 secs, so 15 secs gives roughly a 3x buffer.

We still let the watchdog sleep for 1 min so that there is enough time for two dump attempts to time out and for the following checks, such as the GIL checker, to execute.

Also, if we get stuck acquiring the GIL or on a CUDA hang, 15 seconds should be enough to detect the hang.
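The timing argument can be checked with a few lines of arithmetic. The numbers come from the description above; treating "1MB" as 1000 KB and "200K/s" as 200 KB/s is an assumption made for round figures.

```python
dump_size_kb = 1000         # "1MB at most", taken as 1000 KB (assumption)
slow_write_kb_per_s = 200   # pessimistic write speed from the description
dump_timeout_s = 15         # new FR dump timeout
watchdog_sleep_s = 60       # watchdog still sleeps for 1 min

worst_case_dump_s = dump_size_kb / slow_write_kb_per_s
headroom = dump_timeout_s / worst_case_dump_s
two_attempts_s = 2 * dump_timeout_s

print(worst_case_dump_s, headroom, two_attempts_s)
```

Two full dump timeouts fit in half of the watchdog's sleep, leaving the remainder for the follow-up checks.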

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151329
Approved by: https://github.com/fegin
2025-04-16 03:48:03 +00:00
3a90fd481e fix test_einsum: use initialized values (#151363)
Summary: `empty` returns uninitialized values, which may be NaNs, so `assert_close` kept failing in FBCode.
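The underlying issue is that NaN never compares equal or close to itself unless a comparison explicitly opts in. A standard-library sketch (using `math.isclose` as a stand-in, not `torch.testing.assert_close`):

```python
import math

nan = float("nan")

# NaN is unequal to everything, including itself.
equal_to_itself = nan == nan
# Tolerance-based closeness checks inherit the same behavior.
close_to_itself = math.isclose(nan, nan)

print(equal_to_itself, close_to_itself)
```

Filling the inputs with initialized values removes the chance that both sides of the comparison contain NaNs.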

Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:unbacked_symints_cpu -- --exact 'caffe2/test/inductor:unbacked_symints_cpu - test_einsum_cpu (caffe2.test.inductor.test_unbacked_symints.TestUnbackedSymintsCPU)' --env TORCH_LOGS="+output_code" --print-passing-details --env TORCH_LOGS_FORMAT="%(filename)s:%(lineno)s: %(message)s"
```

Differential Revision: D73067722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151363
Approved by: https://github.com/Camyll

Co-authored-by: Camyll Harajli <camyllh@meta.com>
2025-04-16 03:10:29 +00:00
6124dabd30 [CI][NoOp] Update skip reason for argmin_with_nan (#151374)
Which is https://github.com/pytorch/pytorch/issues/130295 (i.e. torch.compile produces correct results, but eager does not)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151374
Approved by: https://github.com/dcci
2025-04-16 02:33:20 +00:00
ae53510b9e Fix setUpClass() / tearDownClass() for device-specific tests (#151129)
Finishes up the work started in #121686 + adds test

Update: this was not as straightforward as I originally imagined. Context below.

**TL;DR:** `TestFoo{CPU, CUDA}` now actually derive from `TestFoo`! Also, `{CPU, CUDA}TestBase` setup / teardown logic is now always called (it is required to set the primary device), regardless of whether `super().setUpClass()` / `super().tearDownClass()` are called or not.

**Background:** The typical way to get device-specific tests is to write a generic `TestFoo` and call `instantiate_device_type_tests(TestFoo, locals())` to get `TestFooCPU`, `TestFooCUDA`, etc. After this, generic tests (e.g. `TestFoo.test_bar()`) become `TestFooCPU.test_bar_cpu()` / `TestFooCUDA.test_bar_cuda()`.

Behind the scenes, this was historically accomplished by creating a `TestFooCUDA` that derives from both a `CUDATestBase` and an *empty class* called `TestFoo_base`. This `TestFoo_base` has the same bases as `TestFoo`, but none of the test functions (e.g. `test_bar()`). The documented reason for this is to avoid things like a derived `TestFooCUDA.test_bar()` being discovered in addition to the real device-specific test `TestFooCUDA.test_bar_cuda()`.

(1) A reason this matters is because it should be possible to call e.g. `super().setUpClass()` from a custom setup / teardown classmethod. If the generated TestFooCUDA does not derive from TestFoo, but instead derives from the empty class described above, this syntax does not work; in fact there is no way to form a proper `super()` call that works across the device-specific test variants. Here's an example that breaks in the OpInfo tests:

070f389745/test/test_ops.py (L218-L221)

(2) Further, there is some precedent within a custom `setUpClass()` impl for storing things on the `cls` object to be accessed at test time. This must be the device-specific test class (`TestFooCUDA`) and not `TestFoo` for this to work. As an example, the open device registration tests load a module during setup and use it in the test logic:

070f389745/test/test_cpp_extensions_open_device_registration.py (L63-L77)

070f389745/test/test_cpp_extensions_open_device_registration.py (L79-L80)

To accomplish both (1) and (2) at the same time, I decided to revisit the idea of utilizing a proper inheritance hierarchy for `TestFoo` -> `{TestFooCPU, TestFooCUDA}`. That is: have TestFooCPU / TestFooCUDA **actually** derive from `TestFoo`. This achieves both (1) and (2). The only thing left is to make sure the generic tests (e.g. `TestFoo.test_bar()`) are not discoverable, as was the stated reason for diverging from this in the first place. It turns out we can simply `delattr()` these generic tests from `TestFoo` once `TestFooCPU` / `TestFooCUDA` have been setup with the device-specific variants, and all works well. The `instantiate_device_type_tests(...)` logic already deletes `TestFoo` from scope, so I don't see a problem with deleting generic tests from this base class as well (CI will prove me right or wrong ofc).
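The inheritance-plus-`delattr()` idea can be sketched with plain `unittest`. Everything here is a simplified illustration: `instantiate_sketch` and this `CPUTestBase` are hypothetical stand-ins, not the code in `common_device_type.py`.

```python
import unittest


class CPUTestBase(unittest.TestCase):
    """Hypothetical stand-in for a device-specific base class."""

    device_type = "cpu"

    @classmethod
    def setUpClass(cls):
        # Continue along the MRO so the generic class's setup also runs.
        super().setUpClass()
        cls.primary_device = cls.device_type


class TestFoo(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        # `cls` is the device-specific class, so its tests can see this state.
        cls.resource = f"loaded for {cls.__name__}"

    def test_bar(self, device):
        self.assertEqual(device, self.primary_device)
        self.assertEqual(self.resource, "loaded for TestFooCPU")


def instantiate_sketch(generic_cls, device_bases):
    generic_tests = {
        name: fn
        for name, fn in vars(generic_cls).items()
        if name.startswith("test_") and callable(fn)
    }
    device_classes = {}
    for base in device_bases:
        cls_name = generic_cls.__name__ + base.device_type.upper()
        # The device class really derives from the generic class.
        device_cls = type(cls_name, (base, generic_cls), {})
        for name, fn in generic_tests.items():
            def device_test(self, _fn=fn):
                return _fn(self, self.device_type)
            setattr(device_cls, f"{name}_{base.device_type}", device_test)
        device_classes[cls_name] = device_cls
    # Remove the generic tests so they are not discovered via inheritance.
    for name in generic_tests:
        delattr(generic_cls, name)
    return device_classes


TestFooCPU = instantiate_sketch(TestFoo, [CPUTestBase])["TestFooCPU"]
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFooCPU)
result = unittest.TestResult()
suite.run(result)
print(result.testsRun, result.wasSuccessful())
```

In this sketch `super().setUpClass()` resolves normally from `TestFoo`, state stored on `cls` lands on `TestFooCPU`, and only `test_bar_cpu` is discoverable.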

**Side note:** I was encountering a weird race condition where sometimes the custom `setUpClass()` / `tearDownClass()` defined & swapped in [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L940-L955)) would be used, and sometimes it wouldn't. This non-deterministic behavior was called out previously by @ngimel here:
4a47dd9b3f/test/inductor/test_torchinductor_dynamic_shapes.py (L128-L130)

To address this, I moved this block of logic to before the first call to `instantiate_test()`, as that method queries for the primary device, and the primary device identification logic may manually invoke `setUpClass()` (see [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L381-L384))). Goal: define the `setUpClass()` / `tearDownClass()` we want for correctness before they're ever called. This seems to work and the behavior is deterministic now AFAICT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151129
Approved by: https://github.com/janeyx99, https://github.com/masnesral, https://github.com/malfet
2025-04-16 02:18:42 +00:00
067a7b1d4a Disable -Werror for s390x test module compilation (#150413)
This change should make nightly testsuite green again for s390x.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150413
Approved by: https://github.com/seemethere
2025-04-16 02:15:17 +00:00
aacac88bee [ROCM] Fix in-place aten sum with specialized templated kernels. (#151230)
We noticed a regression when doing aten.sum in-place (a+=b) when the type of the output is not the same as that of the functor.

Co-authored by: Jerry Mannil <jerry.mannil@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151230
Approved by: https://github.com/jeffdaily
2025-04-16 02:07:46 +00:00
cyy
cadd832c19 [1/N] Use std::string_view in torchgen (#146403)
Moves remaining c10::sv to std::sv

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146403
Approved by: https://github.com/albanD
2025-04-16 01:50:22 +00:00
dd11613f94 [cutlass backend][experimental] Try out presets for cutlass instead of searching all configs (#151255)
Differential Revision: [D72668861](https://our.internmc.facebook.com/intern/diff/D72668861/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151255
Approved by: https://github.com/mlazos
2025-04-16 01:48:06 +00:00
532025fbd0 [cutlass backend][ez] Ban FP32 output dtype from using CUTLASS GEMM backend (#151279)
FP32 not supported: https://github.com/pytorch/pytorch/issues/145952

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151279
Approved by: https://github.com/ColinPeppler
2025-04-16 01:12:18 +00:00
8780d18f64 [ONNX] Add a comment for handling bf16/fp8 tensor to numpy conversion (#151371)
Follow up of https://github.com/pytorch/pytorch/pull/151259
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151371
Approved by: https://github.com/titaiwangms
2025-04-16 00:49:38 +00:00
4bbb61812c [BE][1/2] Move original_weights_lookup attribute to constant (#151241)
Summary: As title. Cleaning usages by using global constant.

Test Plan: `buck test 'fbcode//mode/opt' fbcode//caffe2/test:quantization_fx -- --exact 'caffe2/test:quantization_fx - test_keep_original_weights (quantization.fx.test_quantize_fx.TestQuantizeFx)'`

Differential Revision: D72892815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151241
Approved by: https://github.com/Skylion007, https://github.com/hl475
2025-04-16 00:41:25 +00:00
44a522dd78 [BE] Fix extra-semi warning in attention.cpp (#151367)
Introduced by https://github.com/pytorch/pytorch/pull/149512

Before this change, the following warning was generated
```
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/transformers/attention.cpp:452:71: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
  452 | REGISTER_HPU_DISPATCH(_fused_sdp_choice_stub, &_fused_sdp_choice_meta);
      |                                                                       ^
1 warning generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151367
Approved by: https://github.com/drisspg
2025-04-16 00:31:45 +00:00
8e6415fd32 [cutlass backend] "Fix" FlexibleLayout (#151284)
So Horace was right, Triton does fix the layout when rendering the template (i.e. roughly at the same time).

You can double-check this by running the unit test with the gemm backend set to "TRITON,CUTLASS". You will notice that the layout is fixed if Triton is among the gemm backends, but flexible if it is not.

code pointer: https://github.com/pytorch/pytorch/blob/main/torch/_inductor/select_algorithm.py#L927

In the future, we should remove `fix_op_layout` from class CUTLASSGemmTemplate. But maybe we can monitor it for a bit first.

Differential Revision: [D72996143](https://our.internmc.facebook.com/intern/diff/D72996143/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151284
Approved by: https://github.com/ColinPeppler
2025-04-16 00:10:52 +00:00
e55eb5c870 [Cutlass] Integrate EVT codegen into 3x gemm template (#150346)
Previously merged:
* #150345
* #150344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150346
Approved by: https://github.com/henrylhtsang
ghstack dependencies: #150344, #150345
2025-04-16 00:08:22 +00:00
3cf0e2d8ec Add inductor standalone_compile API (#150670)
This PR adds standalone_compile API that does precompilation via caching to support vLLM use case in the short term while we work on the longer term precompilation solution.

```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
2025-04-15 23:38:15 +00:00
9917feff50 [ONNX] Produce correct dtypes for bf16/f8 in IR TorchTensor (#151259)
Split the changes from https://github.com/pytorch/pytorch/pull/151069 to address https://github.com/microsoft/onnxscript/issues/2187, where the output np arrays do not have the correct ml_dtypes types as expected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151259
Approved by: https://github.com/titaiwangms
2025-04-15 23:21:04 +00:00
331423e5c2 Fix tensorpipe compilation with clang-17 (#151344)
By suppressing `missing-template-arg-list-after-template-kw` warning, which seems to be required to compile Google's libnop, which is in a semi-abandoned state now
```
In file included from /Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/base/variant.h:21:
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:241:30: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
  241 |     index_ = value_.template Construct(std::forward<Args>(args)...);
      |                              ^
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:258:26: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
  258 |     if (!value_.template Assign(TypeTag<T>{}, index_, std::forward<U>(value))) {
      |                          ^
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:265:26: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
  265 |     if (!value_.template Assign(index_, std::forward<T>(value))) {
      |                          ^
3 errors generated.
```

Fixes https://github.com/pytorch/pytorch/issues/151316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151344
Approved by: https://github.com/ZainRizvi, https://github.com/seemethere
2025-04-15 22:18:06 +00:00
98b1e82ba8 Revert "Fix setUpClass() / tearDownClass() for device-specific tests (#151129)"
This reverts commit bd4cf30e31a2a0b0a57f54c7eedd3a39d5778cbe.

Reverted https://github.com/pytorch/pytorch/pull/151129 on behalf of https://github.com/jbschlosser due to flex attention tests failing ([comment](https://github.com/pytorch/pytorch/pull/151129#issuecomment-2807632119))
2025-04-15 22:07:25 +00:00
e1d8b3f838 [inductor] Check NoneLayout in update_zero_dim_cpu_tensor (#151321)
Summary:
This fixes the error in https://fb.workplace.com/groups/1075192433118967/permalink/1640802133224658/
I tried really hard but I couldn't come up with a test case to repro the issue, but I confirmed with the OP that this issue has been fixed.
```
Traceback (most recent call last):
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/compile_fx.py", line 746, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/compile_fx.py", line 1343, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/compile_fx.py", line 1232, in codegen_and_compile
    compiled_module = graph.compile_to_module()
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 2087, in compile_to_module
    return self._compile_to_module()
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 2095, in _compile_to_module
    self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 2002, in codegen
    self._update_scheduler()
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 1996, in _update_scheduler
    self.scheduler = Scheduler(self.operations)
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/scheduler.py", line 1954, in __init__
    self._init(nodes)
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/scheduler.py", line 1974, in _init
    self.update_zero_dim_cpu_tensor()
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/scheduler.py", line 4433, in update_zero_dim_cpu_tensor
    and buffer.get_size() == []
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/ir.py", line 3903, in get_size
    return [*self.get_layout().size]
  File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/ir.py", line 3914, in get_layout
    raise NotImplementedError(type(self.layout).__name__)
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
NotImplementedError: NoneLayout
```

Test Plan: OP said the issue is fixed

Differential Revision: D72575808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151321
Approved by: https://github.com/BoyuanFeng
2025-04-15 21:58:09 +00:00
4518b30680 Clarify that x and dx are mutually exclusive in torch.trapezoid doc (#151190)
This PR addresses [#151105](https://github.com/pytorch/pytorch/issues/151105) by stating that x and dx are mutually exclusive parameters in torch.trapezoid()
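To make "mutually exclusive" concrete, here is a pure-Python sketch of the trapezoidal rule over a 1-D list. It is not `torch.trapezoid`; the function below and its choice to raise `ValueError` when both arguments are given are assumptions for illustration.

```python
def trapezoid(y, x=None, dx=None):
    """Trapezoidal rule over a 1-D sequence; pass sample points `x` or spacing `dx`, not both."""
    if x is not None and dx is not None:
        raise ValueError("x and dx are mutually exclusive")
    if x is None:
        # Uniform spacing, defaulting to 1 when neither argument is given.
        spacing = 1.0 if dx is None else dx
        widths = [spacing] * (len(y) - 1)
    else:
        # Non-uniform spacing taken from consecutive sample points.
        widths = [x[i + 1] - x[i] for i in range(len(x) - 1)]
    return sum(w * (y[i] + y[i + 1]) / 2 for i, w in enumerate(widths))


y = [1.0, 2.0, 3.0]
unit_spacing = trapezoid(y)
half_spacing = trapezoid(y, dx=0.5)
custom_points = trapezoid(y, x=[0.0, 1.0, 3.0])
print(unit_spacing, half_spacing, custom_points)
```

Both arguments describe the spacing between samples, so supplying both would be ambiguous.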

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151190
Approved by: https://github.com/soulitzer
2025-04-15 21:42:05 +00:00
630cf46039 [Cutlass] Codegen for EVT Epilogue (#150345)
Previously merged:
* #150344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150345
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
ghstack dependencies: #150344
2025-04-15 21:31:21 +00:00
27ef3f6cdc [ROCm][CI/CD] Create ROCm6.4 magma tarball (#151345)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151345
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-04-15 21:12:48 +00:00
71e7dcda87 [c10d][fr] Record each individual collective being coalesced (#151238)
When recording coalesced collectives in FR, we are not consistent. For P2P ops, we log individual collectives into FR, but for non-P2P ops we don't. This PR makes non-P2P ops also log individual collectives into FR, so that a script can check the correctness of each coalesced collective.

The added unit test also addresses the unit test request in the comment at https://github.com/pytorch/pytorch/pull/150863?fbclid=IwZXh0bgNhZW0CMTEAAR4a5Rd_JyJlrbKZcacbIv5WX5b4MqBRNn0hpgl-VTSD0eeXRlPZ9Ty_CPOYhQ_aem_ALEG1ibRajwie-rn1B4n5w#pullrequestreview-2751254224.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151238
Approved by: https://github.com/d4l3k, https://github.com/wconstab
ghstack dependencies: #151247
2025-04-15 20:56:37 +00:00
ae648f047c [c10d][fr] Enable FR analysis script for rest of all coalesce op (#151247)
We revisited how coalesced collectives work in https://github.com/pytorch/pytorch/pull/151243, and we now want to enable the script for the slow path. The change is BC-breaking, but it is needed to make this work, and the API is internal-only, not user-facing. For the slow path, each individual collective has its input sizes and output sizes recorded but no state; only the final entry has the state ready. We therefore check the correctness of each individual collective one by one, but we can only check the state match for the last one, which is the work item with the coalesced label.

Added more unit tests for the slow path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151247
Approved by: https://github.com/d4l3k, https://github.com/XilunWu
2025-04-15 20:53:03 +00:00
f98150fc8e Warn user of existing lock file to avoid infinite waiting (#149382)
Sometimes the Python script does not exit normally and the lock file remains in the path. In this case, `file_baton.py` may sleep forever waiting for the lock file to be released. This PR adds a warning that shows the path of the existing lock file, so the user knows which file to delete when the wait takes too long.
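
A minimal sketch of the idea, not the actual `file_baton.py` code: poll for the lock file and warn once, naming its path, after a threshold. The function name `wait_for_lock` and the parameters `warn_after_seconds` and `max_wait_seconds` are assumptions made for illustration only.

```python
import os
import time
import warnings


def wait_for_lock(lock_path, wait_seconds=0.1, warn_after_seconds=1.0, max_wait_seconds=None):
    """Poll until lock_path disappears; warn once if the wait gets long.

    Returns True when the lock file is gone, or False if max_wait_seconds
    (a hypothetical safety valve) elapses first.
    """
    start = time.monotonic()
    warned = False
    while os.path.exists(lock_path):
        elapsed = time.monotonic() - start
        if not warned and elapsed >= warn_after_seconds:
            # Name the file so the user knows what to delete if it is stale.
            warnings.warn(
                f"Still waiting on lock file {lock_path}; "
                "delete it if no other process is using it."
            )
            warned = True
        if max_wait_seconds is not None and elapsed >= max_wait_seconds:
            return False
        time.sleep(wait_seconds)
    return True
```

Warning only once keeps the log readable while still surfacing the path of a possibly stale lock.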

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149382
Approved by: https://github.com/soulitzer
2025-04-15 20:25:29 +00:00
bd4cf30e31 Fix setUpClass() / tearDownClass() for device-specific tests (#151129)
Finishes up the work started in #121686 + adds test

Update: this was not as straightforward as I originally imagined. Context below.

**TL;DR:** `TestFoo{CPU, CUDA}` now actually derive from `TestFoo`! Also, `{CPU, CUDA}TestBase` setup / teardown logic is now always called (it is required to set the primary device), regardless of whether `super().setUpClass()` / `super().tearDownClass()` are called or not.

**Background:** The typical way to get device-specific tests is to write a generic `TestFoo` and call `instantiate_device_type_tests(TestFoo, locals())` to get `TestFooCPU`, `TestFooCUDA`, etc. After this, generic tests (e.g. `TestFoo.test_bar()`) become `TestFooCPU.test_bar_cpu()` / `TestFooCUDA.test_bar_cuda()`.

Behind the scenes, this was historically accomplished by creating a `TestFooCUDA` that derives from both a `CUDATestBase` and an *empty class* called `TestFoo_base`. This `TestFoo_base` has the same bases as `TestFoo`, but none of the test functions (e.g. `test_bar()`). The documented reason for this is to avoid things like a derived `TestFooCUDA.test_bar()` being discovered in addition to the real device-specific test `TestFooCUDA.test_bar_cuda()`.

(1) One reason this matters is that it should be possible to call e.g. `super().setUpClass()` from a custom setup / teardown classmethod. If the generated `TestFooCUDA` does not derive from `TestFoo`, but instead derives from the empty class described above, this syntax does not work; in fact there is no way to form a proper `super()` call that works across the device-specific test variants. Here's an example that breaks in the OpInfo tests:

070f389745/test/test_ops.py (L218-L221)

(2) Further, there is some precedent within a custom `setUpClass()` impl for storing things on the `cls` object to be accessed at test time. This must be the device-specific test class (`TestFooCUDA`) and not `TestFoo` for this to work. As an example, the open device registration tests load a module during setup and use it in the test logic:

070f389745/test/test_cpp_extensions_open_device_registration.py (L63-L77)

070f389745/test/test_cpp_extensions_open_device_registration.py (L79-L80)

To accomplish both (1) and (2) at the same time, I decided to revisit the idea of utilizing a proper inheritance hierarchy for `TestFoo` -> `{TestFooCPU, TestFooCUDA}`. That is: have `TestFooCPU` / `TestFooCUDA` **actually** derive from `TestFoo`. This achieves both (1) and (2). The only thing left is to make sure the generic tests (e.g. `TestFoo.test_bar()`) are not discoverable, as was the stated reason for diverging from this in the first place. It turns out we can simply `delattr()` these generic tests from `TestFoo` once `TestFooCPU` / `TestFooCUDA` have been set up with the device-specific variants, and all works well. The `instantiate_device_type_tests(...)` logic already deletes `TestFoo` from scope, so I don't see a problem with deleting generic tests from this base class as well (CI will prove me right or wrong ofc).

**Side note:** I was encountering a weird race condition where sometimes the custom `setUpClass()` / `tearDownClass()` defined & swapped in [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L940-L955)) would be used, and sometimes it wouldn't. This non-deterministic behavior was called out previously by @ngimel here:
4a47dd9b3f/test/inductor/test_torchinductor_dynamic_shapes.py (L128-L130)

To address this, I moved this block of logic to before the first call to `instantiate_test()`, as that method queries for the primary device, and the primary device identification logic may manually invoke `setUpClass()` (see [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L381-L384))). Goal: define the `setUpClass()` / `tearDownClass()` we want for correctness before they're ever called. This seems to work and the behavior is deterministic now AFAICT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151129
Approved by: https://github.com/janeyx99, https://github.com/masnesral, https://github.com/malfet
2025-04-15 20:13:26 +00:00
d77e0cddfe [Cutlass] Import cutlass python API for EVT (#150344)
This imports the pieces of the cutlass python API that are needed for python EVT tracing. It builds on existing importing for cutlass_library. Once EVT tracing has been added to cutlass_library (should be later this year) this can be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150344
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-04-15 20:11:40 +00:00
91923f0ee1 [inductor] disable alignment asserts in fbcode (#151274)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151274
Approved by: https://github.com/Mingming-Ding, https://github.com/Microve, https://github.com/eellison
2025-04-15 19:59:54 +00:00
a2632d5241 [HOP] Reworked DispatchKey.Autograd (#151107)
This PR reworks the dispatching of the autograd key.
Currently, the DispatchKey.Autograd of the HOPs is triggered even if none of the operands of the HOP have `requires_grad=True`. With this rework, autograd is bypassed if none of the operands require gradients and is only invoked if any of the operands require gradients.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151107
Approved by: https://github.com/ydwu4
2025-04-15 19:55:46 +00:00
19a33b20c2 [ROCm][CI/CD] create ROCm 6.4 images, part 1, skip magma tarball (#151236)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151236
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-04-15 19:45:15 +00:00
8d5f7ab06c Replace all random is_fbcode imports to environment (#151283)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151283
Approved by: https://github.com/masnesral, https://github.com/Skylion007
2025-04-15 19:42:58 +00:00
eea4a7b424 update expected results for comptime benchmark (#151319)
This PR https://github.com/pytorch/pytorch/pull/150594 bumped the benchmark up by ~1%, a bit under our 1.5% "regression" mark.

Modeled this PR after https://github.com/pytorch/pytorch/pull/144274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151319
Approved by: https://github.com/jamesjwu, https://github.com/laithsakka
2025-04-15 19:40:13 +00:00
e45a6a9300 [inductor][test] Disable Triton GEMM backend tests for SM89 (#150485)
Motivation: To deprecate a silent fallback behavior https://github.com/pytorch/pytorch/issues/150390

Problem: On SM89, the Triton GEMM backend isn't working. This seems to be a pre-existing issue. I don't have access to SM89 to debug further.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150485
Approved by: https://github.com/xmfan, https://github.com/eellison
2025-04-15 19:03:52 +00:00
f1adf22b5f improve noop elimination for slice and slice_scatter (#151175)
Improves noop elimination for `slice` and `slice_scatter`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151175
Approved by: https://github.com/zou3519
2025-04-15 18:56:50 +00:00
d7050ef48b [CI] Run test_torchinductor for MPS device (#150821)
There are only 118 failures atm; mark them all with xfail to avoid new regressions.

Add `xfail_if_mps_unimplemented` decorator to distinguish between tests that call an unimplemented eager op vs ones that fail for some other reason.

Added `aten._scaled_dot_product_attention_math_for_mps` fallback to make test behavior consistent between MacOS-15 (where the fallback is in place) and MacOS-14

Weird MacOS-14 specific skips:
- test_torchinductor.py::GPUTests::test_cat_extern_kernel_mps
- test_torchinductor.py::GPUTests::test_sort_transpose_mps (likely an eager bug)
- test_torchinductor.py::GPUTests::test_unaligned_input_mps

Numerous MacOS-13 skips, including a few eager hard crashes; for example, running `test_torchinductor.py::GPUTests::test_scatter5_mps` causes
```
/AppleInternal/Library/BuildRoots/c651a45f-806e-11ed-a221-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSNDArray/Kernels/MPSNDArrayScatter.mm:309: failed assertion `Rank of destination array (1) must be greater than or equal to inner-most dimension of indices array (3)'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150821
Approved by: https://github.com/ZainRizvi, https://github.com/dcci
ghstack dependencies: #151224, #151246, #151272, #151282, #151288
2025-04-15 18:42:39 +00:00
7e5f6dcf7f Add @requires_multicast_support to test_multimem_all_gather (#151227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151227
Approved by: https://github.com/jeffdaily
2025-04-15 18:41:12 +00:00
83d88d128d [reland] Make export._trace._WrapperModule work in strict mode (#146919) (#151264)
Summary:

as title

`export._trace._WrapperModule` is used to wrap functions into a Module so we can export the function.

We add `export._wrapper_utils` to `dynamo`'s `MOD_INLINELIST` so dynamo traces into `_WrapperModule`

Fixes https://github.com/pytorch/pytorch/issues/146867

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test:test_export -- -r wrapper_module
```

Differential Revision: D72986826

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151264
Approved by: https://github.com/angelayi
2025-04-15 18:35:34 +00:00
61f127aac5 [Export] fix automatically convert instances of _check(u>=0) to check_is_size() (#148844)
Fixes #148826

Understanding:

1. PyTorch should automatically convert instances of _check(u>=0) to check_is_size()
2. The export mechanism should suggest using check_is_size() instead of _check(u>=0) when applicable

Changes made:
1. Added a helper function to detect non-negative checks: is_non_negative_check
2. Modified the suggestion logic in _suggest_torch_checks to detect and handle non-negative checks
3. Added unit tests test_is_non_negative_check_function, test_suggest_torch_checks_with_non_negative_check, and test_suggest_torch_checks_with_regular_check

unit tests:

(base) sany@sandishs-Laptop pytorch % pytest test/export/test_export.py::TestExport::test_suggest_torch_checks_with_non_negative_check
=================================== test session starts ==================
platform darwin -- Python 3.9.19, pytest-7.3.2, pluggy-1.5.0
rootdir: /Users/sany/git/pytorch
configfile: pytest.ini
plugins: xdoctest-1.1.0, cpp-2.3.0, flakefinder-1.1.0, anyio-4.6.0, rerunfailures-14.0, hypothesis-5.35.1, xdist-3.3.1, subtests-0.13.1, typeguard-4.3.0
collected 1 item
Running 1 items in this shard

test/export/test_export.py .                                                                                           [100%]

======================== 1 passed in 1.67s =======================
(base) sany@sandishs-Laptop pytorch % pytest test/export/test_export.py::TestExport::test_suggest_torch_checks_with_regular_check
======================= test session starts =================
platform darwin -- Python 3.9.19, pytest-7.3.2, pluggy-1.5.0
rootdir: /Users/sany/git/pytorch
configfile: pytest.ini
plugins: xdoctest-1.1.0, cpp-2.3.0, flakefinder-1.1.0, anyio-4.6.0, rerunfailures-14.0, hypothesis-5.35.1, xdist-3.3.1, subtests-0.13.1, typeguard-4.3.0
collected 1 item
Running 1 items in this shard

test/export/test_export.py .                                                                                           [100%]

================================= 1 passed in 1.61s ================
(base) sany@sandishs-Laptop pytorch % pytest test/export/test_export.py::TestExport::test_is_non_negative_check_function
================================ test session starts =============
platform darwin -- Python 3.9.19, pytest-7.3.2, pluggy-1.5.0
rootdir: /Users/sany/git/pytorch
configfile: pytest.ini
plugins: xdoctest-1.1.0, cpp-2.3.0, flakefinder-1.1.0, anyio-4.6.0, rerunfailures-14.0, hypothesis-5.35.1, xdist-3.3.1, subtests-0.13.1, typeguard-4.3.0
collected 1 item
Running 1 items in this shard

test/export/test_export.py .                                                                                           [100%]

======================= 1 passed in 1.62s =========================
(base) sany@sandishs-Laptop pytorch %

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148844
Approved by: https://github.com/laithsakka
2025-04-15 17:41:11 +00:00
74f6bc28a7 Revert "Add inductor standalone_compile API (#150670)"
This reverts commit c9aef508984a31f03821eaad381468673ef29c0a.

Reverted https://github.com/pytorch/pytorch/pull/150670 on behalf of https://github.com/Camyll due to breaking internal builds with torch module not found error ([comment](https://github.com/pytorch/pytorch/pull/150670#issuecomment-2806975267))
2025-04-15 17:35:59 +00:00
c0a0761871 [Inductor] Refactor wrapper codegen to use Wrapper IR. (#150458)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/146942.

# Feature

This PR refactors the existing wrapper codegen into `WrapperLine` subclasses, extending the existing Memory Planning IR into a fully-fledged Wrapper IR. See the diagram below.

![wrapper_ir](https://github.com/user-attachments/assets/a61db21b-caf3-45d2-bfdb-91066ae4ba6b)

The IR currently supports the following ops:
- All existing memory planning IR ops (`AllocateLine`, `FreeIfNotReusedLine`, etc.)
- Reinterpret views (`ReinterpretLine`)
- Kernel definitions (`KernelDefinitionLine`)
- Calls to defined kernels (`KernelCallLine`)
- Calls to extern kernels (`ExternKernelLine`, `ExternKernelAllocLine`)
- Ops with multiple outputs (`MultiOutputLine`)
- Tensor cleanup at the end of a graph (`FreeLine`)
- Leaving comments in code (`CommentLine`)

There are two main motivations for this refactor:
1. Unlike free-form C++ and Python code, Wrapper IR lines provide structured information about what the wrapper code does. This serves as a natural extension point for other types of wrapper codegen. For example, the parent PR generates FX IR from Wrapper IR. Wrapper IR aims to give new backends enough information to generate wrapper code without needing to modify core Inductor files such as `ir.py`.
2. This design will hopefully promote stronger modularity and encapsulation.
   a. Inductor's core compilation passes don't need to worry about whether they're targeting Python, C++, FX or anything else. They can simply focus on generating Wrapper IR, and target-specific code can be refactored into the various backends.
   b. Backends do not need to know about all the details and internal state of `V.graph` IR. For example, they don't need to consider whether a buffer has been removed from the graph when generating code. Wrapper IR will hopefully provide a simpler interface for generating wrapper code, which abstracts away the details of device code.

# Implementation details

The implementation mainly consists of separating direct C++/Python codegen into two phases:
 1. Emit Wrapper IR lines describing what the wrapper code is supposed to do.
 2. Inside the `codegen()` method of each `WrapperLine`, call backend methods which generate pure Python/C++ code using the information stored in the Wrapper IR line. For example, `KernelCallLine` calls `wrapper._generate_kernel_call_helper`, which is overridden by the various Python and C++ backends to generate the final wrapper code.

The main difficulty in implementing this is that we need to be careful that code is generated in the correct order. Wrapper codegen happens in two passes: first we write code into `self.lines` which mainly contains wrapper IR, but can also contain raw Python or C++ lines in some situations. Then, we convert the wrapper IR into the final Python/C++ code in `self.wrapper_call`. Since the same macros may be used in both passes, it's difficult to ensure that code is written to the correct buffer. The easiest solution for this was to implement a context manager overriding the `writeline` method to write to  `self.wrapper_call` after memory planning is finished. This way, `writeline` writes to `self.lines` in the first pass, and `self.wrapper_call` in the second. This obviated the need to pass `code` or `writeline` variables all the way through the call stack, which would have touched most of the existing macros.
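
The two-pass scheme with a `writeline`-swapping context manager can be sketched as follows. This is a toy model under stated assumptions, not Inductor's code: `WrapperSketch`, `KernelCallLineSketch`, and the method name `set_writeline` are illustrative, and the real wrapper classes carry far more state.

```python
from contextlib import contextmanager


class KernelCallLineSketch:
    """Toy IR line: stores structured info and emits code only in codegen()."""

    def __init__(self, kernel_name, args):
        self.kernel_name = kernel_name
        self.args = args

    def codegen(self, wrapper):
        wrapper.writeline(f"{self.kernel_name}({', '.join(self.args)})")


class WrapperSketch:
    def __init__(self):
        self.lines = []         # first pass: IR lines, plus occasional raw strings
        self.wrapper_call = []  # second pass: final generated code

    def writeline(self, line):
        # Default target during the first pass.
        self.lines.append(line)

    @contextmanager
    def set_writeline(self, new_writeline):
        # Temporarily override writeline on this instance, then restore it.
        old_writeline = self.writeline
        self.writeline = new_writeline
        try:
            yield
        finally:
            self.writeline = old_writeline

    def generate(self):
        # Second pass: helpers that call writeline now land in wrapper_call.
        with self.set_writeline(self.wrapper_call.append):
            for line in self.lines:
                if isinstance(line, str):
                    self.writeline(line)
                else:
                    line.codegen(self)
        return "\n".join(self.wrapper_call)
```

Because the redirect lives on the wrapper object, helper functions can keep calling `writeline` unchanged in both passes, which is the property the paragraph above relies on.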

# Test plan

Since this refactor touches all the existing wrapper codegen classes, the existing CI provides good coverage.

The parent PR introduces new tests for the FX IR backend. Among other things, these tests assert that `self.lines` only contains Wrapper IR lines, and no free-form code. While this would not be true of all programs today, the tests suggest that the IR implemented in this PR is sufficient to cover basic PyTorch usage.

# Future directions

These two goals are only partially realized by this PR. There are several important steps which still undergo direct Python/C++ codegen in core files:
 - User-defined Triton kernels.
 - Reinterpret views on outputs, from `gen_output_refs()`. (In the parent PR, the FX converter has a custom way of handling this. This can eventually be ported into Wrapper IR.)
 -  Fallback ops with custom `codegen()` methods, e.g. `ScatterFallback`.
 -  Misc. C++ lines emitted by the various cpp backends, e.g. declaring constants.

These cases will gradually be handled in subsequent PRs, as the Inductor->FX converter expands its coverage. Given that these refactors are pretty tricky to do, it seems wiser to execute them in stages, as opposed to porting everything to Wrapper IR at once. Some Python and C++ codegen still lives in core files such as `ir.py`, as described in previous sections. Hopefully, this PR will serve as a starting point which moves the codebase towards a more modular design. Over time, we can gradually refactor the remaining codegen (mainly in `ir.py`) into backend classes.

One limitation of this PR is that codegen still happens in two phases during `PythonWrapperCodegen`. First, we generate Wrapper IR into `self.lines`, and from there we generate Python or C++ code into `self.wrapper_call`, `self.header`, etc. In the long term, it would be cleaner to split wrapper IR into its own class which doesn't deal with Python/C++ codegen at all. (See the diagram at the top.) That would strictly enforce the boundary between Wrapper IR and Python/C++ wrapper code. However, this would probably be a much larger refactor.

Another limitation of the current code is that the helper functions have a lot of call args. It's also possible to clean this up by passing Wrapper IR ops e.g. `KernelCallLine` into helper functions like `_generate_kernel_call_helper`, since they store all the arguments. However, that change would likely be prone to merge conflicts, so I would like to save it for follow-up PRs if possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150458
Approved by: https://github.com/eellison
2025-04-15 17:28:36 +00:00
8f440a8e70 don't return logits for benchmark script (#151075)
PT2 benchmark scripts have a pattern like:
```
    def forward_and_backward_pass(self, mod, inputs, collect_outputs=True):
        cloned_inputs = clone_inputs(inputs)
        self.optimizer_zero_grad(mod)
        with self.autocast(**self.autocast_arg):
            pred = mod(**cloned_inputs)
            loss = self.compute_loss(pred)
        self.grad_scaler.scale(loss).backward()
        self.optimizer_step()
        if collect_outputs:
            return collect_results(mod, pred, loss, cloned_inputs)
        return None
```
for training.

The collect_outputs argument is True only for accuracy testing and it's false for performance testing.

For the HF benchmark suite, a model usually returns a tuple (loss, logits). For performance testing, even though the logits are never used anywhere, dynamo has to keep them due to the control flow.

Keeping the logits here has a few downsides:
1. Peak memory is higher, since the logits tensor is large and we cannot release its memory earlier.
2. We cannot apply optimizations like chunking to the logits, because the tensor needs to be returned from the pre-grad graph.

Actually I think it's fine to not return logits at all.
- For training cases, checking loss and gradients for accuracy is good enough. It is unlikely for two runs to have mismatched logits but matching loss/gradients.
- Also, discarding logits as soon as possible for perf benchmarking makes it more fair for us.

On the other hand, it may be interesting to let dynamo support something like dynamo.constexpr (similar to tl.constexpr). A variable annotated as dynamo.constexpr will be specialized at compile time and we can do more optimization (DCE e.g.) at compile time. (A small [repro](https://gist.github.com/shunting314/0912a8947028a904c34f361021b8024d))

Benchmark results here [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Fri%2C%2004%20Apr%202025%2018%3A03%3A26%20GMT&stopTime=Fri%2C%2011%20Apr%202025%2018%3A03%3A26%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/204/head&lCommit=fe25dab3f65e1b0e9db0af03f7664af70fcc9c66&rBranch=main&rCommit=55e62ff74ad5614faf80b060c7bfc551e3b7af5a)
- HF 15% (1.51 -> 1.66 compression ratio) peak memory improvement
- I also see a 5% (2.74 -> 2.79x) perf win for HF. It could be real: we may generate more efficient kernels since we don't need to keep the logits and return them from the pre-grad graph. But I'll double check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151075
Approved by: https://github.com/eellison, https://github.com/jansel
2025-04-15 17:13:00 +00:00
7d205b22b5 [profiler][retry] don't disable CUPTI_LAZY_REINIT for cuda >= 12.6 (#151124)
Retry of https://github.com/pytorch/pytorch/pull/150957, which was reverted due to internal meta failures

Credit to @mgmtea who wrote the initial version of this PR: https://github.com/pytorch/pytorch/pull/146604

Context: CUPTI is the NVIDIA library that Kineto uses for collecting GPU-side info during profiling. The intended usage is to register a callback while you want profiling to occur, and then unregister the callback when you want profiling to stop. But a bug would cause crashes if CUPTI callbacks were de-registered when used with cudagraphs. The workaround was to disable "CUPTI_LAZY_REINIT" and "CUPTI_TEARDOWN" in Kineto - which prevents crashes, but can result in slower execution after profiling has occurred and completed.

This bug is believed to be fixed in CUDA >= 12.6, so this PR qualifies that DISABLE_CUPTI_LAZY_REINIT=1 and CUPTI_TEARDOWN=0 should only be applied if CUDA < 12.6. Additionally, `profiler_allow_cudagraph_cupti_lazy_reinit_cuda12()` is added as an escape hatch so that we can add a killswitch in case we see more crashes related to this.
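
A rough sketch of the version gate, reading the PR title as the intent (the workaround is kept only for CUDA below 12.6). The function `cupti_workaround_env` and its `allow_lazy_reinit` parameter are hypothetical; the latter merely stands in for the escape hatch mentioned above, and the real logic lives in the profiler/Kineto integration rather than in Python.

```python
def cupti_workaround_env(cuda_version, allow_lazy_reinit=True):
    """Return the env overrides for the cudagraph/CUPTI crash workaround.

    cuda_version is a (major, minor) tuple. allow_lazy_reinit is a
    hypothetical killswitch: setting it to False forces the old workaround
    even on CUDA versions where the bug is believed to be fixed.
    """
    bug_believed_fixed = tuple(cuda_version) >= (12, 6)
    if bug_believed_fixed and allow_lazy_reinit:
        # No overrides: CUPTI may lazily re-initialize and tear down normally.
        return {}
    # Variable names as written in the description above.
    return {"DISABLE_CUPTI_LAZY_REINIT": "1", "CUPTI_TEARDOWN": "0"}
```

Comparing tuples avoids string-ordering mistakes such as treating "12.10" as older than "12.6".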

Differential Revision: [D72842114](https://our.internmc.facebook.com/intern/diff/D72842114/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D72842114/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151124
Approved by: https://github.com/sraikund16
2025-04-15 16:11:49 +00:00
c5de6ff079 Remove ls from filesystem base (#151117)
Summary: A user reported an issue where they inherit from filesystembase but don't have the `ls` method, which was added in https://github.com/pytorch/pytorch/pull/150701#discussion_r2039840129. This PR removes the method from the base class but keeps it in the derived class.

Test Plan: buck test 'fbcode//mode/opt' fbcode//caffe2/test/distributed/checkpoint:test_hf_storage

Differential Revision: D72867722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151117
Approved by: https://github.com/Skylion007, https://github.com/lw
2025-04-15 14:45:20 +00:00
f1f18c75c9 Gracefully handle optree less than minimum version, part 2 (#151257)
If optree is less than the minimum version, we should pretend it doesn't
exist.

The problem right now is:
- Install optree==0.12.1
- `import torch._dynamo`
- This raises an error "min optree version is 0.13.0"

The fix is to pretend optree doesn't exist if it is less than the min
version.

There are ways to clean up this PR more (e.g. have a single source of
truth for the version, some of the variables are redundant), but I am
trying to reduce the risk as much as possible for this to go into 2.7.

Test Plan:

I verified the above problem was fixed. Also tried some other things,
like the following, which now gives the expected behavior.
```py
>>> import torch
>>> import optree
>>> optree.__version__
'0.12.1'
>>> import torch._dynamo
>>> import torch._dynamo.polyfills.pytree
>>> import torch.utils._pytree
>>> import torch.utils._cxx_pytree
ImportError: torch.utils._cxx_pytree depends on optree, which is
an optional dependency of PyTorch. To use it, please upgrade your optree package to >= 0.13.0
```

I also audited all non-test callsites of optree and torch.utils._cxx_pytree.
Follow along with me:

optree imports
- torch.utils._cxx_pytree. This is fine.
- [guarded by check] f76b7ef33c/torch/_dynamo/polyfills/pytree.py (L29-L31)

_cxx_pytree imports
- [guarded by check] torch.utils._pytree (changed in this PR)
- [guarded by check] torch/_dynamo/polyfills/pytree.py (changed in this PR)
- [guarded by try-catch] f76b7ef33c/torch/distributed/_functional_collectives.py (L17)
- [guarded by try-catch] f76b7ef33c/torch/distributed/tensor/_op_schema.py (L15)
- [guarded by try-catch] f76b7ef33c/torch/distributed/tensor/_dispatch.py (L35)
- [guarded by try-catch] f76b7ef33c/torch/_dynamo/variables/user_defined.py (L94)
- [guarded by try-catch] f76b7ef33c/torch/distributed/tensor/experimental/_func_map.py (L14)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151257
Approved by: https://github.com/malfet, https://github.com/XuehaiPan
2025-04-15 13:08:26 +00:00
12cb11a268 [Inductor UT] Refactor FlexAttention UT and add CPU tests (#144953)
This PR extends and refines the remaining UTs for CPU and more devices in `test/inductor/test_flex_attention.py` and `test/inductor/test_flex_decoding.py`, as a follow-up to https://github.com/pytorch/pytorch/pull/141453

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144953
Approved by: https://github.com/drisspg
2025-04-15 12:44:49 +00:00
2180e87d7c [fbgemm_gpu] Incorporate Torch DSA (#151148)
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1035

X-link: https://github.com/pytorch/FBGEMM/pull/3950

- Incorporate the PyTorch DSA infrastructure into the FBGEMM kernel launcher
  utility

Test Plan:
```
# Nvidia
buck2 test 'fbcode//mode/opt' fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:tensor_accessor_builder
buck2 test 'fbcode//mode/opt' fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:tensor_accessor_builder_with_memcheck
buck2 run 'fbcode//mode/opt'  -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=a100  -c fbcode.platform=platform010 fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:kernel_launcher

# AMD
buck2 run mode/opt-amd-gpu -c fbcode.platform=platform010 fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:tensor_accessor_builder_with_memcheck
buck2 run mode/opt-amd-gpu -c fbcode.platform=platform010 fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:kernel_launcher
buck2 run mode/opt-amd-gpu -c fbcode.platform=platform010 fbcode//deeplearning/fbgemm/fbgemm_gpu/test/tbe:split_embeddings_utils
```

Differential Revision: D72759030

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151148
Approved by: https://github.com/huydhn
2025-04-15 11:34:04 +00:00
70e7b76707 [AOTInductor] Add Python interface for user managed buffer. (#151141)
Summary: Add pybind for user managed buffer in update_constants_buffer.

Test Plan:
Included in commit.
```
python test/inductor/test_aot_inductor.py -k user_managed
```

Differential Revision: D72892310

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151141
Approved by: https://github.com/henrylhtsang, https://github.com/desertfire
2025-04-15 09:36:30 +00:00
bd9c436c99 [Intel GPU][PT2E] Register qconv impls to general qconv_pointwise schema (#151092)
# Motivation
Refer to https://github.com/pytorch/pytorch/pull/150751: a general schema for `qconv_pointwise` was added and `qconv2d_pointwise` was removed from callers. This PR registers the XPU backend implementations to this operator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151092
Approved by: https://github.com/EikanWang, https://github.com/guangyey
2025-04-15 08:42:14 +00:00
a756c50315 [Intel GPU] Avoid using fp32 in sdp math path when benchmark performance. (#150996)
sdp on xpu will fall back to the math path in some cases (i.e. training). In the dynamo benchmark, we prefer to use fp16 for better performance. Although `allow_fp16_bf16_reduction_math_sdp` lives under backends.cuda, its implementation applies to all devices.

I didn't add an `if device == xpu` check here; I suppose cuda devices will not run into the math path anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150996
Approved by: https://github.com/drisspg, https://github.com/EikanWang
2025-04-15 08:08:01 +00:00
ccfce9ae86 Fix score_mod.py dynamic max autotune for backward (#151270)
Same as https://github.com/pytorch/pytorch/pull/148991 but this PR fixes the backward path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151270
Approved by: https://github.com/drisspg, https://github.com/bobrenjc93
2025-04-15 06:33:37 +00:00
afaadce083 [MPSInductor] Adjust memory format detection (#151288)
MPS conv implementation will only yield channels last if input is in channels_last format
Fixes `TestGPUTests.test_conv2d_backward_channels_last` on MacOS-15

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151288
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #151224, #151246, #151272, #151282
2025-04-15 06:25:00 +00:00
b8a2824755 [MPS] Fix logit output for half/bfloat (#151282)
Which also fixes MPSInductor pointwise test
TODO: (as followup PRs): get rid of special native_function.yaml dispatches and use stub
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151282
Approved by: https://github.com/dcci
ghstack dependencies: #151224, #151246, #151272
2025-04-15 06:25:00 +00:00
a2f7764507 [Dynamo] Fix the unimplemented_v2 of EventVariable.call_method in ctx_manager.py (#151208)
Changes:
- Field of `explanations` should be `str` instead of `tuple`
- Not only `torch.cuda.Event`, but also `torch.xpu.Event` can trigger this message.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151208
Approved by: https://github.com/Skylion007
2025-04-15 05:26:39 +00:00
9e20a8411b make einsum unbacked friendly (#151032)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151032
Approved by: https://github.com/pianpwk
2025-04-15 04:35:17 +00:00
5a51de5ab1 [cutlass backend] Add more logs for cutlass backend benchmark (#150639)
The goal is to have a way to tell whether a change makes things better or worse.

```
Average edge over aten (max(-edge, 0), higher is better):
triton: 8.596507086950552 (from 6 valid values)
triton_persistent_tma: 9.517193693923307 (from 6 valid values)
cutlass_lvl_default: 3.3234737908691785 (from 6 valid values)
cutlass_lvl_1111: 7.088173348313991 (from 6 valid values)
cutlass_lvl_2222: 7.291869722320318 (from 6 valid values)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150639
Approved by: https://github.com/ColinPeppler
2025-04-15 04:19:51 +00:00
48b4bc1640 [c10d][fr] Enable FR analysis script for all fast-path coalesce op (#151243)
This PR enables FR for all coalesced ops on the fast path. (Batch P2P is already enabled in the current script, so we mainly focus on non-P2P ops.) To explain what the fast path is, let's revisit how coalesced collectives work today:

For non-P2P coalesced ops, there are several ways to call them (due to legacy reasons):

- Way one: Directly call a Python API such as all_reduce_coalesced. This will be deprecated soon.
- Way two: Directly call an API inside PGNCCL such as allreduce_coalesced. Way one eventually calls into this. It is not deprecated and will not be deprecated, IIUC.
- Way three: Use _coalescing_manager in Python, like:
```
with _coalescing_manager():
    for i in range(num_colls):
        dist.all_reduce(tensors[i])
```
This way has two paths:
   - Fast path: when users call all-reduce, all-gather-into-tensor or reduce-scatter, we launch only one big collective by calling the API from way one.
   - Slow path: we call startCoalescing() at the beginning, then a series of collectives (each one generates an FR entry), and then endCoalescing(). Inside startCoalescing(), groupStart() is called, and inside endCoalescing(), groupEnd() is called. So although this ends up as one collective, we call into PGNCCL for each collective coalesced in the slow path.
   - Uneven all-gather (allgather_v) and reduce-scatter follow the pattern mentioned in the slow path. They directly call the C++ API inside PGNCCL.
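
The fast path described above can be illustrated with a toy model. This is only a sketch: `ToyBackend` and `toy_coalescing_manager` are hypothetical names, not the c10d API, and plain lists stand in for tensors. It shows the collectives being recorded on the Python side and the backend being called exactly once on exit.

```python
from contextlib import contextmanager


class ToyBackend:
    """Hypothetical stand-in for a process group; it only records launches."""

    def __init__(self):
        self.launches = []

    def all_reduce_coalesced(self, tensors):
        # One call into the backend covers every queued tensor.
        self.launches.append(("all_reduce_coalesced", len(tensors)))


@contextmanager
def toy_coalescing_manager(backend):
    queued = []
    # Inside the block, collectives are only recorded on the Python side.
    yield queued
    # On a clean exit, a single coalesced collective is launched for all of them.
    if queued:
        backend.all_reduce_coalesced(queued)


backend = ToyBackend()
tensors = [[1.0], [2.0], [3.0]]
with toy_coalescing_manager(backend) as queue:
    for t in tensors:
        queue.append(t)  # stands in for dist.all_reduce(t)

print(backend.launches)
```

Because the backend sees a single call, a tracer attached to it would see a single entry, which is the property the fast path relies on.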

This PR addresses the fast path because it is the easy case: we store the collectives' info on the Python side and call into PGNCCL only once, so there is only one work and one FR entry. We can treat them as regular coalesced collectives.

We add some e2e unit tests for the build_db function so that the change to FR is more thoroughly tested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151243
Approved by: https://github.com/d4l3k, https://github.com/wz337
2025-04-15 04:08:28 +00:00
f66229de2b [dynamo] Remove traceable_tensor_subclasses-related code (#151062)
Since #149792 deprecates `traceable_tensor_subclasses` and it's been
landed for over a week, we can safely remove all the old code that uses
`traceable_tensor_subclasses` (they were primarily for testing purposes
and are equivalent to no-ops now).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151062
Approved by: https://github.com/mlazos, https://github.com/anijain2305
ghstack dependencies: #151060, #151061
2025-04-15 03:55:35 +00:00
6a1499d209 [dynamo] handle tensor subclass with non-classmethod __torch_function__ (#151061)
As title, this patch fixes bugs in
1. emulating `has_torch_function`
2. emulating calling `__torch_function__`
3. building a callable VT for non-classmethod `__torch_function__`

Fixes #120799, #150265, #150848.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151061
Approved by: https://github.com/anijain2305, https://github.com/mlazos
ghstack dependencies: #151060
2025-04-15 03:55:34 +00:00
73129b8974 [dynamo] Properly handle super().some_classmethod(...) (#151060)
Previously we were passing in the instance as first argument to a
`super().some_classmethod(...)` call, but we should've passed in the
type object instead, per semantics of `@classmethod`.
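
The `@classmethod` semantics referred to above can be checked in plain Python. The class names below are made up for illustration; the snippet shows only the eager behavior that the tracer is expected to match.

```python
class Base:
    @classmethod
    def describe(cls):
        # `cls` is the class the call was dispatched from, not an instance.
        return cls


class Child(Base):
    def describe_via_super(self):
        # Called from an instance method, yet `cls` is bound to type(self).
        return super().describe()


child = Child()
received = child.describe_via_super()
print(received is Child)
```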

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151060
Approved by: https://github.com/Skylion007, https://github.com/mlazos, https://github.com/anijain2305
2025-04-15 03:55:34 +00:00
e178a3aa94 clang-format CUDASymmetricMemory.cu (#151260)
Ported from #146592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151260
Approved by: https://github.com/Skylion007
2025-04-15 02:00:34 +00:00
25803d3a22 Optimize typing in lr_scheduler.py (#151219)
## Changes

- Add typing annotation in `lr_scheduler.py`

## Test Result

```bash
pytest test/optim/test_lrscheduler.py -vv
```

![image](https://github.com/user-attachments/assets/34a91965-ff3a-462a-9ab0-b46ad4b290e9)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151219
Approved by: https://github.com/janeyx99
2025-04-15 01:00:13 +00:00
4ede6705b5 test_store: fix timeout for test_queues (#151252)
Fixes #151216, #151215

Previously I forgot to revert the timeout after setting it for the timeout test.

To prevent this in the future I split the test into 3 different tests so timeout testing is isolated.

Test plan:

Stress tested

```
pytest test/distributed/test_store.py -k queue -v -s --minutes 10
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151252
Approved by: https://github.com/XilunWu
2025-04-15 00:44:19 +00:00
263f08e119 [PP] Add schedule visualizer (#150347)
Added a new private file (`_schedule_visualizer.py`) with some helper methods that can be used to visualize the operations of a schedule and plot with matplotlib.

InterleavedZeroBubble(pp_group=4, microbatches=8):
![image](https://github.com/user-attachments/assets/610ba9a8-7d18-4a99-bcad-6f43e5b23c8c)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150347
Approved by: https://github.com/kwen2501
2025-04-15 00:38:18 +00:00
070357b61a [MPSInductor] Fix silent correctness in bitcast (#151272)
By using Metal `as_type` which according to documentation does exactly
that:
> Metal adds an as_type<type-id> operator to allow any scalar or vector data type (that is not
a pointer) to be reinterpreted as another scalar or vector data type of the same size. The bits in
the operand are returned directly without modification as the new type. The usual type
promotion for function arguments is not performed.

Using `reinterpret_cast` created a potential silent correctness error when dtypes of different sizes were bitcast to each other.
Add an explicit cast to src_type to avoid errors due to type promotion (i.e.
something like (x+1).view(dtype=torch.float16) would work correctly in
eager mode for int16 dtype, but would fail in compile, as arithmetic
operations will promote int16 to int32).
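
The size mismatch can be reproduced without Metal. The following is a minimal sketch using Python's `struct` module as a stand-in for a bitcast; the helper name and the sample value are assumptions chosen for illustration.

```python
import struct


def bitcast_int16_to_half(value: int) -> float:
    # Reinterpret the 2 bytes of an int16 as an IEEE 754 half-precision float.
    return struct.unpack("<e", struct.pack("<h", value))[0]


x = 0x3BFF  # int16 whose successor has the bit pattern of half-precision 1.0
same_size = bitcast_int16_to_half(x + 1)

# If x + 1 is promoted to int32, the value occupies 4 bytes, which no longer
# map onto a single 2-byte half; here they decode as two halves instead.
promoted = struct.unpack("<2e", struct.pack("<i", x + 1))

print(same_size, promoted)
```

Casting back to the source type before the bitcast keeps the operand at the size the caller asked for.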

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151272
Approved by: https://github.com/dcci
ghstack dependencies: #151224, #151246
2025-04-14 23:39:42 +00:00
508b882513 [dynamo][invoke_subgraph] Use FxGraphModule comparison instead of hashing (#150911)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150911
Approved by: https://github.com/zou3519
2025-04-14 23:34:26 +00:00
a24a9c42fb [ROCm] Improve behavior of get_torch_rocm_version helper function on non-ROCm systems. (#151040)
Fixes #150041

Return a zero tuple when ROCm is _not_ supported, similar to what is done for the CUDA version of this function.
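
A zero tuple is convenient because callers can compare versions without a separate availability check. The sketch below is hypothetical: `parse_version_or_zero` and `supports_feature` are illustrative names, and the `(6, 0)` minimum is an arbitrary example value, not taken from the PR.

```python
def parse_version_or_zero(hip_version):
    """Hypothetical helper: parse "major.minor[...]" or fall back to (0, 0)."""
    if not hip_version:
        # No ROCm build: return a tuple that compares below every real version.
        return (0, 0)
    major, minor = hip_version.split(".")[:2]
    return (int(major), int(minor))


def supports_feature(hip_version, minimum=(6, 0)):
    # Tuple comparison works without an "is ROCm available?" branch.
    return parse_version_or_zero(hip_version) >= minimum


print(supports_feature(None), supports_feature("6.2.41133"))
```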

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151040
Approved by: https://github.com/jeffdaily
2025-04-14 22:50:07 +00:00
c9aef50898 Add inductor standalone_compile API (#150670)
This PR adds standalone_compile API that does precompilation via caching to support vLLM use case in the short term while we work on the longer term precompilation solution.

```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
2025-04-14 22:00:09 +00:00
4a47dd9b3f Revert "[map] always turn on dynamo for map (#150962)"
This reverts commit a72d56cb6be8c6ded5678b0b98003c90fd1b5a71.

Reverted https://github.com/pytorch/pytorch/pull/150962 on behalf of https://github.com/Camyll due to breaking internal builds {SHORT_REASON} ([comment](https://github.com/pytorch/pytorch/pull/150962#issuecomment-2803006282))
2025-04-14 21:09:22 +00:00
6a77a0a50c Revert "[map] make proxy mode re-dispatch to fake key (#151034)"
This reverts commit ca2e8cd3528635526a3fe09444139ffa748e97be.

Reverted https://github.com/pytorch/pytorch/pull/151034 on behalf of https://github.com/Camyll due to breaking internal builds {SHORT_REASON} ([comment](https://github.com/pytorch/pytorch/pull/150962#issuecomment-2803006282))
2025-04-14 21:09:21 +00:00
070f389745 Mark auto_functionalized HOPs as cacheable (#151194)
Fixes #151188

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151194
Approved by: https://github.com/oulgen, https://github.com/anijain2305
ghstack dependencies: #151193
2025-04-14 20:05:32 +00:00
dea50b0778 Improve sort with non-constant keys error message (#151193)
Fixes https://github.com/pytorch/pytorch/issues/143505

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151193
Approved by: https://github.com/jansel, https://github.com/anijain2305, https://github.com/williamwen42
2025-04-14 20:05:32 +00:00
46ce8f7df6 [MPSInductor] Cast halfs to floats (#151246)
To avoid accuracy issues when small reductions are unrolled, cast half to float during the `load` op
As `op_math_t<half>` is indeed float

This fixes `test_unroll_small_reduction` for reduced precision types
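
Why accumulating in half precision loses accuracy can be shown in pure Python. This is a sketch under stated assumptions: `struct`'s `e` format emulates half rounding, and Python's 64-bit float stands in for the wider accumulation type.

```python
import struct


def to_half(value: float) -> float:
    # Round a Python float to the nearest IEEE 754 half-precision value.
    return struct.unpack("<e", struct.pack("<e", value))[0]


def sum_in_half(values):
    acc = 0.0
    for v in values:
        # Every partial sum is rounded back to half precision.
        acc = to_half(acc + to_half(v))
    return acc


def sum_in_float(values):
    # Load as half, then accumulate in a wider type.
    return sum(to_half(v) for v in values)


ones = [1.0] * 4096
print(sum_in_half(ones), sum_in_float(ones))
```

The half accumulator stalls at 2048 because 2049 is not representable in half precision and rounds back down, while the wider accumulator reaches 4096.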

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151246
Approved by: https://github.com/dcci
ghstack dependencies: #151224
2025-04-14 19:47:04 +00:00
0a6e1d6b9b Expand docs for nn.functional, and make the wording consistent (#148436)
Expands the docs for the loss functions, and makes the wording consistent.

Fixes #148353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148436
Approved by: https://github.com/albanD
2025-04-14 19:37:12 +00:00
23a3cef5d9 [c10d] Add _allgather_base, reduce_scatter, and _reduce_scatter_base into ProcessGroupMPI to enable FSDP with MPI backend (#150162)
This PR implements _allgather_base, reduce_scatter, and _reduce_scatter_base in the MPI backend (ProcessGroupMPI), enabling support for Fully Sharded Data Parallel (FSDP) in environments that use MPI for distributed communication.

### Context

As noted in https://github.com/pytorch/pytorch/issues/85628, FSDP currently supports only the NCCL backend. Due to this limitation, FSDP cannot run on legacy HPC environments or clusters that rely on MPI.

By implementing just these three collective operations, we can enable FSDP to work with the MPI backend. These collectives are implemented in a similar manner to existing operations such as allgather.

### Testing

We validated this PR using pytorch/build/bin/ProcessGroupMPITest with OpenMPI, and all tests passed successfully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150162
Approved by: https://github.com/H-Huang
2025-04-14 19:31:38 +00:00
7deed1946f Fix assert_tensor_meta (#150808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150808
Approved by: https://github.com/pianpwk
ghstack dependencies: #150806, #150807
2025-04-14 19:28:54 +00:00
53528440e1 Generate meta kernel with operator profiles (#150807)
Added a context manager, `torch._library.fake_profile.register_fake_profile(op_profiles)`, where given an operator profile, it will generate and register a fake impl for the operator based on the operator profile.

The input to `register_fake_profile` is a dictionary mapping operator name to a set of profiles which describe the input and outputs of the operator. Here's an example of a profile for `mylib.foo.default`:
```
"mylib.foo.default": {
    OpProfile(
        args_profile=(
            TensorMetadata(rank=2, dtype=torch.float32, device=torch.device("cpu"), layout=torch.strided,),
            TensorMetadata(rank=2, dtype=torch.float32, device=torch.device("cpu"), layout=torch.strided,),
        ),
        out_profile=TensorMetadata(rank=2, dtype=torch.float32, device=torch.device("cpu"), layout=torch.strided,),
    )
}
```
`foo`'s profile contains only one profile, which says that for 2 input tensors of rank 2, dtype float32, device cpu, we will return one tensor of rank 2, dtype float32, and device cpu.

This will then generate a fake kernel where given 2 input tensors of rank 2 (and the other tensor metadata), we will output one tensor of rank 2 (and the other tensor metadata). If the operator also supports other input ranks, then we can add to the profile for the fake impl to support more input types.
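
The idea of deriving a fake kernel from recorded profiles can be sketched as a lookup table. Everything below is a simplified assumption: `Meta`, `Profile` and `make_fake_impl` are hypothetical stand-ins, strings replace real dtypes and devices, and the actual generated kernel may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Meta:
    """Simplified stand-in for TensorMetadata; dtype and device are strings."""
    rank: int
    dtype: str
    device: str


@dataclass(frozen=True)
class Profile:
    args: tuple
    out: Meta


def make_fake_impl(profiles):
    # Index the recorded profiles by their input signature.
    table = {p.args: p.out for p in profiles}

    def fake_impl(*arg_metas):
        try:
            return table[tuple(arg_metas)]
        except KeyError:
            raise RuntimeError(f"no profile matches inputs {arg_metas}") from None

    return fake_impl


f32_rank2 = Meta(rank=2, dtype="float32", device="cpu")
fake_foo = make_fake_impl({Profile(args=(f32_rank2, f32_rank2), out=f32_rank2)})
print(fake_foo(f32_rank2, f32_rank2))
```

Supporting another input rank then amounts to adding one more `Profile` to the set.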

This profile can either be manually written or created by draft-export, and then checked into the codebase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150807
Approved by: https://github.com/zou3519
ghstack dependencies: #150806
2025-04-14 19:28:54 +00:00
901e37515f [ONNX] Fix bfloat16 support in onnx_program callable (#151121)
- Added a test to guard bfloat16. The optimizer incorrectly turns bfloat16 initializers into uint16, but this is not relevant to export logic.
- Fix bfloat16 support in onnx_program callable

Tested with the following with cuda

```py
import torch

class BfloatModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.param = torch.nn.Parameter(torch.tensor(2.0, dtype=torch.bfloat16))

    def forward(self, x):
        return x * torch.tensor(1.0, dtype=torch.bfloat16) * self.param

input = torch.randn(1, 10, dtype=torch.bfloat16)
model = BfloatModel()
onnx_program = torch.onnx.export(model, (input,), dynamo=True, optimize=False, verify=True)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151121
Approved by: https://github.com/titaiwangms

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-04-14 19:27:29 +00:00
f76b7ef33c Add error check for out variant of tensordot function with requires_grad tensor (#150270)
Fixes #147846. Previously, the out variant of `tensordot` raised no error when `requires_grad=True`. This can cause issues when the out tensor is part of a computation graph.

Enforce that the out variant of tensordot runs only without `requires_grad=True`. The change mirrors #117067.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150270
Approved by: https://github.com/soulitzer
2025-04-14 18:43:14 +00:00
1f5af12cd9 Using hasattr for _boxed_call is asking for trouble (#151130)
Summary:
There are a number of places in the code checking for the existence of `_boxed_call` instead of checking for a `True` value. This is somewhat dangerous because one would assume that setting it to `None` or `False` would be the same as not setting it (output_code.py does this, for example).

Change `hasattr()` to `getattr(..., False)` for these cases.
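
The difference between the two checks is easy to see in isolation. In this sketch `CompiledFn` is a hypothetical stand-in for any object carrying the flag, not a PyTorch class.

```python
class CompiledFn:
    """Hypothetical stand-in for a compiled callable; not a PyTorch class."""

    def __init__(self, boxed_call=None):
        # Some producers set the flag explicitly to None or False.
        self._boxed_call = boxed_call


def is_boxed_hasattr(fn) -> bool:
    # Existence check: true even when the flag is None or False.
    return hasattr(fn, "_boxed_call")


def is_boxed_getattr(fn) -> bool:
    # Truthiness check: a missing, None, or False flag all mean "not boxed".
    return bool(getattr(fn, "_boxed_call", False))


disabled = CompiledFn(boxed_call=False)
enabled = CompiledFn(boxed_call=True)

print(is_boxed_hasattr(disabled), is_boxed_getattr(disabled))
print(is_boxed_hasattr(enabled), is_boxed_getattr(enabled))
```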

Test Plan: unit tests pass

Differential Revision: D72806693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151130
Approved by: https://github.com/Skylion007
2025-04-14 18:36:30 +00:00
6dddd6520d [dynamic shapes] add sym_and, sym_or (#150456)
This has been pretty helpful for the size-oblivious rewrite. Wanted the variadic args version to avoid `sym_or(a, sym_or(b, sym_or(c, d)))` in favor of `sym_or(a, b, c, d)`. Happy to change this to ban the 1-arg version.

This is better than plain and/or because the whole symbolic expression gets preserved, and if we guard on it or defer as a runtime assert, we preserve all branches.
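
A toy model makes the point about preserved branches concrete. This is a sketch only: `Expr`, `sym_or_sketch` and `sym_and_sketch` are hypothetical and merely record the expression as text, whereas the real helpers operate on symbolic booleans.

```python
from functools import reduce


class Expr:
    """Toy symbolic boolean that records its expression instead of evaluating."""

    def __init__(self, text):
        self.text = text

    def __or__(self, other):
        return Expr(f"({self.text} | {other.text})")

    def __and__(self, other):
        return Expr(f"({self.text} & {other.text})")


def sym_or_sketch(first, *rest):
    # Fold with `|` so every operand stays in the resulting expression.
    return reduce(lambda acc, term: acc | term, rest, first)


def sym_and_sketch(first, *rest):
    # Same fold with `&`.
    return reduce(lambda acc, term: acc & term, rest, first)


a, b, c, d = (Expr(name) for name in "abcd")
combined = sym_or_sketch(a, b, c, d)
print(combined.text)

# Plain `or` returns the first truthy operand, so b, c and d are lost.
short_circuited = a or b or c or d
print(short_circuited.text)
```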

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150456
Approved by: https://github.com/laithsakka
2025-04-14 18:18:06 +00:00
785495ee29 [dynamo][error message] Hint for dict_items as inputs to the compiled region (#151169)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151169
Approved by: https://github.com/zou3519
ghstack dependencies: #151164, #151168
2025-04-14 17:38:20 +00:00
3c46808a14 [dynamo] Graph break fixes while tracing inspect module (#151168)
Fixes https://github.com/pytorch/pytorch/issues/139374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151168
Approved by: https://github.com/jansel
ghstack dependencies: #151164
2025-04-14 17:38:20 +00:00
b0bdd76f2e [scan] Autograd with partial gradient support (#146285)
This PR introduces the Autograd feature for scan with partial gradient support. It is a combination of the already opened PRs: https://github.com/pytorch/pytorch/pull/135631 and https://github.com/bohnstingl/pytorch/pull/4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146285
Approved by: https://github.com/ydwu4

Co-authored-by: Yidi Wu <yidi@meta.com>
2025-04-14 17:01:31 +00:00
50abc1ecc4 Super tiny fix typo (#151212)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151212
Approved by: https://github.com/Skylion007
2025-04-14 16:47:40 +00:00
184ac8c7f7 [MPSInductor] Fix noop codegen (#151224)
By adding `pass` in front of the comment for fake set_device call
Which fixes `TestGPU.test_zero_element_mutation_mps`, which previously
failed with
```
torch._inductor.exc.InductorError: RuntimeError: Failed to import /var/folders/sc/2thx6_x95h7_h9qs8s48yh140000gn/T/tmp2emka_sx/7k/c7kmnwhb363ysalhewglr3cwtej6tiz3t4ppqa4bvhubaokmlprw.py
IndentationError: expected an indented block after 'with' statement on line 38 (c7kmnwhb363ysalhewglr3cwtej6tiz3t4ppqa4bvhubaokmlprw.py, line 40)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151224
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/dcci
2025-04-14 16:38:47 +00:00
001695c397 [ROCm][CI] Enable distributed CI on MI300 (#150667)
* Enable distributed CI on MI300 runners, same schedule-based and release-branch triggers as `periodic.yml`; also uses label `ciflow/periodic-rocm-mi300` for triggering on PRs.
* Disabled failing distributed tests on MI300 via Github issues: [151077](https://github.com/pytorch/pytorch/issues/151077), [151078](https://github.com/pytorch/pytorch/issues/151078), [151081](https://github.com/pytorch/pytorch/issues/151081), [151082](https://github.com/pytorch/pytorch/issues/151082), [151083](https://github.com/pytorch/pytorch/issues/151083), [151084](https://github.com/pytorch/pytorch/issues/151084), [151085](https://github.com/pytorch/pytorch/issues/151085), [151086](https://github.com/pytorch/pytorch/issues/151086), [151087](https://github.com/pytorch/pytorch/issues/151087), [151088](https://github.com/pytorch/pytorch/issues/151088), [151089](https://github.com/pytorch/pytorch/issues/151089), [151090](https://github.com/pytorch/pytorch/issues/151090), [151153](https://github.com/pytorch/pytorch/issues/151153)
* Disable failing distributed tests via `skipIfRocm`: ea9315ff95

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150667
Approved by: https://github.com/jeffdaily
2025-04-14 16:19:04 +00:00
cyy
eb19f5abab [2/N] Use internal linkage in aten C++ files (#151070)
Turn functions and variables into static if they are not used outside the ATen cpp files. In some cases, missing header inclusion is added. In other cases, unused functions are removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151070
Approved by: https://github.com/Skylion007
2025-04-14 16:07:17 +00:00
24b3ab9255 Revert "Add inductor standalone_compile API (#150670)"
This reverts commit bbc5fe850454df6860814ab77a1f3a4ca3698157.

Reverted https://github.com/pytorch/pytorch/pull/150670 on behalf of https://github.com/albanD due to Broke profiler test ([comment](https://github.com/pytorch/pytorch/pull/150670#issuecomment-2802067144))
2025-04-14 15:22:33 +00:00
d99236b68c Optimize cdist param description (#151178)
Fixes #151101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151178
Approved by: https://github.com/soulitzer
2025-04-14 13:53:10 +00:00
8497491f38 [ez] remove unused arg in _create_wrapped_callback (#151179)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151179
Approved by: https://github.com/anijain2305, https://github.com/Skylion007
ghstack dependencies: #150753, #150754, #150755, #150828
2025-04-14 12:54:23 +00:00
d5a19e4525 [ez] dynamo fix typo in comment (#150828)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150828
Approved by: https://github.com/anijain2305, https://github.com/Skylion007
ghstack dependencies: #150753, #150754, #150755
2025-04-14 10:09:28 +00:00
5eebcb991a Add scripts to generate plots of LRSchedulers (#149189)
Fixes #92007

## Changes

- Add script to generate plots for `lr_scheduler`
- Add plots to `lr_scheduler` docs
- Add example section if it is missing in `lr_scheduler` docs

## Test Result

### LambdaLR

![image](https://github.com/user-attachments/assets/37fc0894-e2ec-48f2-a2d6-3514e51e1ea2)

### MultiplicativeLR

![image](https://github.com/user-attachments/assets/2122b3a0-a4ce-42c7-bb45-559c1fc73e0f)

### StepLR

![image](https://github.com/user-attachments/assets/47bc9d96-4b60-4586-a000-f213583bbe8f)

### MultiStepLR

![image](https://github.com/user-attachments/assets/c822b849-d5be-4b94-aa7a-0017a2c9ff15)

### ConstantLR

![image](https://github.com/user-attachments/assets/83107cdd-7b00-44a6-b09d-e8ee849b4a12)

### LinearLR

![image](https://github.com/user-attachments/assets/60190105-691a-4101-8966-5b0c396093a4)

### ExponentialLR

![image](https://github.com/user-attachments/assets/dfcbcbca-89e5-4a2f-b1bd-33e25d2405ec)

### PolynomialLR

![image](https://github.com/user-attachments/assets/7c3d4fce-c846-40a0-b62e-f3e81c7e08bd)

### CosineAnnealingLR

![image](https://github.com/user-attachments/assets/26712769-dde9-4faa-b61b-e23c51daef50)

### ChainedScheduler

![image](https://github.com/user-attachments/assets/20734a8b-e939-424f-b45a-773f86f020b1)

### SequentialLR

![image](https://github.com/user-attachments/assets/2cd3ed67-2a0a-4c42-9ad2-e0be090d3751)

### ReduceLROnPlateau

![image](https://github.com/user-attachments/assets/b77f641e-4810-450d-b2cd-8b3f134ea188)

### CyclicLR

![image](https://github.com/user-attachments/assets/29b8666f-41b3-45e4-9159-6929074e6108)

### OneCycleLR

![image](https://github.com/user-attachments/assets/d5b683ef-41e8-4ca8-9fe8-0f1e6b433866)

### CosineAnnealingWarmRestarts

![image](https://github.com/user-attachments/assets/1d45ea80-dea8-494d-a8ab-e9cfc94c55d6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149189
Approved by: https://github.com/janeyx99
2025-04-14 09:53:38 +00:00
5a64476ed6 [Easy] Add output_size in forward method of ConvTranspose2d (#150609)
Fixes #74593

Add description for `forward` in [ConvTranspose2d](https://pytorch.org/docs/stable/generated/torch.nn.ConvTranspose2d.html) doc

## Test Result

![image](https://github.com/user-attachments/assets/eebad7a2-f782-4219-9756-344e0f34fada)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150609
Approved by: https://github.com/mikaylagawarecki

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
2025-04-14 09:53:22 +00:00
01f226bfb8 Add check for ctc_loss targets param (#150981)
Fixes #150835

## Test Result

```python
# cuda
>>> import torch
>>> import torch.nn.functional as F
>>> device = "cuda" # "cpu" is fine
>>> num_classes = 4
>>> log_probs = torch.rand(0, 0, num_classes, device=device)
>>> targets = torch.tensor([], device=device, dtype=torch.long)
>>> input_lengths = torch.tensor([], device=device, dtype=torch.long)
>>> target_lengths = torch.tensor([], device=device, dtype=torch.long)
>>> result = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, reduction='none')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/nn/functional.py", line 3079, in ctc_loss
    return torch.ctc_loss(
           ^^^^^^^^^^^^^^^
RuntimeError: log_probs tensor must not be empty

# cpu
>>> device = "cpu"
>>> num_classes = 4
>>> log_probs = torch.rand(0, 0, num_classes, device=device)
>>> targets = torch.tensor([], device=device, dtype=torch.long)
>>> input_lengths = torch.tensor([], device=device, dtype=torch.long)
>>> target_lengths = torch.tensor([], device=device, dtype=torch.long)
>>> result = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, reduction='none')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/nn/functional.py", line 3079, in ctc_loss
    return torch.ctc_loss(
           ^^^^^^^^^^^^^^^
RuntimeError: log_probs tensor must not be empty

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150981
Approved by: https://github.com/eqy
2025-04-14 07:24:30 +00:00
bbc5fe8504 Add inductor standalone_compile API (#150670)
This PR adds standalone_compile API that does precompilation via caching to support vLLM use case in the short term while we work on the longer term precompilation solution.

```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
2025-04-14 07:07:10 +00:00
189bc9283e [ez] move GuardsContext code comment to the right place (#150755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150755
Approved by: https://github.com/anijain2305, https://github.com/Skylion007
ghstack dependencies: #150753, #150754
2025-04-14 07:03:23 +00:00
9757092aed [executorch hash update] update the pinned executorch hash (#151195)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151195
Approved by: https://github.com/pytorchbot
2025-04-14 05:46:54 +00:00
0d09a33819 [Attention] Always pad in preprocess_mask to avoid recompilations (#150403)
Motivation: for the following script:

```
# demo.py
import torch
import json
from transformers import BertModel, BertConfig

CONFIG = """
{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.6.0.dev0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
"""

config = json.loads(CONFIG)
bloom_config = BertConfig(**config)
model = BertModel(bloom_config).half().cuda()

torch.compiler.reset()
torch.cuda.empty_cache()
compiled_fn = torch.compile(model)
vocab_size = 30522

for b in range(1, 3):
    for s in range(1, 10):
        print(f"🚀 {b} {s}")
        input_ids = torch.randint(0, vocab_size, (b, s)).cuda()
        attention_mask = torch.ones(b, s).cuda()

        with torch.no_grad():
            out = compiled_fn(input_ids, attention_mask).last_hidden_state
```

when we run it with:

```
time TORCH_LOGS=recompiles python demo.py
```

We can see there are 7 recompilations, and it takes 2 mins (fresh build) or 1 min (cached build) on my machine.

One root cause of the recompilations is that there are guards checking the alignment of the inputs (see the patch). So there are unexpected recompilations for `(1, 4)`, `(1, 8)`, `(2, 4)` and `(2, 8)` inputs.

In this patch, we always pad the inputs if we don't know their shape at compilation time, to avoid the guards on alignment. It is fine to always pad the tensor; it won't change the semantics.
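
The "always pad" idea can be sketched with plain lists. The alignment of 8 and the fill value are assumptions for illustration, and both function names are hypothetical; the actual patch operates on tensors. The point is that the same code path runs whether or not the length is already aligned, so nothing branches on alignment.

```python
def padded_length(length: int, alignment: int = 8) -> int:
    # Round up to the next multiple of `alignment` (8 is an assumed value).
    return -(-length // alignment) * alignment


def pad_mask_row(row, alignment: int = 8, fill: float = 0.0):
    # Append `fill` entries so the row length is a multiple of `alignment`.
    # An already aligned row goes through the same path and gains nothing.
    return row + [fill] * (padded_length(len(row), alignment) - len(row))


for seq_len in (3, 4, 8, 9):
    print(seq_len, padded_length(seq_len))
```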

Now there are only 3 recompilations, and it takes 1 min (fresh build) or 17s (cached build) on my machine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150403
Approved by: https://github.com/drisspg
2025-04-14 04:18:22 +00:00
9458b83729 [HPU] Add HPU as a supported device for NestedTensor (#148659)
This change enables basic NestedTensor operations on HPU,
    fixing the runtime error when creating a NestedTensor on HPU.

    - Extended `NestedTensorImpl` to recognize `hpu` as a valid storage device.
    - Added `NestedTensorHPU` to `DispatchKey` parsing in `DispatchKey.cpp`.
    - Updated `torchgen/model.py` to include `NestedTensorHPU` in `dispatch_keys`.
    - Modified `native_functions.yaml` to enable `NestedTensorHPU` support for various ops.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148659
Approved by: https://github.com/jeromean, https://github.com/albanD, https://github.com/sujoysaraswati
2025-04-14 03:42:34 +00:00
9aca00102f [ez][dynamo] remove useless super().__init__() (#150754)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150754
Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/Skylion007
ghstack dependencies: #150753
2025-04-14 03:37:42 +00:00
101c4f482a Docs: Fix typos in the Symbolic Numbers docstrings (#151181)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151181
Approved by: https://github.com/soulitzer
2025-04-14 01:46:02 +00:00
ddfc14b3ae [MPS] Fix where (#151176)
Fixes #150967
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151176
Approved by: https://github.com/kulinseth, https://github.com/malfet
2025-04-13 20:44:50 +00:00
8494d5582a Propagate callable parameter types using ParamSpec (#142306) (#151014)
Partially addresses #142306

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151014
Approved by: https://github.com/Skylion007
2025-04-13 20:38:11 +00:00
3f0931b1de [ez][dynamo] some code movement (#150753)
`optimize_assert` already does the lookup for `backend` and
`backend_ctx_ctor`. This simply moves the lookups within `optimize`
lower so we don't end up calling these functions twice unnecessarily
in the `optimize_assert` path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150753
Approved by: https://github.com/anijain2305, https://github.com/jansel
2025-04-13 15:44:42 +00:00
b0810168a3 Generalize poison fork logic for each device backend (#144664)
# Motivation
Generalize the poison_fork code to make it reusable across different devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144664
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-04-13 09:54:30 +00:00
304633152c Clean up duplicated code in lr_scheduler (#150984)
## Changes

- Remove duplicated code in `ReduceLROnPlateau`
- Remove redundant `noqa` comment

## Test Result

```bash
pytest test/optim/test_lrscheduler.py
```

![image](https://github.com/user-attachments/assets/37f91f31-0e77-4abf-9dd1-75538c0f0792)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150984
Approved by: https://github.com/janeyx99
2025-04-13 09:18:50 +00:00
b59f3d3ae0 [Intel GPU] skip a cuda api call in amp to save some host overhead on xpu (#151111)
This can save ~0.2 ms on non-CUDA devices by skipping the call to `amp_definitely_not_available()`. It can improve small TorchBench models such as lennard_jones by 10% on XPU, in both eager and Inductor modes of the Dynamo benchmarks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151111
Approved by: https://github.com/soulitzer
2025-04-13 06:37:07 +00:00
1c5619ef9c [DTensor] Add DTensor redistribute fwd/bwd datatype conversion to enable SimpleFSDP mixed precision training (#150740)
As titled, this PR adds `forward_dtype` and `backward_dtype` conversion to the DTensor `redistribute` API to enable SimpleFSDP's mixed precision training.

In the forward pass, the DTensor can be configured to be cast to `forward_dtype`; in the backward pass, the DTensor can be configured to be cast to `backward_dtype`.

1. **Correctness**: The end-to-end SimpleFSDP mixed precision training integration has been proved to work properly in the PR from this fork: https://github.com/tianyu-l/pytorch_intern24/pull/20. We are now migrating the code to official PyTorch DTensor.

2. **Example Usage**: There is an example in TorchTitan's SimpleFSDP implementation: https://github.com/pytorch/torchtitan/pull/1060.

In the example below, a DTensor `x` is all-gather'ed along the `self.compute_placements`, with datatype cast to `self.param_dtype`. In the backward pass, additionally, the computed gradients are reduce-scatter'ed along the `self.grad_placements`, with datatype cast to `self.reduce_dtype`.

```python
output = x.redistribute(
        placements=self.compute_placements,
        forward_dtype=self.param_dtype,
        backward_dtype=self.reduce_dtype,
).to_local(grad_placements=self.grad_placements)
```

Under the hood, in `class Redistribute(torch.autograd.Function):`, the `forward` function first takes `x`'s local tensor and converts it to `forward_dtype` before all-gathering `x`.

The `backward` function takes `grad_output` and converts it to `backward_dtype` before reduce-scattering `grad_output`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150740
Approved by: https://github.com/tianyu-l
2025-04-13 05:49:03 +00:00
00c6caaf3d [executorch hash update] update the pinned executorch hash (#150722)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150722
Approved by: https://github.com/pytorchbot
2025-04-13 05:37:33 +00:00
587aec2b4f [dynamo][nn_module] Use method.__self__ to find source for patched methods (#151164)
Fixes https://github.com/pytorch/pytorch/issues/137476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151164
Approved by: https://github.com/jansel
2025-04-13 04:50:19 +00:00
7b1a2373e8 [dynamo][super variable] Fix bug to use correct source (#151154)
Fixes https://github.com/pytorch/pytorch/issues/150994

We should cherry-pick to 2.7 branch if possible, because this breaks torch.compile on some HF models. Look at the issue referenced here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151154
Approved by: https://github.com/jansel
2025-04-13 04:48:52 +00:00
8157e76b79 Revert "[Inductor] Refactor wrapper codegen to use Wrapper IR. (#150458)"
This reverts commit fe7f425de7b76ef33d308d0a03779b97a914d186.

Reverted https://github.com/pytorch/pytorch/pull/150458 on behalf of https://github.com/clee2000 due to broke a lot of tests internally? D72906459 ([comment](https://github.com/pytorch/pytorch/pull/150458#issuecomment-2799578597))
2025-04-13 03:52:42 +00:00
67188cd38d [Testing] Skip test_unspec_inputs_float64_mps (#151167)
As the backend does not support float64

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151167
Approved by: https://github.com/dcci
ghstack dependencies: #151166
2025-04-13 00:41:51 +00:00
d289d1177c [CI] Fix GPUTests.test_scheduler_vertical_fusion1 (#151166)
By enabling the test_operators on MPS device

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151166
Approved by: https://github.com/dcci
2025-04-13 00:41:51 +00:00
9699cc3eb9 [MPSInductor] Fix larger-than-threadgroup Welford reductions (#151152)
By using `welford_combine` primitive in the loop
This fixes `GPUTests.test_multilayer_var_lowp_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151152
Approved by: https://github.com/jansel
ghstack dependencies: #151042, #150824, #151151
2025-04-12 21:44:51 +00:00
7762bddd87 Revert "[MPSInductor] Fix larger-than-threadgroup Welford reductions (#151152)"
This reverts commit 71073caa00836c23e3fc7fcfe1d69b77ffb9d9c9.

Reverted https://github.com/pytorch/pytorch/pull/151152 on behalf of https://github.com/malfet due to Another lint failure ([comment](https://github.com/pytorch/pytorch/pull/151152#issuecomment-2799027274))
2025-04-12 20:27:48 +00:00
3dcb46c30e [easy] Add cache bypass traceback information to cache_info on autograd_cache_bypass (#151025)
This will help us better debug pickling errors, etc, in internal models
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151025
Approved by: https://github.com/masnesral
2025-04-12 19:56:32 +00:00
9d4de265db [AMD] Block mem efficient attention for FP32 in CK backend (#151132)
Summary: CK doesn't support FP32 attention, but aotriton does. If we prefer CK, and the input dtype is FP32, we'll select mem efficient attention but CK doesn't support it. So we'll exclude mem eff attention and pick math.
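
The selection rule described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the function name, the string labels for backends and dtypes, and the `prefer_ck` flag are all hypothetical.

```python
# Hypothetical sketch of the backend choice described in the summary.
# None of these names come from PyTorch; they only model the rule.
def select_attention_backend(dtype: str, prefer_ck: bool) -> str:
    # Candidates in priority order: try mem-efficient first, fall back to math.
    candidates = ["mem_efficient", "math"]
    # CK has no FP32 attention kernel, so when CK is the preferred
    # implementation, mem-efficient attention must be excluded for FP32.
    if prefer_ck and dtype == "float32":
        candidates.remove("mem_efficient")
    return candidates[0]


print(select_attention_backend("float32", prefer_ck=True))
print(select_attention_backend("float16", prefer_ck=True))
```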

Differential Revision: D72880985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151132
Approved by: https://github.com/yoyoyocmu
2025-04-12 19:36:20 +00:00
71073caa00 [MPSInductor] Fix larger-than-threadgroup Welford reductions (#151152)
By using `welford_combine` primitive in the loop
This fixes `GPUTests.test_multilayer_var_lowp_mps`
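
For readers unfamiliar with the technique, here is a plain-Python sketch of combining Welford partial results in a loop over fixed-size chunks, which is the general idea behind handling an input larger than one threadgroup. It is not the Metal kernel from this PR; `chunked_variance` and the `(mean, m2, count)` tuple layout are assumptions made for illustration.

```python
# Illustrative only: a pure-Python model of chunk-wise Welford reduction.
def welford_combine(a, b):
    """Merge two partial results, each a (mean, m2, count) tuple."""
    mean_a, m2_a, n_a = a
    mean_b, m2_b, n_b = b
    n = n_a + n_b
    if n == 0:
        return (0.0, 0.0, 0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (mean, m2, n)


def chunked_variance(values, chunk_size):
    """Reduce a non-empty sequence chunk by chunk, as if each chunk were one group."""
    acc = (0.0, 0.0, 0)
    for start in range(0, len(values), chunk_size):
        partial = (0.0, 0.0, 0)
        for v in values[start:start + chunk_size]:
            # A single element is a partial result with zero spread.
            partial = welford_combine(partial, (float(v), 0.0, 1))
        # Fold this chunk's partial result into the running accumulator.
        acc = welford_combine(acc, partial)
    mean, m2, n = acc
    return mean, m2 / n  # population variance


mean, var = chunked_variance([1, 2, 3, 4, 5, 6, 7, 8], chunk_size=3)
print(mean, var)
```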

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151152
Approved by: https://github.com/jansel
ghstack dependencies: #151042, #150824, #151151
2025-04-12 19:16:33 +00:00
3b86cb8dff [MPSInductor][BE] Implement reduction caching (#151151)
That avoids double/triple invocation of welford reductions when both
mean and deviation must be returned
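
A rough sketch of what such caching looks like in a code generator is shown below. It is a toy model, not the MPS or Halide implementation: the class `ReductionCache`, its methods, and the emitted pseudo-statements are hypothetical.

```python
# Toy code generator: the Welford reduction for a given source expression is
# emitted once, and later requests for mean or variance reuse the result.
class ReductionCache:
    def __init__(self):
        self.cache = {}    # (reduction_type, source) -> temporary name
        self.emitted = []  # generated pseudo-statements, in order

    def welford(self, src):
        key = ("welford_reduce", src)
        if key not in self.cache:
            tmp = f"tmp{len(self.cache)}"
            self.emitted.append(f"{tmp} = welford_reduce({src})")
            self.cache[key] = tmp
        return self.cache[key]

    def mean(self, src):
        return f"{self.welford(src)}.mean"

    def var(self, src):
        tmp = self.welford(src)
        return f"{tmp}.m2 / {tmp}.weight"


cg = ReductionCache()
print(cg.mean("x"), "|", cg.var("x"))
print(cg.emitted)
```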

Code has been copy-n-pasted from the Halide implementation
575f348965/torch/_inductor/codegen/halide.py (L1189-L1191)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151151
Approved by: https://github.com/jansel
ghstack dependencies: #151042, #150824
2025-04-12 19:16:33 +00:00
2653498ff3 [Openreg][PrivateUse1] Refactor csrc files of Pytorch_openreg (#151004)
I want to format and refactor the csrc file of pytorch_openreg. To make the code review clearer and easier to understand, I divide the code refactoring into two parts:

- Part 1: Code formatting
- Part 2: Code refactoring and optimization (Next PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151004
Approved by: https://github.com/albanD
ghstack dependencies: #151000
2025-04-12 17:22:28 +00:00
c181403063 [Openreg][PrivateUse1] Improve openreg module capabilities (#151000)
----

- Add more functionalities for openreg in openreg module
- Remove related functionalities from test_cpp_extensions_open_device_registration.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151000
Approved by: https://github.com/albanD
2025-04-12 17:21:35 +00:00
be24e7b4b4 [dynamo] Use sentinel value for guard filter. (#151131)
Summary: `None` can collide with the real values in the scope, so we should use a separate value. Also added "has_value" to the struct so that it's more clear whether the value is absent or not.
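
The collision is easy to reproduce in plain Python. The sketch below is a generic illustration of the sentinel pattern, not Dynamo's guard filter code; `MISSING`, `describe_entry`, and the dictionary layout are hypothetical.

```python
# Generic sentinel pattern; names are hypothetical and not taken from Dynamo.
class _Missing:
    """Unique marker type meaning 'no value was recorded'."""

    def __repr__(self):
        return "<MISSING>"


MISSING = _Missing()


def describe_entry(scope, name):
    # dict.get(name) would return None both for an absent key and for a key
    # whose real value is None; a dedicated sentinel keeps the cases apart.
    value = scope.get(name, MISSING)
    has_value = value is not MISSING
    return {"name": name, "has_value": has_value, "value": value if has_value else None}


scope = {"x": None, "y": 3}
print(describe_entry(scope, "x"))  # present, value happens to be None
print(describe_entry(scope, "z"))  # absent
```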

Test Plan: CI

Differential Revision: D72881300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151131
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-04-12 15:29:57 +00:00
5b16a0704e Fix license check for setuptools>=77 (#151158)
Fixes #151157

See issue for more information
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151158
Approved by: https://github.com/malfet
2025-04-12 13:41:12 +00:00
7dd2ed1197 [dtensor] add op support for torch._grouped_mm (#151072)
This PR would make TP work with Grouped MM in MoE implementations like https://github.com/pytorch/torchtitan/pull/1084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151072
Approved by: https://github.com/wanchaol, https://github.com/wwwjn
2025-04-12 07:07:44 +00:00
0c59a031c8 [OpenReg][PrivateUse1] add device context for OpenReg Module (#150997)
Add device context support for the OpenReg module, which is required by
some tests such as ``torch.serialization.default_restore_location``
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150997
Approved by: https://github.com/albanD
2025-04-12 06:32:30 +00:00
3e9f4f3f78 docs: allow empty targets tensor in ctc_loss (#151080)
docs: allow empty targets tensor in ctc_loss when target_lengths are zero, as described in issue

Fixes #150995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151080
Approved by: https://github.com/albanD
2025-04-12 05:26:54 +00:00
2f899f07aa Revert "Make export._trace._WrapperModule work in strict mode (#146919)"
This reverts commit dad5e5e2622c82ca272290225abe16ee461d9ac9.

Reverted https://github.com/pytorch/pytorch/pull/146919 on behalf of https://github.com/malfet due to Broke lint, see https://github.com/pytorch/pytorch/actions/runs/14415686353/job/40431799827 ([comment](https://github.com/pytorch/pytorch/pull/146919#issuecomment-2798446930))
2025-04-12 04:12:36 +00:00
dad5e5e262 Make export._trace._WrapperModule work in strict mode (#146919)
Summary:
as title

`export._trace._WrapperModule` is used to wrap functions into a Module so we can export the function.

We add `export._wrapper_utils` to `dynamo`'s `MOD_INLINELIST` so dynamo traces into `_WrapperModule`

Fixes https://github.com/pytorch/pytorch/issues/146867

Test Plan:
```
 buck run fbcode//mode/dev-nosan //caffe2/test:test_export -- -r wrapper_module
```

Differential Revision: D69434316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146919
Approved by: https://github.com/angelayi
2025-04-12 03:22:08 +00:00
19b76bd873 hack to try to fix not empty triton dir (#151119)
Differential Revision: D72741938

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151119
Approved by: https://github.com/hl475, https://github.com/muchulee8, https://github.com/Skylion007
2025-04-12 03:21:41 +00:00
c1470d4dc4 [graph partition] support graphsafe_run_with_rng_state (#150958)
Prior to this PR, `rng_state` was in `V.graph.graph_inputs` but not in the read_writes of any IRNode. As a result, it was not identified as a partition input:
```python
def partition_0(args):
    primals_2, primals_1 = args
    ...
    buf0 = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype=torch.float32, device=device(type='cuda', index=1), pin_memory=False, rng_state=fwd_rng_state_0)
    # <----- access fwd_rng_state_0 but it's not an input
    ...

def call(self, args):
    primals_1, primals_2, fwd_rng_state_0 = args
    ...
    partition0_args = [primals_2, primals_1]
    (buf2, primals_2, primals_1) = self.partitions[0](partition0_args)
     # <---- fwd_rng_state_0 is graph_inputs but is not passed to partitions[0]
     ...
```

This PR fixes this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150958
Approved by: https://github.com/eellison
2025-04-12 03:17:08 +00:00
397d37acc5 [MPSInductor] Naive welford_reduce implementation (#150824)
Literal Python-to-Metal translation of
85549fe6de/torch/_inductor/runtime/triton_helpers.py (L217-L225)

Fixed missing barrier in `welford_combine`
And this is sufficient to make `GPUTests.test_batch_norm_2d_2_mps` pass

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150824
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #151042
2025-04-12 03:11:38 +00:00
32f0f414ab Add some autograd producer consumer stream sync tests (#150952)
Thanks @ngimel and @albanD for some ideas on test cases

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150952
Approved by: https://github.com/albanD
2025-04-12 02:44:09 +00:00
397b7f9b82 [custom ops] Override fake registration (#150806)
Added a flag, `allow_override`, to allow overriding existing kernel implementations in `torch.library.register_fake` and `library.impl`. The default is false: if a user tries to register a kernel to a dispatch key that already contains a kernel, registration raises an error. This flag doesn't apply to CustomOpDefs, where overriding a fake kernel is already allowed.
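
The intended semantics can be modeled with a small stand-alone registry. This is a sketch of the behavior described above, not the `torch.library` implementation; `KernelRegistry` and its methods are hypothetical, and only the flag name `allow_override` comes from the PR description.

```python
# Stand-alone model of "error on duplicate registration unless allow_override".
class KernelRegistry:
    def __init__(self):
        self._kernels = {}

    def register(self, op, dispatch_key, fn, allow_override=False):
        key = (op, dispatch_key)
        if key in self._kernels and not allow_override:
            raise RuntimeError(f"{op} already has a kernel registered for {dispatch_key}")
        self._kernels[key] = fn

    def lookup(self, op, dispatch_key):
        return self._kernels[(op, dispatch_key)]


registry = KernelRegistry()
registry.register("mylib::foo", "Fake", lambda x: "first")
try:
    registry.register("mylib::foo", "Fake", lambda x: "second")
except RuntimeError as err:
    print("rejected:", err)
# Opting in replaces the existing kernel instead of raising.
registry.register("mylib::foo", "Fake", lambda x: "second", allow_override=True)
print(registry.lookup("mylib::foo", "Fake")(None))
```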

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150806
Approved by: https://github.com/zou3519
2025-04-12 02:43:47 +00:00
77407b38a9 Revert "[MPSInductor] Naive welford_reduce implementation (#150824)"
This reverts commit 575f348965abe8ea428eba7098f67ec9764a7f9a.

Reverted https://github.com/pytorch/pytorch/pull/150824 on behalf of https://github.com/malfet due to Linter fails again, landrace this time? ([comment](https://github.com/pytorch/pytorch/pull/150824#issuecomment-2798392241))
2025-04-12 02:22:22 +00:00
f6e9e064a7 [CI][CUDA] xfail grouped gemm unit tests on blackwell (#150982)
On SM100OrLater, expect failures like:

RuntimeError: torch._grouped_mm is only supported on CUDA devices with compute capability = 9.0

To execute this test, run the following from the base repo dir:
    python test/test_matmul_cuda.py TestMatmulCudaCUDA.test_grouped_gemm_3d_2d_strided_False_a_row_major_True_b_row_major_False_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

`
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_False_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0005s] (Issue with numpy versi...) [  2%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_False_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [  4%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_False_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [  6%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_False_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [  8%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_True_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 10%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_True_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 12%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_True_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 14%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_2d_strided_True_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version ...) [ 16%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_False_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versi...) [ 18%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_False_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 20%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_False_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 22%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_False_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 25%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_True_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 27%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_True_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 29%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_True_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 31%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_2d_3d_strided_True_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version ...) [ 33%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_False_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0002s] (Issue with numpy versi...) [ 35%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_False_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 37%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_False_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 39%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_False_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 41%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_True_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 43%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_True_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 45%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_True_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 47%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_2d_strided_True_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version ...) [ 50%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_False_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versi...) [ 52%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_False_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 54%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_False_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 56%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_False_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 58%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_True_a_row_major_False_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy versio...) [ 60%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_True_a_row_major_False_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 62%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_True_a_row_major_True_b_row_major_False_cuda SKIPPED [0.0001s] (Issue with numpy version...) [ 64%]
test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_grouped_gemm_3d_3d_strided_True_a_row_major_True_b_row_major_True_cuda SKIPPED [0.0001s] (Issue with numpy version ...) [ 66%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_2d_fast_accum_False_strided_False_cuda XFAIL [0.8166s]                                        [ 68%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_2d_fast_accum_False_strided_True_cuda XFAIL [0.0017s]                                         [ 70%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_2d_fast_accum_True_strided_False_cuda XFAIL [0.0012s]                                         [ 72%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_2d_fast_accum_True_strided_True_cuda XFAIL [0.0012s]                                          [ 75%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_3d_fast_accum_False_strided_False_cuda XFAIL [0.0033s]                                        [ 77%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_3d_fast_accum_False_strided_True_cuda XFAIL [0.0012s]                                         [ 79%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_3d_fast_accum_True_strided_False_cuda XFAIL [0.0015s]                                         [ 81%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_2d_3d_fast_accum_True_strided_True_cuda XFAIL [0.0012s]                                          [ 83%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_2d_fast_accum_False_strided_False_cuda XFAIL [0.0012s]                                        [ 85%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_2d_fast_accum_False_strided_True_cuda XFAIL [0.0012s]                                         [ 87%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_2d_fast_accum_True_strided_False_cuda XFAIL [0.0011s]                                         [ 89%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_2d_fast_accum_True_strided_True_cuda XFAIL [0.0012s]                                          [ 91%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_3d_fast_accum_False_strided_False_cuda XFAIL [0.0014s]                                        [ 93%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_3d_fast_accum_False_strided_True_cuda XFAIL [0.0012s]                                         [ 95%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_3d_fast_accum_True_strided_False_cuda XFAIL [0.0011s]                                         [ 97%]
test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_grouped_gemm_3d_3d_fast_accum_True_strided_True_cuda XFAIL [0.0011s]                                          [100%]
`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150982
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-04-12 01:53:12 +00:00
fe7f425de7 [Inductor] Refactor wrapper codegen to use Wrapper IR. (#150458)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/146942.

# Feature

This PR refactors the existing wrapper codegen into `WrapperLine` subclasses, extending the existing Memory Planning IR into a fully-fledged Wrapper IR. See the diagram below.

![wrapper_ir](https://github.com/user-attachments/assets/a61db21b-caf3-45d2-bfdb-91066ae4ba6b)

The IR currently supports the following ops:
- All existing memory planning IR ops (`AllocateLine`, `FreeIfNotReusedLine`, etc.)
- Reinterpret views (`ReinterpretLine`)
- Kernel definitions (`KernelDefinitionLine`)
- Calls to defined kernels (`KernelCallLine`)
- Calls to extern kernels (`ExternKernelLine`, `ExternKernelAllocLine`)
- Ops with multiple outputs (`MultiOutputLine`)
- Tensor cleanup at the end of a graph (`FreeLine`)
- Leaving comments in code (`CommentLine`)

There are two main motivations for this refactor:
1. Unlike free-form C++ and Python code, Wrapper IR lines provide structured information about what the wrapper code does. This serves as a natural extension point for other types of wrapper codegen. For example, the parent PR generates FX IR from Wrapper IR. Wrapper IR aims to give new backends enough information to generate wrapper code without needing to modify core Inductor files such as `ir.py`.
2. This design will hopefully promote stronger modularity and encapsulation.
   a. Inductor's core compilation passes don't need to worry about whether they're targeting Python, C++, FX or anything else. They can simply focus on generating Wrapper IR, and target-specific code can be refactored into the various backends.
   b. Backends do not need to know about all the details and internal state of `V.graph` IR. For example, they don't need to consider whether a buffer has been removed from the graph when generating code. Wrapper IR will hopefully provide a simpler interface for generating wrapper code, which abstracts away the details of device code.

# Implementation details

The implementation mainly consists of separating direct C++/Python codegen into two phases:
 1. Emit Wrapper IR lines describing what the wrapper code is supposed to do.
 2. Inside the `codegen()` method of each `WrapperLine`, call backend methods which generate pure Python/C++ code using the information stored in the Wrapper IR line. For example, `KernelCallLine` calls `wrapper._generate_kernel_call_helper`, which is overridden by the various Python and C++ backends to generate the final wrapper code.
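
The two phases above can be sketched in a few lines of plain Python. This is a simplified model rather than Inductor's code: only the name `KernelCallLine` comes from the PR, while the backend classes, the `generate_kernel_call` method, and the emitted strings are hypothetical.

```python
# Simplified model: an IR line stores structured data, and each backend
# decides how to turn that data into target code.
from dataclasses import dataclass


class PythonBackend:
    def generate_kernel_call(self, name, args):
        return f"{name}.run({', '.join(args)})"


class CppBackend:
    def generate_kernel_call(self, name, args):
        return f"{name}({', '.join(args)});"


@dataclass
class KernelCallLine:
    kernel_name: str
    call_args: tuple

    def codegen(self, backend):
        # Phase 2: delegate to the backend, passing the stored information.
        return backend.generate_kernel_call(self.kernel_name, self.call_args)


# Phase 1: emit IR lines that describe what the wrapper should do.
ir_lines = [KernelCallLine("triton_kernel_0", ("buf0", "buf1"))]
for backend in (PythonBackend(), CppBackend()):
    print([line.codegen(backend) for line in ir_lines])
```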

The main difficulty in implementing this is that we need to be careful that code is generated in the correct order. Wrapper codegen happens in two passes: first we write code into `self.lines` which mainly contains wrapper IR, but can also contain raw Python or C++ lines in some situations. Then, we convert the wrapper IR into the final Python/C++ code in `self.wrapper_call`. Since the same macros may be used in both passes, it's difficult to ensure that code is written to the correct buffer. The easiest solution for this was to implement a context manager overriding the `writeline` method to write to  `self.wrapper_call` after memory planning is finished. This way, `writeline` writes to `self.lines` in the first pass, and `self.wrapper_call` in the second. This obviated the need to pass `code` or `writeline` variables all the way through the call stack, which would have touched most of the existing macros.
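
The redirection trick can be illustrated with a toy class. This is a sketch under assumptions, not the Inductor implementation: the PR overrides `writeline` itself, whereas this version swaps an internal target list, and `ToyWrapper` and `write_to_wrapper_call` are hypothetical names.

```python
# Toy illustration: the same writeline() call lands in a different buffer
# depending on which pass is active.
import contextlib


class ToyWrapper:
    def __init__(self):
        self.lines = []         # first pass: wrapper IR and raw lines
        self.wrapper_call = []  # second pass: final generated code
        self._target = self.lines

    def writeline(self, line):
        self._target.append(line)

    @contextlib.contextmanager
    def write_to_wrapper_call(self):
        previous = self._target
        self._target = self.wrapper_call
        try:
            yield
        finally:
            # Always restore the first-pass buffer, even if codegen raises.
            self._target = previous


w = ToyWrapper()
w.writeline("first-pass line")
with w.write_to_wrapper_call():
    w.writeline("second-pass line")
print(w.lines, w.wrapper_call)
```

Because helpers keep calling `writeline` unchanged, no buffer argument has to be threaded through the call stack.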

# Test plan

Since this refactor touches all the existing wrapper codegen classes, the existing CI provides good coverage.

The parent PR introduces new tests for the FX IR backend. Among other things, these tests assert that `self.lines` only contains Wrapper IR lines, and no free-form code. While this would not be true of all programs today, the tests suggest that the IR implemented in this PR is sufficient to cover basic PyTorch usage.

# Future directions

These two goals are only partially realized by this PR. There are several important steps which still undergo direct Python/C++ codegen in core files:
 - User-defined Triton kernels.
 - Reinterpret views on outputs, from `gen_output_refs()`. (In the parent PR, the FX converter has a custom way of handling this. This can eventually be ported into Wrapper IR.)
 -  Fallback ops with custom `codegen()` methods, e.g. `ScatterFallback`.
 -  Misc. C++ lines emitted by the various cpp backends, e.g. declaring constants.

These cases will gradually be handled in subsequent PRs, as the Inductor->FX converter expands its coverage. Given that these refactors are pretty tricky to do, it seems wiser to execute them in stages, as opposed to porting everything to Wrapper IR at once. Some Python and C++ codegen still lives in core files such as `ir.py`, as described in previous sections. Hopefully, this PR will serve as a starting point which moves the codebase towards a more modular design. Over time, we can gradually refactor the remaining codegen (mainly in `ir.py`) into backend classes.

One limitation of this PR is that codegen still happens in two phases during `PythonWrapperCodegen`. First, we generate Wrapper IR into `self.lines`, and from there we generate Python or C++ code into `self.wrapper_call`, `self.header`, etc. In the long term, it would be cleaner to split wrapper IR into its own class which doesn't deal with Python/C++ codegen at all. (See the diagram at the top.) That would strictly enforce the boundary between Wrapper IR and Python/C++ wrapper code. However, this would probably be a much larger refactor.

Another limitation of the current code is that the helper functions have a lot of call args. It's also possible to clean this up by passing Wrapper IR ops e.g. `KernelCallLine` into helper functions like `_generate_kernel_call_helper`, since they store all the arguments. However, that change would likely be prone to merge conflicts, so I would like to save it for follow-up PRs if possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150458
Approved by: https://github.com/eellison
2025-04-12 01:15:19 +00:00
575f348965 [MPSInductor] Naive welford_reduce implementation (#150824)
Literal Python-to-Metal translation of
85549fe6de/torch/_inductor/runtime/triton_helpers.py (L217-L225)

Fixed missing barrier in `welford_combine`
And this is sufficient to make `GPUTests.test_batch_norm_2d_2_mps` pass

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150824
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #151042
2025-04-12 00:46:01 +00:00
83f14c0b06 Revert "[MPSInductor] Naive welford_reduce implementation (#150824)"
This reverts commit 5edfb4c4fad1bb9504482d930a2540d22427d383.

Reverted https://github.com/pytorch/pytorch/pull/150824 on behalf of https://github.com/malfet due to I should have waited for lint ([comment](https://github.com/pytorch/pytorch/pull/150824#issuecomment-2798249264))
2025-04-12 00:21:14 +00:00
ca2e8cd352 [map] make proxy mode re-dispatch to fake key (#151034)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151034
Approved by: https://github.com/zou3519
ghstack dependencies: #150962
2025-04-11 23:28:06 +00:00
a72d56cb6b [map] always turn on dynamo for map (#150962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150962
Approved by: https://github.com/zou3519
2025-04-11 23:28:06 +00:00
5edfb4c4fa [MPSInductor] Naive welford_reduce implementation (#150824)
Literal Python-to-Metal translation of
85549fe6de/torch/_inductor/runtime/triton_helpers.py (L217-L225)

Fixed missing barrier in `welford_combine`
And this is sufficient to make `GPUTests.test_batch_norm_2d_2_mps` pass

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150824
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #151042
2025-04-11 23:21:35 +00:00
eqy
c4f826d5e8 [CUDA][TF32] Account for TF32 in test_alexnet_prefix (#150970)
Mainly seems to be an issue on Blackwell with e.g.,
```
Mismatched elements: 1 / 746496 (0.0%)
Greatest absolute difference: 0.005461275577545166 at index (2, 32, 11, 9)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150970
Approved by: https://github.com/soulitzer
2025-04-11 23:13:54 +00:00
2d187bf7e6 Support tuning of _scaled_grouped_mm (#150421)
This includes the default aten implementation, as well as a Triton
implementation imported from FBGEMM
(https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/experimental/gemm/triton_gemm/grouped_gemm.py)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150421
Approved by: https://github.com/ngimel
2025-04-11 23:03:49 +00:00
c3bc6b3542 [DTensor] Fix empty shard global-offset calculation (#150862)
`compute_local_shape_and_global_offset` util computes the local shape of
a particular shard of a DTensor, and the global offset (which describes
how the shard fits into the global tensor).

When the tensor dim does not evenly divide into the mesh dim, uneven
sharding occurs.  In some cases, uneven sharding results in an empty
shard.

e.g.
   tensor dim size: 4096
   mesh dim size: 30
   ranks 0..27 have local size 18
   rank 28 has local size 8
   rank 29 has local size 0 <--- empty shard

The global offset for an empty shard was previously undefined and
returned values that were computed based on logic that assumes no empty
shards.  This caused DCP to fail to save a checkpoint, because
deduplication logic could 'throw away' real (non-empty) shards thinking
they were duplicates of zero-sized shards with the same offset.

Now, we define the global offset of an empty shard to be the dim-size,
which is out of bounds of the tensor and can't overlap with any
non-empty shards.
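
A minimal sketch of the rule above, assuming ceil-based chunking along a single dimension (the way `torch.chunk` splits). The name `local_size_and_offset` is hypothetical and is not the signature of `compute_local_shape_and_global_offset`. The local sizes listed in the example add up to 28 * 18 + 8 = 512, so the sketch uses a dim size of 512 over 30 shards.

```python
def local_size_and_offset(dim_size: int, num_shards: int, rank: int) -> tuple[int, int]:
    """Return (local_size, global_offset) for one shard along one dim."""
    chunk = -(-dim_size // num_shards)  # ceil(dim_size / num_shards)
    start = min(rank * chunk, dim_size)
    end = min(start + chunk, dim_size)
    size = end - start
    # An empty shard gets offset == dim_size: out of bounds, so it can
    # never share an offset with a non-empty shard.
    offset = start if size > 0 else dim_size
    return size, offset


for rank in (0, 27, 28, 29):
    print(rank, local_size_and_offset(512, 30, rank))
```

Because every non-empty shard has an offset strictly below the dim size, deduplicating by offset can no longer confuse an empty shard with a real one.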

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150862
Approved by: https://github.com/teja-rao, https://github.com/XilunWu
2025-04-11 22:25:57 +00:00
85549fe6de Add __all__ for torch.utils.dlpack (#149026)
Fixes the issue:

```python
torch.utils.dlpack.to_dlpack(tensor)  # "to_dlpack" is not exported from module "torch.utils.dlpack" Pylance[reportPrivateImportUsage](https://github.com/microsoft/pyright/blob/main/docs/configuration.md#reportPrivateImportUsage)
```

the docs for `torch.utils.dlpack`: https://pytorch.org/docs/stable/dlpack.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149026
Approved by: https://github.com/mikaylagawarecki
2025-04-11 22:03:24 +00:00
2a909cab16 Update ninja missing error message (#147698)
In cpp_extensions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147698
Approved by: https://github.com/Skylion007
2025-04-11 21:56:53 +00:00
a78ac409b5 [AOTI] Add _weight_int4pack_mm to the C shim fallback list (#151059)
Summary: As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151059
Approved by: https://github.com/yushangdi
2025-04-11 21:22:35 +00:00
12281f9c18 [dynamo] Deprecate enable_cpp_framelocals_guard_eval config variable - default: True (#151008)
[dynamo] Deprecate enable_cpp_framelocals_guard_eval config variable - default: True

Reading the feature-enabling param `enable_cpp_framelocals_guard_eval` at the C++ level is time-consuming and slows down dynamo, as it is done every time a function using this param is called. Reading the value only once at init isn’t an option, as it would prevent modifying this param at runtime. Since this feature has been enabled by default for some time and doesn’t cause known issues, the `enable_cpp_framelocals_guard_eval` configuration param is deprecated by this commit and its value is hardcoded to true.

Local microbenchmark dynamo_guard_eval.py:
- 931.9 us -> 538.9 us (3.10)

@williamwen42 @jansel @anijain2305

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151008
Approved by: https://github.com/williamwen42
2025-04-11 21:07:59 +00:00
8910e4f2bb Fix 32-bit indexing overflows in ReducedPrecisionGemV (#150949)
By changing `lda` type from `int` to  ~~`long`~~ `int64_t`

Add regression test (but probably restrict it to CPUs, or maybe skip float32 testing on GPUs)
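
To illustrate why a 32-bit `lda` is a problem, here is a small sketch that emulates the wraparound in Python with `ctypes`. The matrix dimensions are made up for illustration and are not taken from the linked issue; the helper names are hypothetical.

```python
import ctypes


def offset_int32(row: int, lda: int) -> int:
    """Emulate `row * lda` stored in a 32-bit signed integer.

    Signed overflow is undefined behavior in C++; this shows the usual
    two's-complement wraparound seen in practice.
    """
    return ctypes.c_int32(row * lda).value


def offset_int64(row: int, lda: int) -> int:
    """The same product stored in a 64-bit signed integer."""
    return ctypes.c_int64(row * lda).value


row, lda = 50_000, 50_000
print(offset_int32(row, lda))  # negative: the offset wrapped around
print(offset_int64(row, lda))  # the intended offset
```

Any element offset above 2**31 - 1 wraps once it passes through a 32-bit value, which is why widening the type is enough to fix the indexing.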

Fixes https://github.com/pytorch/pytorch/issues/150637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150949
Approved by: https://github.com/Skylion007
2025-04-11 20:55:20 +00:00
05236b5045 Allow OpaqueTensorImpl to be used for views (#151028)
Summary:
When creating an `OpaqueTensorImpl`, currently there's only an option to create it for a non-view tensor, but it can be useful to create one for view tensors as well.

View tensors should contain the same autograd parameters as the original tensor, whereas non-view tensors get created with whatever `inference_mode` option is currently enabled. For this reason, `TensorImpl` has a special view constructor that takes `TensorImpl::ImplType` as its first parameter, so adding a new constructor to `OpaqueTensorImpl` that does the same thing allows us to create views with it.

Test Plan: CI

Reviewed By: scottxu0730

Differential Revision: D71748460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151028
Approved by: https://github.com/scottxu0730, https://github.com/chaos5958
2025-04-11 20:07:47 +00:00
bb60e82672 c10d/Store: add queues (#150969)
This adds queue operations as described in https://github.com/pytorch/pytorch/issues/150943.

This works by adding two new operations `queue_push` and `queue_pop`. The semantics are designed to be blocking with a timeout. Pushing will always succeed as the queue is infinite size. Popping will first call `wait` until the key is ready and then pop the value from the queue.

This implements queues for only: HashStore, TCPStore w/ libuv. FileStore and the legacy backends are not supported.

`wait` and `check` work for queue operations though queue_push will only wake up the first waiter rather than all of them.
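
The semantics described above (unbounded push, blocking pop with a timeout, one waiter woken per push) can be modeled with a toy in-memory store. This is a hypothetical sketch built on `threading`, not the c10d implementation; the class name `ToyQueueStore` and the `timeout` parameter are assumptions made for illustration.

```python
import threading
from collections import defaultdict, deque


class ToyQueueStore:
    """Hypothetical in-memory model of the queue semantics; not c10d code."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._queues = defaultdict(deque)
        # One condition variable per key, all sharing the same lock.
        self._conds = defaultdict(lambda: threading.Condition(self._lock))

    def queue_push(self, key: str, value: bytes) -> None:
        with self._lock:
            self._queues[key].append(value)  # unbounded: never blocks on capacity
            self._conds[key].notify()        # wake a single waiter on this key

    def queue_pop(self, key: str, timeout: float = 1.0) -> bytes:
        with self._lock:
            cond = self._conds[key]
            # Block until the key has a value or the timeout expires.
            if not cond.wait_for(lambda: self._queues[key], timeout=timeout):
                raise TimeoutError(f"queue_pop timed out on {key!r}")
            return self._queues[key].popleft()


store = ToyQueueStore()
store.queue_push("foo", b"1")
store.queue_push("foo", b"2")
print(store.queue_pop("foo"), store.queue_pop("foo"))
```

Using `notify()` rather than `notify_all()` mirrors the note that a push only wakes the first waiter.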

This also has a few cleanups to error types/documentation in related code.

Example trace:

```
[I409 16:51:43.963833529 TCPStoreLibUvBackend.cpp:829] [c10d - trace] validate magic:1015412686 address:[localhost]:55816
[I409 16:51:43.963845838 TCPStoreLibUvBackend.cpp:842] [c10d - trace] ping nonce:2840795 address:[localhost]:55816
[I409 16:51:43.963902914 TCPStoreLibUvBackend.cpp:911] [c10d - trace] add key:init/ val:1 address:[localhost]:55816
[I409 16:51:43.963939389 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:init/ address:[localhost]:55816
[I409 16:51:43.963974842 TCPStoreLibUvBackend.cpp:893] [c10d - trace] get key:init/ address:[localhost]:55816
[I409 16:51:43.964071909 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/test_queue_support address:[localhost]:55816
[I409 16:51:43.964080221 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964108584 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964123207 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964128194 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964156347 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964187493 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964217709 TCPStoreLibUvBackend.cpp:1133] [c10d - trace] queue_pop key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964324300 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964354495 TCPStoreLibUvBackend.cpp:1133] [c10d - trace] queue_pop key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964416299 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964458733 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/non_existant address:[localhost]:55816
[W409 16:51:43.974516585 socket.cpp:460] [c10d] waitForInput: poll for socket SocketImpl(fd=75, addr=[localhost]:55816, remote=[localhost]:46641) returned 0, likely a timeout
[W409 16:51:43.974559169 socket.cpp:485] [c10d] waitForInput: socket SocketImpl(fd=75, addr=[localhost]:55816, remote=[localhost]:46641) timed out after 10ms
[I409 16:51:43.974600451 TCPStoreLibUvBackend.cpp:1101] [c10d - trace] cancel_wait address:[localhost]:55816
```

Test plan:

```
$ pytest test/distributed/test_store.py -k queue -v -s

test/distributed/test_store.py::FileStoreTest::test_queues SKIPPED [0.4351s] (Store does not support queues)
test/distributed/test_store.py::HashStoreTest::test_queues PASSED [0.0009s]
test/distributed/test_store.py::PrefixFileStoreTest::test_queues SKIPPED [0.0006s] (Store does not support queues)
test/distributed/test_store.py::TCPStoreTest::test_queues SKIPPED [0.0012s] (Store does not support queues)
test/distributed/test_store.py::LibUvTCPStoreTest::test_queues PASSED [0.0014s]
test/distributed/test_store.py::PrefixTCPStoreTest::test_queues PASSED [0.0014s]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150969
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2025-04-11 19:24:17 +00:00
83ae61fd8e [Inductor] Add Subgraph as a Autotuning Choice (#150653)
Add the option for providing a Subgraph as an autotuning choice in Inductor. This is crucial for implementing the split-k optimization for GEMMs by decomposing a mm -> bmm. https://github.com/pytorch/pytorch/pull/150654 uses these changes to add decomposeK as a default autotuning choice for aten.mm in Inductor.

Using https://github.com/pytorch/pytorch/pull/150654 and a simple script:

```
import torch

def f(a, b):
    return torch.matmul(a, b)

def decompose_func(a_in, b_in):
    M, K = a_in.shape
    K, N = b_in.shape

    # TODO: Ideally we want to autotune over this parameter
    kPartitions = 256
    assert K % kPartitions == 0, "K must be divisible by kPartitions"
    B = K // kPartitions

    a_reshaped = a_in.reshape(M, B, kPartitions).transpose(
        0, 1
      )  # Shape: (B, M, kPartitions)
    b_reshaped = b_in.reshape(B, kPartitions, N)  # Shape: (B, kPartitions, N)
    result = torch.bmm(a_reshaped, b_reshaped)  # Shape: (B, M, N)
    return result.sum(dim=0).to(torch.float16)  # Sum over B dimension, Shape: (M, N)

for k in [4096, 8192, 12288, 16384, 20480, 24576, 28672, 32768]:
    a = torch.randn(32, k, dtype=torch.float16, device="cuda", requires_grad=True)
    b = torch.randn(k, 32, dtype=torch.float16, device="cuda", requires_grad=True)

    compiled_res = torch.compile(f, dynamic=False)(a, b)
    decompose_res = decompose_func(a, b)

    print(f"Compiled mm result close to aten: {torch.allclose(f(a, b), compiled_res, atol=1e-5, rtol=0.5)}")
    print(f"Compiled mm result close to decompose: {torch.allclose(decompose_res, compiled_res, atol=1e-5, rtol=0.5)}")
```

we are able to autotune the decomposeK optimization against aten and the traditional Triton templates in Inductor. DecomposeK is about 10% faster than aten on average and more than 4x faster than the best Triton templates on an H100 machine, e.g.:

```
AUTOTUNE mm(32x28672, 28672x32)
  decompose_k_mm 0.0126 ms 100.0%
  mm 0.0144 ms 87.5%
  triton_mm_69 0.0579 ms 21.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_75 0.0677 ms 18.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
  triton_mm_76 0.0850 ms 14.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_68 0.1444 ms 8.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_72 0.1546 ms 8.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_74 0.1819 ms 6.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
  triton_mm_67 0.1917 ms 6.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=4
  triton_mm_73 0.2766 ms 4.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
```
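
For this one shape, the speedups can be recomputed from the times in the table. The snippet below is plain arithmetic on the printed (rounded) values, so the last digit may differ slightly from what the autotuner computed internally.

```python
# Times (ms) copied from the AUTOTUNE output above.
decompose_k_ms = 0.0126
aten_mm_ms = 0.0144
best_triton_ms = 0.0579  # triton_mm_69, the fastest Triton template listed

speedup_vs_aten = aten_mm_ms / decompose_k_ms
speedup_vs_triton = best_triton_ms / decompose_k_ms

print(f"decompose_k vs aten mm:     {speedup_vs_aten:.2f}x")
print(f"decompose_k vs best triton: {speedup_vs_triton:.2f}x")
```

That is roughly 1.14x over aten and 4.6x over the best Triton template for the 32x28672 by 28672x32 case.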

https://pastebin.com/g3FMaauT is the generated code from Inductor containing the subgraph decomposition for aten.mm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150653
Approved by: https://github.com/eellison
2025-04-11 19:08:43 +00:00
ad5e9065ac [Profiler/Easy] Remove temp flag for on-demand Memory Snapshot (#151068)
Summary: Now that we have profiler impl in we don't need the temporary flag. submodule update too.

Test Plan: CI

Reviewed By: sanrise

Differential Revision: D72672186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151068
Approved by: https://github.com/davidberard98
2025-04-11 18:50:25 +00:00
fe961679d5 [Inductor] add support for disabling atomic adds (#151033)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151033
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-04-11 18:41:56 +00:00
67d3053d4b Revert "update benchamark result due to <1% regression (#150937)"
This reverts commit 860765d621e14730f8b6e7344da0053c4f00d540.

Reverted https://github.com/pytorch/pytorch/pull/150937 on behalf of https://github.com/laithsakka due to regression diff reverted ([comment](https://github.com/pytorch/pytorch/pull/150937#issuecomment-2797611127))
2025-04-11 17:36:47 +00:00
6b32255e37 [c10d][fr] Add logging of nccl_version into fr and its dump (#151048)
Users also want to see the nccl version in the FR dump, so let's add it to FR. We only add it per rank per PG nccl comm, so this really only adds a couple of bytes to FR memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151048
Approved by: https://github.com/kwen2501
2025-04-11 17:36:09 +00:00
5f5805a6ac Cache the value of torch_key in subproc (#151057)
No need to recalculate torch_key in subprocs, let's pass it from the main process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151057
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
2025-04-11 17:30:23 +00:00
fc1cccd012 Register also future allocations in mempool with NCCL (#150684)
This is the final PR, where everything comes together.

The problem I'm trying to solve is the following: when we register a MemPool with the NCCL ProcessGroup, it calls `ncclCommRegister` on all the allocations that are _currently_ in the pool. However, any later allocation will _not_ be registered with the NCCL communicator!

This is terribly inconvenient, because it means that every piece of code that allocates a tensor must be changed to become aware of whether it's doing so within a private pool, and it must become aware of NCCL and of all the PGs in existence, in order to re-register that pool with them.

Moreover, I believe there can be performance implications because allocating tensors is usually done in the critical path (i.e., during the forward and backward of every step of a training), whereas registering memory is a slow operation that should be done once at init time.

With this PR, once the user registers a Mempool with the NCCL PG, we install some hooks into the CachingAllocator in order to listen for all future memory allocations and, if they belong to the pool, we automatically call `ncclCommRegister` on them! (In fact, we reuse the hooks that already exist for `TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK`).
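
The hook mechanism can be pictured with a toy model. Everything below is hypothetical: `ToyAllocator`, `ToyComm`, and the event fields only imitate the idea of an allocator trace event carrying a pool id, and appending to a list stands in for the real `ncclCommRegister` call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class TraceEvent:
    action: str       # e.g. "segment_alloc"
    addr: int
    size: int
    mempool_id: tuple  # stand-in for the pool id attached to each event


@dataclass
class ToyAllocator:
    hooks: list = field(default_factory=list)

    def attach_hook(self, hook: Callable[[TraceEvent], None]) -> None:
        self.hooks.append(hook)

    def allocate(self, addr: int, size: int, mempool_id: tuple) -> None:
        event = TraceEvent("segment_alloc", addr, size, mempool_id)
        for hook in self.hooks:  # every hook sees every future allocation
            hook(event)


@dataclass
class ToyComm:
    registered_pools: set = field(default_factory=set)
    registered_addrs: list = field(default_factory=list)

    def on_trace_event(self, event: TraceEvent) -> None:
        # Only allocations from a registered pool get registered.
        if event.action == "segment_alloc" and event.mempool_id in self.registered_pools:
            self.registered_addrs.append(event.addr)  # stands in for ncclCommRegister


allocator, comm = ToyAllocator(), ToyComm()
comm.registered_pools.add((0, 1))
allocator.attach_hook(comm.on_trace_event)
allocator.allocate(addr=0x1000, size=4096, mempool_id=(0, 1))  # registered
allocator.allocate(addr=0x2000, size=4096, mempool_id=(0, 2))  # ignored
print(comm.registered_addrs)
```

The point of the design is that code allocating tensors does not need to know about NCCL or process groups; the pool id on the event is enough for the hook to decide.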

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150684
Approved by: https://github.com/kwen2501
ghstack dependencies: #150683
2025-04-11 17:26:37 +00:00
99642182f2 Add mempool to allocator's trace events (#150683)
In the NCCL ProcessGroup we want to support being able to "register" with NCCL all the allocations that belong to a certain private MemPool. In order to do so on-the-fly for every new allocation, we register a hook for the CachingAllocator's TraceEvents. However, we were lacking a way to know whether a given TraceEvent belonged to the MemPool that we cared about or not. With this PR, we add a MempoolId_t field to the TraceEvents.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150683
Approved by: https://github.com/syed-ahmed, https://github.com/kwen2501
2025-04-11 17:26:37 +00:00
d385179886 [dtensor] add op support for torch.cumsum (#151071)
For `torch.cumsum`, any sharding placement should propagate through if the cumsum `dim` is not sharded; otherwise it needs to be replicated first.
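
A small pure-Python check of why this rule holds, using nested lists in place of tensors. The `cumsum` helper is written for this sketch only and is not DTensor code.

```python
from itertools import accumulate


def cumsum(matrix, dim):
    """Cumulative sum of a 2-D list of lists along `dim` (0 or 1)."""
    if dim == 1:
        return [list(accumulate(row)) for row in matrix]
    cols = [list(accumulate(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*cols)]


full = [[1, 2, 3, 4], [5, 6, 7, 8]]

# Sharded on dim 0, cumsum on dim 1: each shard can compute locally.
shards_dim0 = [full[:1], full[1:]]
local = [cumsum(shard, dim=1) for shard in shards_dim0]
assert local[0] + local[1] == cumsum(full, dim=1)

# Sharded on dim 1, cumsum on dim 1: the second shard misses the running
# total from the first, so stitching local results gives a wrong answer.
shards_dim1 = [[row[:2] for row in full], [row[2:] for row in full]]
local = [cumsum(shard, dim=1) for shard in shards_dim1]
stitched = [left + right for left, right in zip(*local)]
assert stitched != cumsum(full, dim=1)
```

The second case is the one where the input has to be replicated before the op runs.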

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151071
Approved by: https://github.com/wanchaol
2025-04-11 16:42:19 +00:00
1fe260f7c4 [cutlass backend] Add and fix logs, fix types, and make cutlass generator only generate GEMM (#150973)
Differential Revision: [D72760205](https://our.internmc.facebook.com/intern/diff/D72760205/)

We hardcoded to only use GEMM anyway.

This also raises the problem with high instantiation level. As the instantiation level goes higher (here it is 3333), the time it takes to list the configs might be long already (here it is >3 minutes).

If we know exactly what configs we care about, we should have a way to generate them without calling generators. But let's see if we need that.

using this script
```
import os

os.environ["TORCH_LOGS"] = "inductor"

import torch

import torch._inductor.config

torch._inductor.config.max_autotune = True
torch._inductor.config.force_disable_caches = True
torch._inductor.config.max_autotune_gemm_backends = "Aten,CUTLASS"
# intentionally use no cutlass ops
torch._inductor.config.cuda.cutlass_max_profiling_configs = 0
torch._inductor.config.cuda.cutlass_instantiation_level = "3333"

def main():
    M = 128
    dtype = torch.float16
    A = torch.randn(M, M, device="cuda", dtype=dtype)
    B = torch.randn(M, M, device="cuda", dtype=dtype)

    compiled_model = torch.compile(torch.mm)

    _ = compiled_model(A, B)
    print("done")

if __name__ == "__main__":
    main()
```

before, with logs:
```
CUTLASS library generated 7 operations in 235.03 seconds
Got cutlass configs: total number of ops: 4753. Filtering took 10.51 seconds
```

after:
```
CUTLASS library generated 1 operations in 207.39 seconds
Got cutlass configs: total number of ops: 4753. Filtering took 9.53 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150973
Approved by: https://github.com/ColinPeppler
2025-04-11 16:24:26 +00:00
f1364431f0 Add debug_lines of FXGraphCacheKey to AOTAutogradCacheEntry (#150594)
Previously we didn't save debug_lines because it's pretty large, but compared to the size of FXGraphCache entries it's still pretty small. So let's add it to AOTAutogradCache for easier debuggability.

Differential Revision: [D72361611](https://our.internmc.facebook.com/intern/diff/D72361611/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150594
Approved by: https://github.com/oulgen
2025-04-11 15:24:13 +00:00
38bec787fa cleanup JK for duplicate pt2 compile callbacks prevention (#148704)
Summary: This diff cleans up the JK we used for enabling `add pt2 callbacks for backward pass and prevent duplicate callbacks` feature.

Differential Revision: D70643543

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148704
Approved by: https://github.com/mlazos
2025-04-11 15:17:06 +00:00
91920661b4 Don't log benchmarking event to Scuba (#151053)
These two events are really common, and also make up a huge portion of logs (~70%) we get internally in PT2 Compile Events. I don't think it's actually that useful to aggregate them, so instead of logging them to PT2 Compile Events, let's log them only to chromium.

These two events will still be visible from tlparse: they just won't be in our internal tables. Please let me know if folks disagree.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151053
Approved by: https://github.com/oulgen, https://github.com/masnesral
2025-04-11 14:56:36 +00:00
d94cc0e994 Optimize ConvTranspose2d stride description (#150819)
Fixes #150775

## Test Result

### Before

![image](https://github.com/user-attachments/assets/81cd932f-9447-4924-9553-a5cb88fc5d0e)

### After

![image](https://github.com/user-attachments/assets/6365c71c-7268-4226-b722-ee7446cb2467)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150819
Approved by: https://github.com/jbschlosser
2025-04-11 09:37:56 +00:00
183bca41de [dynamo] unimplemented -> unimplemented_v2 in variables/builder.py (#151044)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151044
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-04-11 09:07:01 +00:00
d6f1c72354 [PrivateUse1] Allow out-of-tree devices to pass check when validating csr tensor args (#149374)
Fixes #149303
Follow-up: #147306

Because we have a dispatch key named `DispatchKey::SparseCsrPrivateUse1` for this case, we allow users to create a csr tensor on out-of-tree devices, so we should also let that pass the check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149374
Approved by: https://github.com/FFFrog, https://github.com/albanD
2025-04-11 09:05:20 +00:00
5590a0692c [aotinductor] fix std::{min.max} compilation error for sympy expr with multiple args (#150894)
### Compilation error
The issue is that u0 (an unbacked symint) can come from a smaller int dtype e.g. int16, int32.
```
error: no matching function for call to ‘min(int64_t&, short int&)’
  759 |     call_add_kernel_with_scaling_0(... std::min(100L, s97, u0) ...);
```

### Diff
The fix is to explicitly specify `int64_t` in the std::min template.
```
int64_t s97 = arg0_1_size[0];
int16_t u0_raw;      // not a long
auto u0 = u0_raw;

// Before
std::min({100L, s97, u0})
// After
std::min<int64_t>({100L, s97, u0})
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150894
Approved by: https://github.com/desertfire
2025-04-11 07:32:47 +00:00
44ed0c9fbb Revert "[profiler] don't disable CUPTI_LAZY_REINIT for cuda >= 12.6 (#150957)"
This reverts commit 37812009fd123d5c4a038ce798eedd4a89eeffad.

Reverted https://github.com/pytorch/pytorch/pull/150957 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/150957#issuecomment-2795878848))
2025-04-11 05:38:58 +00:00
6c7336cb31 [Profiler][HPU] Enable profiler.key_averages().table() for HPU devices (#150770)
Fixes #150769

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150770
Approved by: https://github.com/sraikund16, https://github.com/jeromean
2025-04-11 05:17:12 +00:00
85ada5d6dd [Dynamo] Allow dynamo to handle 'or' operator between two dicts (#147305)
Fixes #146538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147305
Approved by: https://github.com/anijain2305
2025-04-11 04:47:31 +00:00
6f6ff8837a [Inductor UT][Break XPU] Fix UTs for XPU broken by community. (#150830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150830
Approved by: https://github.com/anmyachev, https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #149862
2025-04-11 04:30:46 +00:00
d186c933f8 [Inductor UT][Break XPU] Apply CUDA tolerances changes on XPU that introduced by #144579. (#149862)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149862
Approved by: https://github.com/desertfire, https://github.com/jansel
2025-04-11 04:30:46 +00:00
a22d3e778e [dynamo][guards] Print relational guards only once (#150810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150810
Approved by: https://github.com/anijain2305
2025-04-11 04:10:37 +00:00
8b5e717601 c10d/Store: add clone feature (#150966) (#150966) (#151045)
Summary:
This adds a new `clone()` method to Store which will return a new Store instance that can be used from a different thread.

This is intended to better support multiple threads with stores such as when ProcessGroupNCCL needs a store to do error propagation.

Related issue: https://github.com/pytorch/pytorch/issues/150943

Approved by: https://github.com/fduwjj

Test Plan:
contbuild & OSS CI, see 205881ea4a

Test plan from GitHub:
```
pytest test/distributed/test_store.py -k PythonStore
pytest test/distributed/test_store.py -k clone
```

Differential Revision: D72789690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151045
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2025-04-11 04:00:23 +00:00
75162aa7de [ONNX] Support running bfloat16 models with ONNX Runtime (#149646)
Use ORTValue objects to support bfloat16 and other dtypes as inputs. This only supports cuda as ort only implements bfloat16 on cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149646
Approved by: https://github.com/titaiwangms
2025-04-11 03:38:26 +00:00
86370fd658 [dynamo] Allow guards to be dropped with custom filter functions. (#150936)
Summary: A follow up of https://github.com/pytorch/pytorch/pull/150689.

Test Plan: test_dynamo -k test_guard_filter_fn

Differential Revision: D72722322

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150936
Approved by: https://github.com/jansel
2025-04-11 03:06:34 +00:00
4b0cf9fc00 Optimize transformer encoder/decoder init suggestion (#146882)
Fixes #72253

Add hint message for users to manually initialize after created.

## Test Result

**Before**

![image](https://github.com/user-attachments/assets/1914223f-008e-4ff7-aea1-c54c55679f65)

![image](https://github.com/user-attachments/assets/fd4110c1-26f7-48fe-9582-80581ab72328)

**After**

![image](https://github.com/user-attachments/assets/12270ba2-b384-4fe6-b351-4287b272d102)

![image](https://github.com/user-attachments/assets/0194e3a0-700a-40da-a9de-e9854c2d5d2e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146882
Approved by: https://github.com/jbschlosser
2025-04-11 02:31:56 +00:00
1e92579126 Add torch._scaled_mm for CPU (#150410)
This PR is the duplicated one for https://github.com/pytorch/pytorch/pull/139975.

This PR is to add torch._scaled_mm for CPU backend.

_scaled_mm_out_cpu and _scaled_mm_cpu are newly added and included in the torch._scaled_mm CPU dispatch. We also add _scaled_mm_out_cpu_emulated as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various UTs related to FP8 to support CPU tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150410
Approved by: https://github.com/atalman
2025-04-11 02:23:03 +00:00
24ca7e91e6 [1/N] Use internal linkage in torch/csrc C++ files. (#150930)
Turn more functions and variables into static if they are not used outside the cpp files. Unused functions are removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150930
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-11 02:19:31 +00:00
48132de4af [c10d][fr] Fix the false positive in the dtype check in fr analysis script (#151063)
When checking dtype in the fr analysis script, we should only check it when the input or output numel is larger than zero. For gather or scatter, the output/input size will be an empty list for non-src or non-dst ranks, in which case we should just skip the check.
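
A rough sketch of that guard, under the assumption that sizes are recorded as a list of shapes and that an empty shape list (or an empty shape) means the rank holds no data on that side. The function names are hypothetical and do not come from the analysis script.

```python
from math import prod


def numel(sizes):
    """Total element count over a list of shapes.

    Assumption for this sketch: [] and [[]] both mean "no data on this rank".
    """
    return sum(prod(shape) for shape in sizes if shape)


def dtype_mismatch(input_sizes, input_dtype, output_sizes, output_dtype):
    # Skip the comparison when either side holds no elements, e.g. the
    # non-src ranks of a scatter or the non-dst ranks of a gather.
    if numel(input_sizes) == 0 or numel(output_sizes) == 0:
        return False
    return input_dtype != output_dtype


print(dtype_mismatch([[4, 4]], "float", [[4, 4]], "half"))  # real mismatch
print(dtype_mismatch([[4, 4]], "float", [[]], ""))          # skipped
```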

Differential Revision: [D72826823](https://our.internmc.facebook.com/intern/diff/D72826823)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151063
Approved by: https://github.com/d4l3k, https://github.com/kwen2501
2025-04-11 02:11:58 +00:00
df4e5294a6 Reapply "ProcessGroupGloo: support lazy_init (#150801)" (#151031)
This reverts commit 73f3d6d9aaa128d9917e8b3790933ba2855066cc.

Reapplies #150801

Test plan:

See #150801

submodule

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151031
Approved by: https://github.com/fduwjj
2025-04-11 01:58:35 +00:00
b7c0fda163 [MPS] Fix determine_backend_memory_format logic (#151042)
If input is channels last, then MPS will return a channels last output

This fixed `GPUTests.test_convolution_4_mps` from test_torchinductor.py

That previously failed with
```
AssertionError: expected size 3==3, stride 1==192 at dim=1; expected size 12==12, stride 48==16 at dim=2; expected size 16==16, stride 3==1 at dim=3
```
As FakeTensor implementation of conv returned `Contiguous`, rather than `ChannelLast` layout on MacOS-15 or later.
This doesn't seem to be very well documented, so will try to document the call path for `ExternKernel` invocation for `aten::convolution`:
 - First inductor decomp defined here is called
 c93e4b8290/torch/_inductor/kernel/conv.py (L424-L425)

- Then it goes thru FakeTensor decomposition implemented here
320914f1b6/torch/_subclasses/fake_impls.py (L739-L740)
- Finally it goes down to convolution meta registrations implemented here
320914f1b6/torch/_meta_registrations.py (L2416-L2417)
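
The stride pairs in the assertion message above are what one would expect when one side uses a contiguous (NCHW) layout and the other a channels-last layout for a tensor with C=3, H=12, W=16. A short sketch in plain Python; the batch size of 2 is an assumption, since dim 0 does not appear in the error message.

```python
def contiguous_strides(shape):
    """Row-major strides: the last dim is densest."""
    strides, acc = [], 1
    for extent in reversed(shape):
        strides.append(acc)
        acc *= extent
    return tuple(reversed(strides))


def channels_last_strides(shape):
    """Strides of an NCHW-shaped tensor stored physically as NHWC."""
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)


shape = (2, 3, 12, 16)  # N is assumed; C, H, W come from the error message
for dim in (1, 2, 3):
    print(dim, channels_last_strides(shape)[dim], contiguous_strides(shape)[dim])
```

Dims 1 to 3 give the pairs 1 vs 192, 48 vs 16, and 3 vs 1, matching the numbers in the failure.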

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151042
Approved by: https://github.com/dcci
2025-04-11 01:51:34 +00:00
320914f1b6 [c10d][libuv] Add back correct EOF case check (#151052)
We removed the wrong EOF case in https://github.com/pytorch/pytorch/pull/150987, and we add the correct one back in this PR. Since https://github.com/pytorch/pytorch/pull/150987 is a fix, we merged that PR first and use this PR as a follow-up to make the logic more complete.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151052
Approved by: https://github.com/XilunWu
2025-04-11 01:37:30 +00:00
c93e4b8290 [BC-breaking] Set NonStrict as default for export_for_training (#150941)
Summary:
- Flip default value of `strict` argument from True to False on torch.export.export_for_training API
- All callsites have been updated to provide this argument explicitly to avoid behavior change.
- If you see any breakages, that means you may have a new callsite that was missed; please set `strict=True` explicitly at the callsite to mitigate.

Test Plan: CI

Differential Revision: D72724975

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150941
Approved by: https://github.com/ydwu4
2025-04-11 00:50:05 +00:00
e945247f05 Revert two recent prologue prs (#151013)
These were landed in a bit of a rush to try to make the release. Reverting, then will re-land with https://github.com/pytorch/pytorch/pull/151009 applied, and do a full benchmark run with max-autotune.

Differential Revision: [D72791103](https://our.internmc.facebook.com/intern/diff/D72791103)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151013
Approved by: https://github.com/zou3519
2025-04-10 23:48:41 +00:00
c9a35c2a6e [C10D] Document object collectives limitations (#150815)
Adds louder warning labels in the doc page and docstring for object
collectives in hopes of raising awareness of several footgun issues
including accidental creation of cuda contexts by serializing and
sending 'device-local' gpu tensors over the object-* apis.

Preview:
<img width="902" alt="image" src="https://github.com/user-attachments/assets/e0c08c70-d8e5-4e15-b3e2-5cd563714f71" />

addresses #150798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150815
Approved by: https://github.com/kwen2501
2025-04-10 22:48:39 +00:00
dbcd0b571d Back out "[AOTI] Always use oss schema for ExternKernelNodes serialization" (#151026)
Summary: Revert for FC breaking

Test Plan: CI

Differential Revision: D72802075

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151026
Approved by: https://github.com/hl475
2025-04-10 22:36:35 +00:00
f304483e95 [ONNX] Add asdict method to VerificationInfo class (#151024)
This pull request introduces a new method to convert `VerificationInfo` objects to dictionaries and includes a corresponding test to ensure the method works correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151024
Approved by: https://github.com/titaiwangms
2025-04-10 22:23:33 +00:00
8d81806211 [inductor] Change minimum number of SMs to 60 to let Ada use Triton GEMM backend (#150888)
context: https://github.com/pytorch/pytorch/issues/150390#issuecomment-2790272814

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150888
Approved by: https://github.com/jansel
2025-04-10 22:10:55 +00:00
e786b3bf54 Revert "[inductor] Change minimum number of SMs to 60 to let Ada use Triton GEMM backend (#150888)"
This reverts commit 115a165f9b24e3aaaeb2d0994678116758bd636f.

Reverted https://github.com/pytorch/pytorch/pull/150888 on behalf of https://github.com/malfet due to This indeed broke all those inductor tests ([comment](https://github.com/pytorch/pytorch/pull/150888#issuecomment-2795231901))
2025-04-10 21:46:23 +00:00
6a65f2c4fe Revert "Support tuning of _scaled_grouped_mm (#150421)"
This reverts commit 8efcf21fff327d155350bf26ccba769bab58c077.

Reverted https://github.com/pytorch/pytorch/pull/150421 on behalf of https://github.com/malfet due to Looks like it broke lint, see a0ab243c3a/1 ([comment](https://github.com/pytorch/pytorch/pull/150421#issuecomment-2795218547))
2025-04-10 21:36:41 +00:00
a0ab243c3a Revert "Generalize poison fork logic for each device backend (#144664)"
This reverts commit 83bd0b63b55f224fada6d5f6dd7eb5b4cb3072fb.

Reverted https://github.com/pytorch/pytorch/pull/144664 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/144664#issuecomment-2795157082))
2025-04-10 21:02:14 +00:00
8efcf21fff Support tuning of _scaled_grouped_mm (#150421)
This includes the default aten implementation, as well as a Triton
implementation imported from FBGEMM
(https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/experimental/gemm/triton_gemm/grouped_gemm.py)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150421
Approved by: https://github.com/ngimel
2025-04-10 20:34:16 +00:00
abe41c5c9c Revert "c10d/Store: add clone feature (#150966)"
This reverts commit 205881ea4a451574c3a3de87c42484043a955d6e.

Reverted https://github.com/pytorch/pytorch/pull/150966 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/150966#issuecomment-2795063574))
2025-04-10 20:17:53 +00:00
8fdd61bc45 Fix torchscript issues with reference quantized modules (#150870)
Summary:
The reference quantized modules for linear / conv / etc fail to torchscript due to two issues

(1) The type of torch.qscheme doesn't script
(2) The "_DTYPE_TO_QVALUE_BOUNDS" values were resolving to union[float, int] instead of just int. We fix that with a hard cast.

See: <internal post> + comments for more context

Test Plan: unit tests + fixing this NB N6923590

Differential Revision: D72652616

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150870
Approved by: https://github.com/jerryzh168
2025-04-10 20:14:45 +00:00
31162214d8 Revert "[AOTI] Remove typedef for half and bfloat16 (#150657)"
This reverts commit 357814c85c00a2b5b3fb9add97735e4789caa7e0.

Reverted https://github.com/pytorch/pytorch/pull/150657 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/150657#issuecomment-2795042772))
2025-04-10 20:08:03 +00:00
252029b294 [Inductor] assert fallback output alignment (#150804)
The previous PR (https://github.com/pytorch/pytorch/pull/150777) fixes the alignment problem for fallback kernels, assuming the meta kernel is correct. This PR handles the case where the meta kernel is incorrect: an assertion is added whenever the compiler assumes a fallback kernel output is aligned.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150804
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #150777
2025-04-10 20:01:06 +00:00
115a165f9b [inductor] Change minimum number of SMs to 60 to let Ada use Triton GEMM backend (#150888)
context: https://github.com/pytorch/pytorch/issues/150390#issuecomment-2790272814

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150888
Approved by: https://github.com/jansel
2025-04-10 19:46:35 +00:00
4161c752bb [dynamo] unpack sequence lazily for list extend/deque extendleft (#150965)
Fixes https://github.com/pytorch/pytorch/issues/133063.

We were unpacking generators/iterators eagerly when we should be unpacking them one-by-one.
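
One observable difference between eager and one-by-one unpacking shows up when the iterator fails partway through. The snippet below is a plain CPython illustration, not dynamo code; the `numbers` generator and the variable names are made up for the example.

```python
from collections import deque


def numbers():
    """Hypothetical iterator that yields two items and then fails."""
    yield 1
    yield 2
    raise ValueError("iterator failed")


# Eager unpacking: the whole iterator is materialized first, so the
# failure happens before anything reaches the target list.
eager_target = []
try:
    items = list(numbers())
    eager_target.extend(items)
except ValueError:
    pass

# Lazy unpacking: list.extend pulls items one at a time, so the items
# produced before the failure are already in the list.
lazy_target = []
try:
    lazy_target.extend(numbers())
except ValueError:
    pass

# deque.extendleft behaves the same way, inserting each item at the left.
lazy_deque = deque()
try:
    lazy_deque.extendleft(numbers())
except ValueError:
    pass
```

In CPython, `eager_target` stays empty while `lazy_target` and `lazy_deque` keep the two items seen before the error, which is the behavior a tracer has to reproduce.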

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150965
Approved by: https://github.com/jansel
2025-04-10 19:31:31 +00:00
389cd15265 [export] check tuple length mismatch for dynamic_shapes spec (#150976)
Summary: weren't checking this

Test Plan: test_export

Differential Revision: D72761995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150976
Approved by: https://github.com/angelayi
2025-04-10 19:08:43 +00:00
f663aa4e81 [c10d][tcp_store] Fix connection reset caused by wrong socket close (#150987)
While fixing the memory leak in https://github.com/pytorch/pytorch/pull/145757, we accidentally closed the socket when nread == 0, assuming this meant the connection had been closed. That assumption is wrong, according to the libuv docs: https://docs.libuv.org/en/v1.x/stream.html#c.uv_read_cb.

> nread might be 0, which does not indicate an error or EOF. This is equivalent to EAGAIN or EWOULDBLOCK under read(2).

We found this bug while debugging a broken pipe issue that occurs when users call set and then immediately wait for all keys on 128 ranks. It might also explain other broken pipe issues we have seen in prod jobs recently.

Added a unit test to test this case.
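
The distinction can be shown with plain Python sockets. This is an analogy only, not the TCPStore or libuv code: a non-blocking `recv` that raises `BlockingIOError` plays the role of `nread == 0` (nothing to read right now), while an empty `bytes` result plays the role of EOF. The `classify_read` helper is hypothetical.

```python
import select
import socket


def classify_read(sock):
    """Classify one non-blocking read attempt on sock."""
    try:
        data = sock.recv(1024)
    except BlockingIOError:
        # No data available yet: the connection is still open.
        return "would-block"
    if data == b"":
        # Peer closed the connection: only now is closing correct.
        return "eof"
    return "data"


writer, reader = socket.socketpair()
reader.setblocking(False)

states = [classify_read(reader)]  # nothing has been sent yet

writer.sendall(b"ping")
select.select([reader], [], [], 1.0)  # wait until the bytes are readable
states.append(classify_read(reader))

writer.close()
select.select([reader], [], [], 1.0)  # wait until EOF is observable
states.append(classify_read(reader))

reader.close()
```

Closing the socket on the first state would drop a healthy connection, which matches the broken pipe symptom described above.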

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150987
Approved by: https://github.com/d4l3k, https://github.com/XilunWu
2025-04-10 18:48:57 +00:00
e7ed50f27b [async TP] Fix handling of case where scatter dim = 0 for 2D output tensor (#150935)
## Summary of changes

1. Change assertion to a warning, when no all gather or reduce scatter patterns are found, and remove the corresponding unit test. It seems some valid TP graphs may not have any pattern matches, from what I can see.
2. Fix wrong variable name being referenced (`A_with_scatter_dim_0` instead of just `A`)
3. Simplify reshaping to target output shape (don't need to recalculate output shape)
4. When the "A" tensor is 2D, so we are doing a 2D x 2D scaled mm, we need to fix our handling of the case where the scatter dim is 0. When the scatter dim is 0 for the 2D scaled mm output shape, this is actually dim 1 in the unreduced stacked partial scaled mm outputs, which have a (logical) shape of `(group_size, M//group_size, N)`. To summarize:
    - Unreduced stacked partials are of shape `(M, N)`
    - We view them as `(group_size, M//group_size, N)` and reduce along the scatter dim (`group_size` / dim 0).
    - Reduced output (`reduced_out`) has shape `(M//group_size, N)`
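
The shape bookkeeping in item 4 can be sketched with nested lists. This is a minimal illustration under the stated shapes, not the async TP implementation; `reduce_stacked_partials` is a hypothetical helper and the sum stands in for whatever reduction the collective applies.

```python
def reduce_stacked_partials(stacked, group_size):
    """Reduce an (M, N) stack of partials down to (M // group_size, N).

    The rows are viewed as (group_size, M // group_size, N) and summed
    along dim 0, the group dimension.
    """
    m = len(stacked)
    assert m % group_size == 0, "M must be divisible by group_size"
    rows = m // group_size
    n = len(stacked[0])
    # chunks[g] is the (rows, N) partial contributed by group member g.
    chunks = [stacked[g * rows:(g + 1) * rows] for g in range(group_size)]
    return [
        [sum(chunks[g][r][c] for g in range(group_size)) for c in range(n)]
        for r in range(rows)
    ]


# M = 4, N = 2, group_size = 2: two stacked (2, 2) partials.
stacked = [[1, 2], [3, 4], [10, 20], [30, 40]]
reduced_out = reduce_stacked_partials(stacked, group_size=2)
```

Here `reduced_out` has shape `(2, 2)`, that is `(M//group_size, N)`, and no output shape needs to be recomputed separately.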

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150935
Approved by: https://github.com/lw
2025-04-10 18:25:48 +00:00
08831f30bb [Intel GPU] Allow XPU backend in Depthwise_conv2d&3d operators (#149114)
This modification is to support XPU kernels for depthwise_conv2d and depthwise_conv3d.
Currently, when running depthwise_conv on XPU devices, it is calculated with Mkldnn via the ConvBackend::Overrideable path.
After this modification, depthwise_conv will be calculated directly using XpuDepthwise3d when the Mkldnn backend is disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149114
Approved by: https://github.com/guangyey, https://github.com/albanD
2025-04-10 17:49:27 +00:00
37812009fd [profiler] don't disable CUPTI_LAZY_REINIT for cuda >= 12.6 (#150957)
Credit to @mgmtea who wrote the initial version of this PR: https://github.com/pytorch/pytorch/pull/146604

Context: CUPTI is the NVIDIA library that Kineto uses for collecting GPU-side info during profiling. The intended usage is to register a callback while you want profiling to occur, and then unregister the callback when you want profiling to stop. But a bug would cause crashes if CUPTI callbacks were de-registered when used with cudagraphs. The workaround was to disable "CUPTI_LAZY_REINIT" and "CUPTI_TEARDOWN" in Kineto - which prevents crashes, but can result in slower execution after profiling has occurred and completed.

This bug is believed to be fixed in CUDA >= 12.6, so this PR restricts the DISABLE_CUPTI_LAZY_REINIT=1 and CUPTI_TEARDOWN=0 workaround to CUDA versions older than 12.6. Additionally, `profiler_allow_cudagraph_cupti_lazy_reinit_cuda12()` is added as an escape hatch so that we can add a killswitch in case we see more crashes related to this.

Differential Revision: [D72745929](https://our.internmc.facebook.com/intern/diff/D72745929)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150957
Approved by: https://github.com/aaronenyeshi, https://github.com/Skylion007
2025-04-10 17:45:01 +00:00
6720d23969 Fixing NCCL abort hang issue when a ProcessGroupNCCL manages multiple ncclComms (#150690)
Detail of the issue:

If PyTorch issues send/recv on each 2-rank comm, and these comms are managed by a single ProcessGroupNCCL instance, then the comms need to be aborted either in sequence or as a group.

For example, the following sequential abort will cause a hang in NCCL:
recv(..., comm0, stream);
send(..., comm1, stream);
abort(comm1);
abort(comm0);

Fixes #119797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150690
Approved by: https://github.com/kwen2501, https://github.com/eqy, https://github.com/atalman
2025-04-10 17:33:26 +00:00
1250106630 [pytorch] Remove numpy dependency from Knapsack Evaluator (#150825)
Summary:
The two implementations are functionally equivalent. They both calculate the memory budget at the knee point in the Pareto frontier using the same algorithm.

1. np.linspace -> basic list comprehension
2. runtime and memory values -> lists instead of numpy arrays
3. np.ptp -> max - min
4. np.norm -> diff with min value / range
5. np.sqrt -> **0.5
6. np.argmin -> .index(min(_))
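
The replacements listed above can be sketched in plain Python. This is an illustration under assumptions, not the Knapsack Evaluator code: the helper names, the sample runtimes, and the "closest to the origin after normalization" definition of the knee point are all chosen for the example.

```python
def linspace(start, stop, num):
    """Evenly spaced values from start to stop, inclusive (np.linspace stand-in)."""
    if num == 1:
        return [float(start)]
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def normalize(values):
    """Scale values to [0, 1] using (v - min) / (max - min)."""
    lo = min(values)
    span = max(values) - lo  # replaces np.ptp
    if span == 0:
        return [0.0] * len(values)
    return [(v - lo) / span for v in values]


def knee_index(runtimes, memories):
    """Index of the point closest to the origin after normalization."""
    rt = normalize(runtimes)
    mem = normalize(memories)
    distances = [(r ** 2 + m ** 2) ** 0.5 for r, m in zip(rt, mem)]  # replaces np.sqrt
    return distances.index(min(distances))  # replaces np.argmin


budgets = linspace(0.0, 1.0, 5)
recompute_runtimes = [10.0, 4.0, 2.0, 1.5, 1.0]  # made-up sample data
knee = knee_index(recompute_runtimes, budgets)
knee_budget = budgets[knee]
```

With this sample data the knee lands on the second candidate, where most of the runtime saving has already been obtained for a small memory budget.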

Test Plan:
# Unit Testing

```
buck test mode/opt //caffe2/test/functorch:test_ac_knapsack; pingme "tests done"
Buck UI: https://www.internalfb.com/buck2/f4e41eb8-e775-4f04-b4e7-8e567599deb8
Test UI: https://www.internalfb.com/intern/testinfra/testrun/10133099236155875
Network: Up: 24KiB  Down: 1.9GiB  (reSessionID-7cd11487-f3e7-43ab-982a-805510771c8d)
Executing actions. Remaining      0/259826                                                                                                  98:15:40.5s exec time total
Command: test.     Finished 3 local, 5 remote, 103467 cache (99% hit)                                                                       98:15:14.8s exec time cached (99%)
Time elapsed: 1:09.9s
Tests finished: Pass 15. Fail 0. Fatal 0. Skip 0. Build failure 0
```

# End to End Testing

### Baseline Run with DP

Let's confirm everything we are running on works.

- Optimization Algo: DP
- Memory Budget: 0.05
- AIX Link: apf_local-basilwong-2025-03-22_20:39:10
- TLParse rank 0: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpDJaWp5/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
- TLParse rank 1:  https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpDJaWp5/rank_1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

### Dynamic Memory Budget (Before Change)

- Revision: 2c95489b7f79
- Optimization Algo: Dynamic Memory Budget
- Memory Budget: 0.05
- AIX Link: https://www.internalfb.com/mlhub/pipeline/4088035428184866
- TLParse:
   - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpykEy8U/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
   - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpykEy8U/rank_1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

### Dynamic Memory Budget (After Change)

- Revision: 14353eef3c9e
- Optimization Algo: Dynamic Memory Budget
- Memory Budget: 0.05
- AIX Link: https://www.internalfb.com/mlhub/pipeline/1613558749306737
- TLParse Links:
   - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpZKNWFw/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
    - https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpZKNWFw/rank_1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

As a sanity check, let's take the AC information for compile id 7_0_0 from rank 0 of each TLParse.

* Baseline: P1779400819
   * Saved node values show we are storing much more compared to dynamic memory:

```
  "Knapsack Saved Nodes": [
    16,
    17,
    19,
    20,
    21,
    22,
    24,
    25,
    26,
    27,
    28,
    29,
    30,
    31,
    32,
    33,
    34,
    35,
    36,
    37,
    38,
    39,
    40,
    41,
    42,
    43,
    44,
    45,
    46,
    47,
    49,
    50,
    51,
    52,
    53,
    54,
    55,
    56,
    57,
    58,
    59,
    60
  ]
```

* Before Change: P1779401775
   * Saved nodes are similar to those after the change, but not identical.

```
  "Knapsack Saved Nodes": [
    24,
    25,
    26,
    27,
    28,
    29,
    30,
    31,
    32,
    33,
    34,
    35,
    36,
    37,
    38,
    39,
    40,
    41,
    42,
    43,
    44,
    45,
    46,
    47,
    49,
    50
  ]
```

* After Change: P1779402106
   * Here we see that the largest saved nodes are about the same, but there is a small discrepancy for the smallest nodes.

```
  "Knapsack Saved Nodes": [
    24,
    25,
    26,
    27,
    28,
    29,
    30,
    31,
    32,
    33,
    34,
    35,
    36,
    37,
    38,
    39,
    40,
    41,
    42,
    43,
    44,
    45,
    46,
    47,
    50,
    51,
    57,
    58,
    59,
    60,
    61,
    62
  ],
```

The discrepancy can be explained by looking at the estimated memory values. This is the non-deterministic part (below are the top 5 memory values for the considered candidates):

```
    0.05774741703905514,
    0.007333005338292718,
    0.007333005338292718,
    0.007333005338292718,
    0.007333005338292718,
```

vs

```
    0.049254204820440746,
    0.006254502199421049,
    0.006254502199421049,
    0.006254502199421049,
    0.006254502199421049,
```

Given that the dynamic memory implementations performed similarly in an E2E test and that the memory estimates are non-deterministic, we should be good to land.

Differential Revision: D71692245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150825
Approved by: https://github.com/seemethere, https://github.com/jansel
2025-04-10 17:07:03 +00:00
5471e80fb4 Remove guard_size_oblivious from vector_norm decomposition. (#148809)
This PR removes the use of guard_size_oblivious in vector_norm by inlining it into the runtime check.
This prevents any data-dependent error from appearing at the locations where guard_size_oblivious
used to be; before this PR, those locations could potentially break. This is NOT BC-breaking and does not change semantics relative to eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148809
Approved by: https://github.com/bobrenjc93
2025-04-10 16:19:00 +00:00
e6969c1bd8 [export] Symint support (nonstrict, Dim.DYNAMIC) (#150198)
Fixes https://github.com/pytorch/pytorch/issues/113682 only in the non-strict export case. Also we only support Dim.DYNAMIC/AUTO, not named-Dims

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150198
Approved by: https://github.com/pianpwk
2025-04-10 15:06:23 +00:00
596e44d26a [inductor] Enable docstring_linter on _inductor (#144622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144622
Approved by: https://github.com/eellison
ghstack dependencies: #144621
2025-04-10 14:32:26 +00:00
ba35793226 [inductor] Add tests for new docstring_linter features (fix #142496) (#144621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144621
Approved by: https://github.com/eellison
2025-04-10 14:32:26 +00:00
73f3d6d9aa Revert "ProcessGroupGloo: support lazy_init (#150801)"
This reverts commit f237ee54bfb35d16cd10e358d4b78578c88a5781.

Reverted https://github.com/pytorch/pytorch/pull/150801 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/150801#issuecomment-2793161239))
2025-04-10 13:44:31 +00:00
7b7b9d707e [CI] Add XPU compiled check in CICD (#150771)
Address the suggestion from https://github.com/pytorch/pytorch/issues/150001#issuecomment-2753407421

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150771
Approved by: https://github.com/malfet, https://github.com/atalman
2025-04-10 13:33:27 +00:00
4273e5d15c Expose is_available API for torch.backends.mkldnn (#147432)
As the title states.

Other backends such as torch.backends.mkl and torch.backends.openmp already
expose an is_available API for users.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147432
Approved by: https://github.com/albanD
2025-04-10 05:05:37 +00:00
1a1a32ce5a [elastic][test] fix race condition in test_barrier_timeout_rank_tracing (#150768)
# Root cause
The barrier timeout of 0.1 is too short; some threads may not have enough time to reach the barrier.

# How to reproduce
Adding a sleep makes the failure easy to reproduce.
```python
    def test_barrier_timeout_rank_tracing(self):
        N = 3

        store = dist.HashStore()

        def run_barrier_for_rank(i: int):
            if i != 0:
                import time;time.sleep(1)  # Let some thread sleep for a while
            try:
                store_util.barrier(
                    store,
                    N,
                    key_prefix="test/store",
                    barrier_timeout=0.1,
                    rank=i,
                    rank_tracing_decoder=lambda x: f"Rank {x} host",
                    trace_timeout=0.01,
                )
            except Exception as e:
                return str(e)
            return ""

```
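
The same race can be shown with the standard library alone. This sketch uses `threading.Barrier` as a stand-in for `store_util.barrier`, so it only illustrates the timing problem; `run_ranks` and its parameters are hypothetical.

```python
import threading
import time


def run_ranks(n, barrier_timeout, straggler_delay):
    """Run n threads through one barrier and record each rank's outcome."""
    barrier = threading.Barrier(n)
    results = [None] * n

    def worker(rank):
        if rank != 0:
            time.sleep(straggler_delay)  # every rank except 0 arrives late
        try:
            barrier.wait(timeout=barrier_timeout)
            results[rank] = "ok"
        except threading.BrokenBarrierError:
            # Raised for the rank that timed out and for any rank that
            # reaches the barrier after it has been broken.
            results[rank] = "timeout"

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Timeout shorter than the straggler delay: the barrier breaks.
too_short = run_ranks(3, barrier_timeout=0.05, straggler_delay=0.3)
# Timeout comfortably longer than the delay: every rank passes.
long_enough = run_ranks(3, barrier_timeout=2.0, straggler_delay=0.05)
```

Whether a test passes therefore depends on thread scheduling unless the timeout leaves enough headroom for the slowest rank.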

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150768
Approved by: https://github.com/d4l3k
2025-04-10 04:40:16 +00:00
a6933a1c42 [Inductor] Remove triton dtype patch which has landed (#149611)
As this [pr][0] has already landed, we should remove its patch.

Having [mentioned][1] this before, I am making this change now to avoid omissions.

[0]: https://github.com/triton-lang/triton/pull/3342
[1]: https://github.com/pytorch/pytorch/pull/147583/files#r1970440062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149611
Approved by: https://github.com/eellison
2025-04-10 03:42:55 +00:00
b80bb87689 cpp_wrapper: Miscellaneous fixups (#150143)
1. Revisit preprocessing code in cpp_builder.py, removing a hack that channels it through stdout.
2. Fix ops that return None.

Differential Revision: [D72053414](https://our.internmc.facebook.com/intern/diff/D72053414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150143
Approved by: https://github.com/desertfire
2025-04-10 03:31:12 +00:00
cd80778ac8 Fix issue in optimized_add issue: make_optimized should be called on non args only (#150955)
PR https://github.com/pytorch/pytorch/pull/149665 changed optimized_add in a way that is causing an issue internally.
In general, make_optimized should only be called with valid new_args. new_args can become None
when elements already exist; we should also break out of the loop in that case.

Note that I also kept the optimized summation only when both lhs and rhs lengths are <= 2.
This is OK because the optimization is based on the inductive property of adding one symbol at a time;
the [2]+[2] case here serves as the base case (I feel we can also remove it).

Note that keeping it for all sizes, while correct, may not be as efficient (we would do N log(N) insertions),
and there is no current justification for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150955
Approved by: https://github.com/Mingming-Ding, https://github.com/atalman, https://github.com/bobrenjc93
2025-04-10 03:00:21 +00:00
bf7d8ef10d [Docs] Clarify behavior when integer dtype is used with requires_grad=True in tensor.to() (#150913)
Fixes #150618

Related comment: https://github.com/pytorch/pytorch/issues/3226#issuecomment-489362234

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150913
Approved by: https://github.com/janeyx99, https://github.com/soulitzer, https://github.com/cyyever

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-04-10 02:52:58 +00:00
78b3d71ece Docs: Add missing whitespace in the cmake warning message (#150929)
A trailing space is needed so that the string concatenates correctly with the one that follows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150929
Approved by: https://github.com/Skylion007
2025-04-10 02:50:56 +00:00
3d3fcaaf7b Delegate torch.accelerator.device_count to torch.xxx.device_count for multi-process usage (#149924)
# Motivation
Adapt `torch.accelerator.device_count` for multi-process usage. For example, `torch.cuda.device_count` avoids poisoning fork, so `torch.accelerator.device_count` should meet the same requirement.
Now that `torch.get_device_module(device).device_count` supports this, `torch.accelerator.device_count` should align with this behavior as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149924
Approved by: https://github.com/albanD
ghstack dependencies: #147507
2025-04-10 02:37:37 +00:00
6972255dad Document poison fork note for accelerator APIs (#147507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147507
Approved by: https://github.com/sraikund16, https://github.com/kwen2501, https://github.com/albanD
2025-04-10 02:37:37 +00:00
83bd0b63b5 Generalize poison fork logic for each device backend (#144664)
# Motivation
Generalize the poison_fork code to make it reusable across different devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144664
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-04-10 02:34:53 +00:00
cyy
322f883c0c Remove unneeded CUDA logic from _create_build_env (#145822)
Because FindCUDAToolkit.cmake has that logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145822
Approved by: https://github.com/albanD
2025-04-10 02:17:28 +00:00
cyy
54827752a4 [5/N] Remove unnecessary once flag usage (#147445)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147445
Approved by: https://github.com/albanD
2025-04-10 01:48:10 +00:00
205881ea4a c10d/Store: add clone feature (#150966)
This adds a new `clone()` method to Store which will return a new Store instance that can be used from a different thread.

This is intended to better support multiple threads with stores such as when ProcessGroupNCCL needs a store to do error propagation.

Related issue: https://github.com/pytorch/pytorch/issues/150943

Test plan:

```
pytest test/distributed/test_store.py -k PythonStore
pytest test/distributed/test_store.py -k clone
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150966
Approved by: https://github.com/fduwjj
2025-04-10 01:41:50 +00:00
061832bc7a Gracefully handle optree less than minimum version (#150956)
Summary:
- We are saying the minimum version of optree that PyTorch can use is
  0.13.0
- If a user imports torch.utils._cxx_pytree, it will raise an
  ImportError if optree doesn't exist or exists and is less than the
  minimum version.

Fixes https://github.com/pytorch/pytorch/issues/150889. There are
actually two parts to that issue:
1. dtensor imports torch.utils._cxx_pytree, but the optree installed in
   the environment might be too old. Instead, raising ImportError in
   torch.utils._cxx_pytree solves the issue.
2. We emit an "optree too low version" warning. I've deleted the
   warning in favor of the more explicit ImportError.
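
The fail-fast behavior described above can be sketched as follows. This is not the code in torch.utils._cxx_pytree; `parse_version` and `require_min_version` are hypothetical helpers, and the parser only understands plain dotted numeric versions.

```python
def parse_version(text):
    """Turn a dotted version string such as '0.13.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".")[:3])


def require_min_version(name, installed, minimum):
    """Raise ImportError if the package is missing or older than minimum.

    installed is the installed version string, or None if the package
    could not be found.
    """
    if installed is None:
        raise ImportError(f"{name} is not installed; need >= {minimum}")
    if parse_version(installed) < parse_version(minimum):
        raise ImportError(
            f"{name} {installed} is too old; need >= {minimum}"
        )


# A new enough version passes silently.
require_min_version("optree", "0.13.0", "0.13.0")

# An older version fails at import time instead of emitting a warning.
try:
    require_min_version("optree", "0.12.1", "0.13.0")
    outcome = "imported"
except ImportError as exc:
    outcome = str(exc)
```

Raising `ImportError` lets callers fall back with an ordinary `try`/`except ImportError` around the import, which a warning would not allow.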

Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150956
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/XuehaiPan
2025-04-10 01:22:50 +00:00
9d1528186f Fix static functions when using module in MSVC (#148675)
If you try to use torch in C++ with modules, it will not compile because static functions are not supported by MSVC when using modules: https://developercommunity.visualstudio.com/t/10323558.

It's also aligned with [C++20 standard](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/n4849.pdf) (ISO/IEC 14882:2020) 10.2.7 Export declaration [module.interface]: "Exported names have either external linkage or no linkage".

Fixes https://github.com/pytorch/pytorch/issues/71309
Tested using the following code.

```c++
export module testModule;

import <torch/torch.h>;
import <memory>;
import <string>;
import <tuple>;
import <iostream>;

export namespace testModule
{

    export void test()
    {
        torch::Tensor tensor1 = torch::rand({ 2, 3 });
        torch::Tensor tensor2 = torch::rand({ 3, 2 });
        // Perform tensor multiplication
        torch::Tensor result = torch::matmul(tensor1, tensor2);

        // Print the tensors
        std::cout << "Tensor 1: " << tensor1 << std::endl;
        std::cout << "Tensor 2: " << tensor2 << std::endl;
        std::cout << "Result of multiplication: " << result << std::endl;
    }
}
```

```c++
import testModule;

int main()
{
	testModule::test();
	return 0;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148675
Approved by: https://github.com/albanD, https://github.com/malfet

Co-authored-by: mantaionut <ionut@janeasystems.com>
2025-04-10 01:19:54 +00:00
69cee91a55 Code Clean: Using the new builtin function provides by python 3.8 later (#150839)
Changes:
- reversed
- math.perm
- inspect.getfile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150839
Approved by: https://github.com/Skylion007
2025-04-10 01:17:39 +00:00
f3cf3ec591 [AOTInductor] Add User Managed buffer for AOTI constant buffer. (#150276)
Summary:
We add the ability for users to pass an at::Tensor directly
into AOTInductor to be used as the constant.
This user-managed buffer skips the copying step in AOTInductor and lets
users manage the memory themselves.

Test Plan:
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/data/users/$USER/pytorch/build/bin/test_aoti_inference


Differential Revision: [D72589514](https://our.internmc.facebook.com/intern/diff/D72589514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150276
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2025-04-10 00:15:44 +00:00
92e81cf41a Add real_tensor to the FakeTensor in node.meta["val"] (#150948)
Summary: We need real_tensor on the FakeTensor in node.meta["val"] in order to aot_compile the draft exported programs. Otherwise, we cannot propagate real tensors even when fake_mode.propagate_real_tensors = True.

This also fixes real tensor propagation in `run_decomposition()`.

Test Plan:
```
 buck2 run @mode/dev-nosan  caffe2/test:test_export -- -r test_dedup_data_dependent_failure
```

Differential Revision: D72732714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150948
Approved by: https://github.com/angelayi
2025-04-10 00:11:46 +00:00
91d1826539 Add dynamic version for mm_loop benchmark (#150865)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150865
Approved by: https://github.com/eellison
2025-04-09 23:37:43 +00:00
a8b48ff14c [DTensor] clean up _local_shard_size_and_offset (#150650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150650
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
ghstack dependencies: #150490
2025-04-09 22:07:48 +00:00
3532dd4f1e [DTensor] StridedShard support uneven sharding (#150490)
This enables using FSDP+TP on parameters with dimensions that aren't
evenly divisible by the DP/TP mesh sizes.

- this may not support all possible combinations of strided shardings
  and shardings, but the support before this PR is not complete anyway

This contains several fixes for different aspects of DTensor behavior
relating to uneven strided sharding:
- original creation of the strided tensor requires fixes in
  StridedShard._split_tensor
- full_tensor() reconstruction requires fixes in
  StridedShard._to_replicate_tensor to correctly reshuffle the data into
  the original pre-sharded order
- Distributed Checkpointing support requires correct computation of the
  compute_local_shape_and_global_offset util so it knows how a local
  shard maps to the global tensor, for reconstruction during
  load/reshard.

This PR also adds a util `_explicit_order_placements` which converts a list of
placements with StridedSharding into a list of placements with only
regular sharding, with the order shuffled such that it is equivalent.

Builds on and completes the work started in https://github.com/pytorch/pytorch/pull/148894

Uneven Sharding Example
-------
(copied from _StridedShard._to_replicate_tensor docstring)

mesh = (DP=2, TP=2)
original = torch.arange(5)

**Applying Sharding**

Step 1 - Apply TP sharding
`tp = distribute_tensor(x, world_mesh['tp'], [Shard(0)])`

local_tensors:
rank0: [0,1,2]    rank1: [3,4]
rank2: [0,1,2]    rank3: [3,4]

Step 2 - Apply FSDP sharding
`dp_tp = ...` (the process of creating a strided-shard tensor is skipped over as it is hacky and complicated)
dp_tp has placement (_StridedShard(0, split_factor=2), Shard(0))
local_tensors:
rank0: [0,1]  rank1: [3]
rank2: [2]    rank3: [4]

**Reconstructing the Full Tensor**
Now, say someone wants to reconstruct dp_tp's full tensor. This will invoke 'redistribute' to replicate.
redistribute will first replicate the "Shard(0)" placement on the rightmost mesh dim, then replicate the
StridedShard placement second, which is implemented by this function.
So our starting point (`local_tensor` arg) is the result of replicating the Shard(0) placement across the
TP dim, which looks like this.

Note the discrepancy with the 'tp sharded tensor' line above!  We'll fix it by locally shuffling data.

local_tensors:
rank0: [0,1,3]  rank1: [0,1,3]
rank2: [2,4]    rank3: [2,4]

Step 1: replicate over the DP dimension.  Afterwards, each rank can locally sort the values.
  note: we need padding to do this allgather, and we'll need to keep track of the padding amount for later
local_tensors:
rank0: [0,1,3,2,4]    rank1: [0,1,3,2,4]
rank2: [0,1,3,2,4]    rank3: [0,1,3,2,4]

Step 2: chunk and shuffle values around to account for the wrong order of operations above
and get the original tensor content back

01324#       <- our allgather includes padding, if padding was applied in step 1
01324        <- Remove the padding
013, 24      <- chunk once, 'undoing' the DP allgather
01, 3, 2, 4  <- chunk each chunk, 'undoing' the initial (wrong) TP allgather performed by Shard(0)->Replicate()
012, 34      <- interleave with stride=TP mesh dim size
01234        <- concatenate
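
The walkthrough above can be replayed with plain Python lists. This is a sketch of the reordering idea only, not the `_StridedShard` implementation: it ignores padding, `chunk` merely mimics `torch.chunk`-style ceil-sized splitting, and all function names are made up for the example.

```python
def chunk(seq, n):
    """Split seq into n pieces of ceil(len / n) items; later pieces may be short."""
    size = -(-len(seq) // n)
    return [seq[i * size:(i + 1) * size] for i in range(n)]


def shard(original, dp, tp):
    """Shard over TP first, then over DP; returns local[dp_rank][tp_rank]."""
    tp_shards = chunk(original, tp)
    return [[chunk(tp_shards[t], dp)[d] for t in range(tp)] for d in range(dp)]


def naive_allgather(local):
    """Gather TP first, then DP, which yields the wrong element order."""
    return [x for row in local for piece in row for x in piece]


def restore_order(flat, dp, tp):
    """Undo the gathers and interleave pieces with stride = TP size."""
    # Recompute the per-rank piece lengths from the sharding rule.
    lengths = shard(list(range(len(flat))), dp, tp)
    pieces, pos = {}, 0
    for d in range(dp):
        for t in range(tp):
            k = len(lengths[d][t])
            pieces[(d, t)] = flat[pos:pos + k]
            pos += k
    # Concatenate TP-major so each original TP shard is contiguous again.
    return [x for t in range(tp) for d in range(dp) for x in pieces[(d, t)]]


original = list(range(5))
local = shard(original, dp=2, tp=2)
gathered = naive_allgather(local)
restored = restore_order(gathered, dp=2, tp=2)
```

`gathered` reproduces the `01324` ordering from the walkthrough, and `restored` is the original `01234`.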

Co-authored-by: Luca Wehrstedt <lw@meta.com>
Co-authored-by: Will Constable <whc@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150490
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2025-04-09 22:07:48 +00:00
cc2decdb25 [CI][CUDA][Distributed]Update test_composability.py (#148578)
The line world_size = int(os.getenv("WORLD_SIZE", 4)) further down indicates that the tests in this file require not just more than 1 GPU but at least 4 GPUs. skip_if_lt_x_gpu(4) does not properly skip them on a platform with 2 GPUs.

skip_if_lt_x_gpu appears to be broken, potentially related to a similar issue: https://github.com/pytorch/pytorch/issues/146094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148578
Approved by: https://github.com/atalman
2025-04-09 21:57:05 +00:00
786422a4d7 Remove a workaround added in #149381 (#150693)
Remove a workaround added in https://github.com/pytorch/pytorch/pull/149381.

Fixes https://github.com/pytorch/xla/issues/8934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150693
Approved by: https://github.com/albanD
2025-04-09 21:48:03 +00:00
087e8587cd support backed_size_oblivious in guard_or_false/guard_or_true (#150231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150231
Approved by: https://github.com/pianpwk
2025-04-09 21:47:20 +00:00
31fe258efc [inductor] Add features to docstring_linter (see #142496) (#145834)
## Improvements to `docstring_linter`

* Add a "grandfather list" of existing undocumented classes and functions (`--grandfather`, `--grandfather-tolerance`, `--no-grandfather`, `--write-grandfather`)
* In classes, now just one of the class itself or its `__init__()` method needs to be documented (`--lint-init` turns the old behavior back on)
* Now classes and functions defined local to other functions do not need to be documented (`--lint-local` turns the old behavior back on)
* New `--report` flag produces a compact report of long, undocumented classes or function definitions: see attached example run over all pytorch: [pytorch-docs.json](https://github.com/user-attachments/files/18455981/pytorch-docs.json)

## Help text

```
$ python tools/linter/adapters/docstring_linter.py --help
usage: docstring_linter.py [-h] [-l] [-v] [--grandfather GRANDFATHER] [--grandfather-tolerance GRANDFATHER_TOLERANCE] [--lint-init]
                           [--lint-local] [--lint-protected] [--max-class MAX_CLASS] [--max-def MAX_DEF]
                           [--min-docstring MIN_DOCSTRING] [--no-grandfather] [--report] [--write-grandfather]
                           [files ...]

`docstring_linter` reports on long functions, methods or classes without docstrings

positional arguments:
  files                 A list of files or directories to lint

optional arguments:
  -h, --help            show this help message and exit
  -l, --lintrunner      Run for lintrunner and print LintMessages which aren't edits
  -v, --verbose         Print more debug info
  --grandfather GRANDFATHER, -g GRANDFATHER
                        Set the grandfather list
  --grandfather-tolerance GRANDFATHER_TOLERANCE, -t GRANDFATHER_TOLERANCE
                        Tolerance for grandfather sizes, in percent
  --lint-init, -i       Lint __init__ and class separately
  --lint-local, -o      Lint definitions inside other functions
  --lint-protected, -p  Lint functions, methods and classes that start with _
  --max-class MAX_CLASS, -c MAX_CLASS
                        Maximum number of lines for an undocumented class
  --max-def MAX_DEF, -d MAX_DEF
                        Maximum number of lines for an undocumented function
  --min-docstring MIN_DOCSTRING, -s MIN_DOCSTRING
                        Minimum number of characters for a docstring
  --no-grandfather, -n  Disable the grandfather list
  --report, -r          Print a report on all classes and defs
  --write-grandfather, -w
                        Rewrite the grandfather list
```
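
The core check described above (long definitions without docstrings) can be sketched with the standard `ast` module. This is an illustration only, not the linter's implementation; `undocumented_long_defs` and the threshold `MAX_DEF_LINES` are assumptions made for the example:

```python
import ast

MAX_DEF_LINES = 5  # assumed threshold for this sketch, not the linter's default


def undocumented_long_defs(source, max_lines=MAX_DEF_LINES):
    """Return names of functions/classes longer than `max_lines` with no docstring."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            length = node.end_lineno - node.lineno + 1
            if length > max_lines and ast.get_docstring(node) is None:
                found.append(node.name)
    return found


SAMPLE = '''
def short():
    return 1

def long_undocumented(x):
    a = x + 1
    b = a * 2
    c = b - 3
    d = c / 4
    return d

def long_documented(x):
    """Adds things up."""
    a = x + 1
    b = a * 2
    c = b - 3
    return c
'''

print(undocumented_long_defs(SAMPLE))
```

Only `long_undocumented` is reported: `short` is under the threshold and `long_documented` has a docstring.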

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145834
Approved by: https://github.com/amjames, https://github.com/eellison
2025-04-09 21:38:36 +00:00
357814c85c [AOTI] Remove typedef for half and bfloat16 (#150657)
Summary: typedef is prone to name collision. Explicitly spell out the actual aten types, needed for the libtorch-free codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150657
Approved by: https://github.com/malfet
2025-04-09 21:21:17 +00:00
d751698a36 Support negative values for fill with uint tensors (#144458)
Fixes https://github.com/pytorch/pytorch/issues/144188
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144458
Approved by: https://github.com/amjames, https://github.com/eellison
2025-04-09 21:08:06 +00:00
860765d621 update benchmark result due to <1% regression (#150937)
<img width="1503" alt="Screenshot 2025-04-09 at 9 07 13 AM" src="https://github.com/user-attachments/assets/e16f31b0-c5dc-4dd6-8adb-aac11ed988db" />

Regression from PR https://hud.pytorch.org/pr/148104,
which is acceptable, but we have to update the expected result to avoid flakiness in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150937
Approved by: https://github.com/zou3519
2025-04-09 20:25:48 +00:00
2b9d8a5633 Fix -Wmissing-braces in a few files (#150802)
Test Plan: Sandcastle

Reviewed By: wenxin0319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150802
Approved by: https://github.com/Skylion007
2025-04-09 20:15:34 +00:00
ea0cbba1fc [export] Refine draft-export CVE with Dim.AUTO (#150876)
Instead of using refine_dynamic_shapes_from_suggested_fixes to fix ConstraintViolationErrors in draft-export, we can just convert the dims to Dim.AUTO, which is less error prone
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150876
Approved by: https://github.com/pianpwk
2025-04-09 19:44:30 +00:00
f237ee54bf ProcessGroupGloo: support lazy_init (#150801)
This adds lazy initialization support to ProcessGroupGloo via `TORCH_GLOO_LAZY_INIT` or via `create_device(..., lazy_init=True)`

This is still a draft PR as there's one race condition when doing coalesced operations that needs to be fixed upstream in Gloo first. Depends on https://github.com/facebookincubator/gloo/pull/427 landing first

This also updates the gloo submodule to include the required changes.

Test plan:

added lazy init test variants

```
pytest -v test/distributed/test_c10d_gloo.py -k Lazy
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150801
Approved by: https://github.com/fduwjj
2025-04-09 19:29:50 +00:00
a4545f09da [Codemod][AddExplicitStrictExportForTrainingInferenceArg] caffe2/test/export (#150884)
Differential Revision: D72667175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150884
Approved by: https://github.com/ydwu4
2025-04-09 19:18:33 +00:00
cfab04d01b Fix aten.div type promotion for FakeTensor (#150874)
Summary:
When we divide a FakeTensor by an integer using the fast op implementation, the type promotion should be `ELEMENTWISE_TYPE_PROMOTION_KIND.INT_TO_FLOAT` so we get a float when dividing an int FakeTensor by an integer.

```
FAST = get_fast_op_impls()
fast_div = FAST[torch.ops.aten.div.Tensor]
fast_div(fake_tensor, some_int)
```

Test Plan:
```
python test/test_fake_tensor.py -k test_fast_div
```

Differential Revision: D72667430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150874
Approved by: https://github.com/angelayi
2025-04-09 18:52:01 +00:00
d3a2872c67 Hipify global scratch definition in AOTI codegen (#150893)
Summary: As the title says. A refactor is badly needed, I think, or at least the internal and external AOTI wrapper hipification methods should be unified.

Test Plan: P1780296121

Differential Revision: D72683568

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150893
Approved by: https://github.com/davidberard98
2025-04-09 18:35:36 +00:00
d04a6ec021 add reduce_scatter to symm mem ops (#150813)
+ a few small fixes (don't error out on 0-element tensors, a few more checks for contiguous outputs, more threads for better perf).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150813
Approved by: https://github.com/xw285cornell
2025-04-09 17:59:17 +00:00
cc185c32e0 [aoti] Use generate_fake_kernels_from_real_mismatches config for draft exported programs (#150651)
Summary:
Sometimes we get `MetadataMismatchError` in aoti compilation because draft export uses the flag below to infer the fake kernel when there’s a mismatch, but aoti doesn’t have this flag turned on.

https://fburl.com/code/9qzytl6q
 torch._functorch.config.generate_fake_kernels_from_real_mismatches

If we set this flag to True, then aoti compilation would work.

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts
```

Differential Revision: D72345085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150651
Approved by: https://github.com/angelayi
2025-04-09 17:28:29 +00:00
6fb089f2a2 [AO] fix per token block size calculation (#150890)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150890
Approved by: https://github.com/jerryzh168
2025-04-09 17:07:31 +00:00
c59aaa03ff [DTensor] add _explicit_order_placements util (#150493)
The util converts a list of placements in the traditional DTensor format
(e.g. [_StridedShard(0), Shard(0)], where list position is mesh_dim and sharding
is always applied left-to-right (from dim 0 to higher dims))

to a more explicitly ordered format, also replacing '_StridedShard' with
simple 'Shard' placements in the process.
(e.g. the above becomes [(1, Shard(0)), (0, Shard(0))], where the first
item in the tuple is the mesh_dim and the ordering of the tuples is the
sharding order).
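
A simplified sketch of the conversion, assuming stand-in placement classes. `Shard`, `StridedShard`, and `explicit_order` here are hypothetical and are not the DTensor types or the util added by this PR; split factors are not validated and mesh shapes are ignored:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    """Stand-in for DTensor's Shard placement (tensor dim only)."""
    dim: int


@dataclass(frozen=True)
class StridedShard:
    """Stand-in for DTensor's _StridedShard; the split factor is omitted here."""
    dim: int


def explicit_order(placements):
    """Return (mesh_dim, Shard) pairs in the order sharding is actually applied."""
    ordered = []
    deferred = defaultdict(list)  # tensor dim -> mesh dims holding strided shards
    for mesh_dim, placement in enumerate(placements):
        if isinstance(placement, StridedShard):
            # A strided shard is applied after the later plain Shard on the same dim.
            deferred[placement.dim].append(mesh_dim)
            continue
        ordered.append((mesh_dim, placement))
        if isinstance(placement, Shard):
            # Flush deferred strided shards, most recent first, as plain Shards.
            for strided_mesh_dim in reversed(deferred.pop(placement.dim, [])):
                ordered.append((strided_mesh_dim, Shard(placement.dim)))
    if deferred:
        raise ValueError("strided shard without a following Shard on the same dim")
    return ordered


print(explicit_order([StridedShard(0), Shard(0)]))
```

For the example from the text, the sketch yields mesh dim 1 first and mesh dim 0 second, both as plain `Shard(0)`.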

So far this is useful as a helper for fixing local shape computation for
strided sharding in the uneven shape case (in the following PR), but it may
also be useful more broadly if we can use explicit orderings to simplify
other parts of DTensor logic.

This skips implementing some combinations of _StridedSharding that are
not currently used in the wild today, but could be supported easily.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150493
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2025-04-09 16:55:24 +00:00
01568cb17a Revert "Refactor layout constraint selection logic (#148104)"
This reverts commit 2e7c9d33e7f933ac3b723cb3bb05b9c88432c25c.

Reverted https://github.com/pytorch/pytorch/pull/148104 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/14357056427/job/40251630946) [HUD commit link](2e7c9d33e7) ([comment](https://github.com/pytorch/pytorch/pull/148104#issuecomment-2790369493))
2025-04-09 16:49:48 +00:00
a0e796df03 Revert "Inductor respects exact strides on custom ops by default (#150511)"
This reverts commit a4bb2f106f8cc642539d4698b6d869a87adca92f.

Reverted https://github.com/pytorch/pytorch/pull/150511 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/14357056427/job/40251630946) [HUD commit link](2e7c9d33e7) ([comment](https://github.com/pytorch/pytorch/pull/148104#issuecomment-2790369493))
2025-04-09 16:49:48 +00:00
a4bb2f106f Inductor respects exact strides on custom ops by default (#150511)
If a tag is not specified on a custom operator, then inductor will
assume that it needs exact strides.

Test Plan:
- tests + CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150511
Approved by: https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #150495, #148104
2025-04-09 16:46:48 +00:00
c714d2fc0e [hop] support base_hop._gen_schema (#149688)
This PR creates two utils for generating a schema for hops from example inputs and uses base hop as an example.
1. HopArgumentInfoGen creates an argument or an output schema with mutation information.
2. CFunctionSchemaGen pieces together the argument info of inputs and outputs and produces torch._C.FunctionSchema.

The is_write attribute of argument info can be computed. Note that the is_write annotation only works when the inputs are flattened (e.g. it cannot support mutation inside a tuple). We need special handling for the case where we have tuple inputs, as with cond.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149688
Approved by: https://github.com/zou3519
2025-04-09 16:42:55 +00:00
72755a4b7a Avoid circular imports in tracing_state_functions (#150325)
tracing_state_functions references some torch functions from submodules like `torch.onnx.is_in_onnx_export` that could trigger module initialization & circular imports. I turned the mapping into a function so that the dictionary is not initialized at torch import.
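
The pattern can be sketched with the standard library. `state_functions` is a hypothetical name and `colorsys` stands in for a submodule whose import is costly or circular; this is not the dynamo code itself:

```python
import functools
import importlib


@functools.lru_cache(maxsize=None)
def state_functions():
    """Build the mapping on first call so the submodule import is deferred."""
    # `colorsys` is a stand-in for a submodule such as torch.onnx.
    colorsys = importlib.import_module("colorsys")
    return {colorsys.rgb_to_hsv: "hsv", colorsys.rgb_to_hls: "hls"}
```

Because the dictionary is built inside a cached function, nothing is imported when the defining module loads, and later calls return the same dictionary object.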

(discovered in https://github.com/pytorch/pytorch/pull/149646)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150325
Approved by: https://github.com/zou3519
2025-04-09 16:32:11 +00:00
8aaf296efc [c10d][fr] Refactor analysis script for modularization and reusing for coalesce collectives (#150881)
Trying to make the code of FR analysis more reusable and modularized. So we split core error analysis logic into separate functions.

This PR mostly shuffles the code around a bit.

Differential Revision: [D72690120](https://our.internmc.facebook.com/intern/diff/D72690120)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150881
Approved by: https://github.com/wz337
2025-04-09 16:10:19 +00:00
c8d37b9c85 [ez][c10d] Disable start event recording for coalesced col and improve profile title (#150863)
While looking at enabling FR analysis for coalesced collectives, I found that for slow-path coalescing (collectives which are not all-gather, all-reduce or reduce-scatter), we still record the start event. This is wrong; we should do the same thing as for endEvent recording.

And I made the profiler title more visible when we pass in the opType for coalesced all-gather and reduce-scatter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150863
Approved by: https://github.com/eqy, https://github.com/d4l3k, https://github.com/kwen2501
2025-04-09 16:09:56 +00:00
1a56609e75 [ONNX] Supporting different opset versions for torchlib registry (#149901)
- Allows opset_version to determine which onnx decomposition to choose
- Adds a cleanup function to modify the registry after it is built

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149901
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
2025-04-09 16:03:46 +00:00
97a5e5c6b3 Added _fused_sdp_choice_stub dispatcher support for HPU device (#149512)
Currently the HPU device has no support for the `_fused_sdp_choice_stub` dispatcher function, so `scaled_dot_product_attention` selects the `MATH` backend by default on HPU. This PR enables support for the `_fused_sdp_choice_stub` dispatcher function, so that any backend (for example math, flash_attention, efficient_attention, cudnn_attention, overrideable) can be invoked on HPU according to the user's choice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149512
Approved by: https://github.com/drisspg
2025-04-09 15:48:09 +00:00
d0e3482266 Update triton wheel build, setuptools pin (#150931)
Observing failure in release workflow:
https://github.com/pytorch/pytorch/actions/runs/14346340202/job/40216804374

```
Traceback (most recent call last):
  File "/opt/python/cp311-cp311/lib/python3.11/site-packages/wheel/bdist_wheel.py", line 11, in <module>
    from setuptools.command.bdist_wheel import bdist_wheel as bdist_wheel
ModuleNotFoundError: No module named 'setuptools.command.bdist_wheel'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/tmppwpqef_x/triton/python/setup.py", line 27, in <module>
    from wheel.bdist_wheel import bdist_wheel
  File "/opt/python/cp311-cp311/lib/python3.11/site-packages/wheel/bdist_wheel.py", line 13, in <module>
    raise ImportError(ERROR) from exc
ImportError: The 'wheel.bdist_wheel' module has been removed.
Please update your setuptools to v70.1 or later.
If you're explicitly importing 'wheel.bdist_wheel', please update your import to point to 'setuptools.command.bdist_wheel' instead.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150931
Approved by: https://github.com/Skylion007
2025-04-09 15:26:07 +00:00
5a422150c3 Add torch.triu_indices, torch.tril_indices dtype description (#150749)
Fixes #150675

## Test Result

![image](https://github.com/user-attachments/assets/f30a0de0-6475-4d07-b441-15fffd453ba1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150749
Approved by: https://github.com/bdhirsh
2025-04-09 15:03:24 +00:00
246f3b6530 [Quant][PT2E][X86] enable qconv1d-relu fusion (#150751)
**Summary**
As the title.
- The `conv1d - relu` pattern will be annotated by the `X86InductorQuantizer`.
- The pattern will be fused as `qconv_pointwise` during lowering.

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qconv1d_relu_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150751
Approved by: https://github.com/jerryzh168, https://github.com/leslie-fang-intel
2025-04-09 14:42:02 +00:00
2299087220 [ROCm] Introduce AMD specific inductor gemm tuning (#147315)
Replaces https://github.com/pytorch/pytorch/pull/143286

Adds ROCm-specific MM configs for max-autotune, incorporating ROCm-specific Triton tuning kernargs such as waves_per_eu, kpack, and matrix_instr_nonkdim. This PR also allows tuning of GROUP_M in the Triton GEMM case.

Dynamo huggingface inference benchmarks:
`TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS="TRITON" python huggingface.py --performance --inference --bfloat16 --backend=inductor`

GEOMEAN speedup (before): | 1.35x
GEOMEAN speedup (after): | 1.42x

name | Eager - abs latency | old - abs_latency | old - speedup | new - abs_latency | new - speedup
-- | -- | -- | -- | -- | --
AlbertForMaskedLM | 26.22 | 26.52 | 98.86% | 24.58 | 106.67%
AlbertForQuestionAnswering | 25.96 | 26.40 | 98.33% | 24.10 | 107.73%
AllenaiLongformerBase | 21.03 | 10.65 | 197.50% | 10.49 | 200.58%
BartForCausalLM | 7.77 | 9.76 | 79.63% | 8.79 | 88.46%
BartForConditionalGeneration | 14.44 | 12.86 | 112.26% | 11.96 | 120.70%
BertForMaskedLM | 8.10 | 8.82 | 91.89% | 8.57 | 94.53%
BertForQuestionAnswering | 6.82 | 7.32 | 93.20% | 7.10 | 96.18%
BlenderbotForCausalLM | 10.97 | 11.39 | 96.34% | 10.10 | 108.65%
BlenderbotSmallForCausalLM | 5.91 | 5.44 | 108.72% | 4.82 | 122.67%
BlenderbotSmallForConditionalGeneration | 12.64 | 9.65 | 130.94% | 9.11 | 138.83%
CamemBert | 8.35 | 9.15 | 91.24% | 8.86 | 94.27%
DebertaForMaskedLM | 10.92 | 6.09 | 179.44% | 5.90 | 185.05%
DebertaForQuestionAnswering | 14.29 | 7.70 | 185.59% | 7.26 | 196.75%
DebertaV2ForMaskedLM | 15.47 | 10.22 | 151.32% | 9.34 | 165.55%
DebertaV2ForQuestionAnswering | 14.98 | 6.11 | 245.28% | 6.28 | 238.40%
DistilBertForMaskedLM | 8.37 | 8.70 | 96.30% | 8.22 | 101.92%
DistilBertForQuestionAnswering | 10.21 | 10.54 | 96.88% | 10.39 | 98.36%
DistillGPT2 | 8.77 | 6.78 | 129.40% | 6.31 | 138.88%
ElectraForCausalLM | 10.32 | 4.70 | 219.45% | 4.60 | 224.29%
ElectraForQuestionAnswering | 11.48 | 5.62 | 204.20% | 5.44 | 210.95%
GPT2ForSequenceClassification | 6.21 | 5.72 | 108.50% | 5.58 | 111.26%
GoogleFnet | 26.51 | 20.81 | 127.37% | 19.91 | 133.11%
LayoutLMForMaskedLM | 12.09 | 7.99 | 151.28% | 7.66 | 157.80%
LayoutLMForSequenceClassification | 10.62 | 6.49 | 163.67% | 6.25 | 169.95%
M2M100ForConditionalGeneration | 14.98 | 10.20 | 146.79% | 9.89 | 151.42%
MBartForCausalLM | 7.67 | 9.78 | 78.44% | 8.87 | 86.55%
MBartForConditionalGeneration | 13.45 | 12.69 | 105.99% | 12.03 | 111.82%
MT5ForConditionalGeneration | 19.96 | 5.32 | 375.37% | 5.08 | 393.01%
MegatronBertForCausalLM | 13.22 | 7.86 | 168.07% | 7.18 | 184.01%
MegatronBertForQuestionAnswering | 15.62 | 11.81 | 132.21% | 11.02 | 141.68%
MobileBertForMaskedLM | 26.63 | 10.82 | 245.99% | 11.95 | 222.73%
MobileBertForQuestionAnswering | 23.53 | 7.55 | 311.51% | 9.53 | 247.03%
OPTForCausalLM | 7.33 | 7.64 | 95.93% | 7.56 | 96.90%
PLBartForCausalLM | 8.73 | 7.63 | 114.40% | 7.37 | 118.58%
PLBartForConditionalGeneration | 10.46 | 8.50 | 122.98% | 8.16 | 128.13%
PegasusForCausalLM | 7.18 | 7.37 | 97.42% | 6.64 | 108.22%
PegasusForConditionalGeneration | 16.47 | 16.66 | 98.87% | 14.18 | 116.13%
RobertaForCausalLM | 10.30 | 9.95 | 103.52% | 9.52 | 108.25%
RobertaForQuestionAnswering | 6.37 | 7.13 | 89.28% | 6.79 | 93.87%
T5ForConditionalGeneration | 12.40 | 6.72 | 184.51% | 6.48 | 191.16%
T5Small | 12.02 | 6.66 | 180.55% | 6.32 | 190.33%
TrOCRForCausalLM | 14.12 | 13.31 | 106.11% | 12.45 | 113.41%
XGLMForCausalLM | 16.48 | 6.23 | 264.52% | 6.35 | 259.51%
XLNetLMHeadModel | 74.87 | 62.23 | 120.32% | 57.95 | 129.19%
YituTechConvBert | 20.21 | 10.50 | 192.48% | 9.97 | 202.72%

We are also seeing an improvement of ~9% on an internal addmm benchmark.
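
The speedup columns in the table above appear to be eager latency divided by compiled latency, to within rounding of the displayed values. A small sketch under that assumption, using three rows copied from the table (the subset is arbitrary, so its geomean is not the reported overall figure):

```python
from statistics import geometric_mean

# (model, eager latency, new compiled latency), copied from three table rows.
ROWS = [
    ("AlbertForMaskedLM", 26.22, 24.58),
    ("BartForCausalLM", 7.77, 8.79),
    ("MT5ForConditionalGeneration", 19.96, 5.08),
]


def speedup(eager, compiled):
    """Speedup over eager: values above 1.0 mean the compiled model is faster."""
    return eager / compiled


per_model = {name: speedup(eager, new) for name, eager, new in ROWS}
subset_geomean = geometric_mean(list(per_model.values()))
print(per_model, subset_geomean)
```

A geometric mean is the usual choice for averaging ratios like these, since it treats a 2x speedup and a 2x slowdown symmetrically.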

This PR will also slightly reduce compilation time for AMD max-autotune: before this change we assessed every config with matrix_instr_nonkdim in [0, 16], whereas with this update we use 16 for all configs.

There is currently no CI to test max-autotune perf, but this will be enabled via https://github.com/pytorch/pytorch/pull/148672, after which we can investigate more tuning updates and config pruning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147315
Approved by: https://github.com/jansel, https://github.com/eellison
2025-04-09 14:34:30 +00:00
886d9acb0d [docs] Add 32-bit complex to the list of dtypes (#144590)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144590
Approved by: https://github.com/janeyx99
2025-04-09 13:10:21 +00:00
64ac41f68d [pytorch] add header docs for TORCH_LIBRARY_THREAD_UNSAFE_LAZY_INIT (#150854)
Summary: Add header docs for the experimental TORCH_LIBRARY_THREAD_UNSAFE_LAZY_INIT feature, and guard behind C10_MOBILE.

Reviewed By: albanD

Differential Revision: D72572345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150854
Approved by: https://github.com/larryliu0820, https://github.com/zou3519
2025-04-09 12:59:24 +00:00
cyy
142f0f86ce Enable modernize-use-default-member-init (#149046)
``modernize-use-default-member-init`` prefers initialisation in class members, which makes more ``= default`` constructors possible. Some violations of modernize rules have also been fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149046
Approved by: https://github.com/zou3519
2025-04-09 11:57:24 +00:00
81f60f3880 Expand allowed_getattr_types_for_subgm to torch.Tensor (#150867)
Summary:
att

regular weight has the type of torch.nn.parameter.Parameter
buffer and tensor constant has the type of torch.Tensor

both types are valid.

Test Plan: CI

Differential Revision: D72657275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150867
Approved by: https://github.com/zhxchen17
2025-04-09 11:01:45 +00:00
604467de20 Code Clean: Remove specific bytecode support in dynamo for python3.8 (#150838)
Related Bytecode:
- CALL_FINALLY
- END_FINALLY
- POP_FINALLY

The bytecodes above were removed in Python 3.9; refer to [this](53908bd790/Misc/NEWS.d/3.9.0a2.rst) for more information.
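
A quick way to confirm this on the running interpreter, using only the standard `dis` module (the tuple name `REMOVED_IN_39` is just a label for this sketch):

```python
import dis
import sys

REMOVED_IN_39 = ("CALL_FINALLY", "END_FINALLY", "POP_FINALLY")

# dis.opmap maps opcode names to numbers for the running interpreter.
still_present = [name for name in REMOVED_IN_39 if name in dis.opmap]
print(sys.version_info[:2], still_present)
```

On Python 3.9 or later the list should be empty, which is why the corresponding dynamo handlers are dead code.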

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150838
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #150834
2025-04-09 07:16:52 +00:00
b01877aa13 Fix addbmm & addmv & baddbmm out dtype check (#148176)
----

- torch.addbmm
- torch.addmv
- torch.baddbmm

ISSUE related:
https://github.com/pytorch/pytorch/issues/138399
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148176
Approved by: https://github.com/jansel
ghstack dependencies: #148174
2025-04-09 07:02:56 +00:00
4d6ff6ca5c Fill config2launcher with correct launchers during cache hit coordinate descent (#150860)
This bug was crazy hard to reproduce, so I can't seem to get a unit test written that isn't the internal one I used for debugging.

Here's a short TLDR of the bug:

- Due to D71983456(OSS: https://github.com/pytorch/pytorch/pull/149910), we cache CachingAutotuners in memory.
- Importantly: **Saving stuff in PyCodeCache in memory is not semantically equivalent to writing to disk**. By saving it in memory, CachingAutotuners do not reset global state.
- It's possible, through recompiles, for different dynamo frames to compile down to exactly the same inductor output code. This involves models that run multiple times but differ very subtly, or in ways that cause a dynamo guard failure but not different inductor output code.
- Because of this, we reuse CachingAutotuners for a second compile (with different example inputs, just the same triton kernel code)
- CachingAutotuners have a Coordinate Descent class on them, which has a cache: https://fburl.com/code/4igrsams (OSS: aafc4b6188/torch/_inductor/runtime/coordinate_descent_tuner.py (L69))
- Because we are caching these in memory and not on disk, this cache is **not cleared** between runs.
- However, this variable is *not* saved on the class, and is reinitialized every time we do autotuning: https://fburl.com/code/n2o8tmje
(OSS: aafc4b6188/torch/_inductor/runtime/triton_heuristics.py (L933))
- `config2launcher` is added when we call `benchmark_one_config`, but on a CoorDesc *cache hit*, we never call `benchmark_one_config`! So we end up returning None, and erroring with:

```
AttributeError: 'NoneType' object has no attribute 'store_cubin'
```

This fixes the problem for now by just recompiling the launcher. Technically, we might be able to save config2launcher on the class to avoid this, but I don't want to risk another weird cache safety bug here, so taking the simpler approach for now.
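
The failure mode and the fix can be modelled with a toy class. `ToyAutotuner` and all of its members are hypothetical and heavily simplified; this is not Inductor's `CachingAutotuner`, only the shape of the bug (a persistent best-config cache combined with a per-call launcher map):

```python
class ToyAutotuner:
    """Toy model of the failure mode; not Inductor's CachingAutotuner."""

    def __init__(self):
        self.best_config_cache = {}  # survives across autotune() calls

    def compile_launcher(self, config):
        return f"launcher<{config}>"

    def autotune(self, key, configs, recompile_on_hit):
        config2launcher = {}  # rebuilt on every call, like the local in the bug

        def benchmark(config):
            # Benchmarking is the only place that normally fills config2launcher.
            config2launcher[config] = self.compile_launcher(config)
            return len(config)  # fake timing

        if key in self.best_config_cache:
            best = self.best_config_cache[key]  # cache hit: benchmark() never runs
            if recompile_on_hit and best not in config2launcher:
                config2launcher[best] = self.compile_launcher(best)
        else:
            best = min(configs, key=benchmark)
            self.best_config_cache[key] = best
        return config2launcher.get(best)


tuner = ToyAutotuner()
first = tuner.autotune("kernel", ["cfg_a", "cfg_bb"], recompile_on_hit=False)
buggy = tuner.autotune("kernel", ["cfg_a", "cfg_bb"], recompile_on_hit=False)
fixed = tuner.autotune("kernel", ["cfg_a", "cfg_bb"], recompile_on_hit=True)
print(first, buggy, fixed)
```

The second call hits the cache and returns `None` because nothing populated the launcher map; recompiling the launcher on a hit restores a usable result.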

Note that this error only reproduces if:
- None of AOTAutogradCache, FXgraphCache hit on the second entry: otherwise, the CachingAutotuner will go through a pickling and then not be saved in memory
- We haven't spawned parallel compile workers. If there are parallel compile workers, we pickle the autotuner on the way from the worker to the parent process, once again resetting the Autotuner.
- The autotune cache doesn't already have the best config stored in it

So it was extraordinarily hard to debug/reproduce. Because of this, I have a complicated internal unit test but no OSS test that can trigger the exact problem. I'll work on a separate test later, but this needs to go in to fix a sev, so we're landing it based on an internal test only.

Differential Revision: [D72655382](https://our.internmc.facebook.com/intern/diff/D72655382/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D72655382/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150860
Approved by: https://github.com/oulgen
2025-04-09 04:39:37 +00:00
bc47d539fc [MPS] Support ArgumentBuffer bindings from C++/Python (#150780)
To work around the limitation of 32 arguments per kernel and eventually be able to compile something like
```python
import torch

def foo(*args):
  rc = torch.empty_like(args[0])
  for arg in args:
      rc += arg
  return rc

tensors = torch.rand(100, 32, device='mps').unbind(0)
print(torch.compile(foo)(*tensors))
```

For now, introduce `at::native::metal::get_tensor_gpu_address` and use it from both C++ test and compile_shader to convert list of tensors to list of pointers valid on GPU.

Initially this binding was done via `id<MTLArgumentEncoder>`, but according to the [Improving CPU Performance by Using Argument Buffers](https://developer.apple.com/documentation/metal/improving-cpu-performance-by-using-argument-buffers?language=objc#Encode-Resources-into-Argument-Buffers) article, this is not necessary when targeting Tier 2-only devices (which is true of all devices on macOS 13 or newer):
> To directly encode the argument buffer resources on these Tier 2 devices, write the [MTLBuffer](https://developer.apple.com/documentation/metal/mtlbuffer?language=objc).[gpuAddress](https://developer.apple.com/documentation/metal/mtlbuffer/gpuaddress?language=objc) property — and for other resource types (samplers, textures, and acceleration structures), the [gpuResourceID](https://developer.apple.com/documentation/metal/mtlcomputepipelinestate/gpuresourceid?language=objc) property — into the corresponding structure member. To encode offsets, treat these property values as uint64 types and add the offset to them.

Add both C++ and Python unit tests that validate that this works.
Please note that neither using ArgumentEncoder nor directly encoding the data guarantees that the buffer will not be freed before shader execution is complete. On the other hand, this should already be guaranteed by MPSCachingAllocator, which only frees the memory after all streams have completed their execution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150780
Approved by: https://github.com/dcci
2025-04-09 04:24:37 +00:00
2e7c9d33e7 Refactor layout constraint selection logic (#148104)
This PR:

- cleans up some existing comments that don't make sense anymore
- hooks the "custom_op_default_layout_constraint" back up (it seems to
have broken)
- cleans up the "lazy registration path", which seems to never get hit
anymore
- adds dislike_padding to nodes that require exact strides

Test Plan:
- tests + CI

disable padding

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148104
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #150495
2025-04-09 02:09:18 +00:00
44deb67830 Fix _del_library (#150495)
On library deletion, we need to clear fx's schema cache.

Test Plan:
- top PR in the stack, I don't have a good test case for this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150495
Approved by: https://github.com/eellison
2025-04-09 02:09:18 +00:00
5f18b7d877 [docs] remove --recursive flag from readme (#150785)
Fixes #150745

See https://github.com/pytorch/pytorch/issues/150745#issuecomment-2784216663

Cloning with `--recursive` as shown in the docs prevents users from checking out commits from before NCCL was removed as a submodule.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150785
Approved by: https://github.com/atalman
2025-04-09 02:07:48 +00:00
d9f47c75de Revert "Fixing NCCL abort hang issue when a ProcessGroupNCCL manages multiple ncclComms (#150690)"
This reverts commit 91173ff89aab5f632d483c736d11d5dcf60decac.

Reverted https://github.com/pytorch/pytorch/pull/150690 on behalf of https://github.com/atalman due to failing internal test ([comment](https://github.com/pytorch/pytorch/pull/150690#issuecomment-2787905966))
2025-04-09 00:06:32 +00:00
27ded359a5 Fix inplacing with multiple, fused uses (#150845)
We had `can_inplace` defined on a single use. When that buffer has multiple uses inside a fused node, we need to check if the other accesses have the same index. Otherwise we may read memory that has already been written to from inplacing.
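
A toy illustration of the hazard, using plain Python lists. The `can_inplace` helper below is hypothetical and compares index expressions as strings; it is not Inductor's check, only the idea that every read of the reused buffer must use the same index as the write:

```python
def reversed_sum_out_of_place(x):
    """out[i] = x[i] + x[n-1-i], written to a fresh buffer."""
    n = len(x)
    return [x[i] + x[n - 1 - i] for i in range(n)]


def reversed_sum_in_place(x):
    """Same computation, but the output reuses the input buffer."""
    buf = list(x)
    n = len(buf)
    for i in range(n):
        # The second read uses a different index and may see an overwritten slot.
        buf[i] = buf[i] + buf[n - 1 - i]
    return buf


def can_inplace(read_indices, write_index):
    """Hypothetical check: every read of the buffer must use the write's index."""
    return all(index == write_index for index in read_indices)


data = [1, 2, 3, 4]
print(reversed_sum_out_of_place(data), reversed_sum_in_place(data))
print(can_inplace(["i", "n - 1 - i"], "i"), can_inplace(["i", "i"], "i"))
```

The in-place version diverges from the out-of-place one as soon as a read touches a slot that an earlier iteration already overwrote.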

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150845
Approved by: https://github.com/zou3519, https://github.com/exclamaforte, https://github.com/atalman, https://github.com/jansel
2025-04-09 00:05:07 +00:00
89505f4498 [AOTI] Always use oss schema for ExternKernelNodes serialization (#150197)
Summary: Added a `protocol` field to `ExternKernelNodes`; from now on, all lowering passes will always use the OSS schema to serialize external kernel nodes.

Test Plan: CI

Differential Revision: D72020444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150197
Approved by: https://github.com/zhxchen17
2025-04-08 22:35:28 +00:00
17f9276e29 Code Clean: Remove python3.8 specific code because PyTorch now needs Python3.9 and later (#150834)
As the title states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150834
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-04-08 18:53:55 +00:00
901b02cf16 [Inductor] fix alignment assumption for fallback (#150777)
Inductor currently only works properly for fallback kernels producing aligned output.
When Inductor creates the layout for a fallback kernel's output, it does not add the tensor offset to the layout [link](2a1e2b88ed/torch/_inductor/ir.py (L6935-L6941)). Thus unaligned output will be treated as aligned. Adding the offset to the layout directly does not work, since that changes the index expression in the generated kernel and we may apply the offset twice: Triton already considers the offset when passing in the data_ptr.

To solve this issue, we track the unaligned buffer names instead.

This potentially can fix the internal issues we are debugging here: https://fb.workplace.com/groups/1075192433118967/permalink/1618308128807392/

Differential Revision: [D72600784](https://our.internmc.facebook.com/intern/diff/D72600784)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150777
Approved by: https://github.com/eellison, https://github.com/jansel
2025-04-08 18:49:44 +00:00
c36d9b0d8d [Codemod][AddExplicitStrictExportForTrainingInferenceArg] caffe2/torch/ao (#150826)
Differential Revision: D72615631

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150826
Approved by: https://github.com/ydwu4
2025-04-08 18:49:22 +00:00
aafc4b6188 Do not depend on numpy during the import (#150816)
Summary:
Related issue: https://github.com/pytorch/pytorch/issues/149681

We can follow up with a different implementation that does not use numpy(potentially with Torch primitives).

Test Plan:
pending:

contbuild & OSS CI

Differential Revision: D72609835

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150816
Approved by: https://github.com/jerryzh168, https://github.com/cyyever, https://github.com/albanD
2025-04-08 18:12:53 +00:00
e6bd133866 add batching rule for torch.Tensor.scatter_add_ (#150543)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150543
Approved by: https://github.com/zou3519
2025-04-08 18:00:10 +00:00
97759614c2 [dynamo] reconstruct functions decorated in the compiled region properly (#150645)
We were previously unable to reconstruct functions that were decorated in the compiled region.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150645
Approved by: https://github.com/jansel
2025-04-08 17:32:46 +00:00
4926bd6004 Revert "Fix the Problems About Defining Static Variable in Inline Function (#147095)"
This reverts commit 3da14d38bd396f5bbe8494872d1509efa1a6f048.

Reverted https://github.com/pytorch/pytorch/pull/147095 on behalf of https://github.com/atalman due to breaks internally ([comment](https://github.com/pytorch/pytorch/pull/147095#issuecomment-2787129770))
2025-04-08 17:10:36 +00:00
3e0038ae85 Fix torch.matmul related out dtype check (#148174)
----

- torch.matmul -> CompositeImplicitAutograd -> dot_out (when left_dim == 1 & right_dim == 1)
                                            -> mv_out (when left_dim == 2 & right_dim == 1)
                                            -> mm_out (when left_dim == 1 & right_dim == 2)
                                            -> ...
- torch.dot
- torch.vdot
- torch.mm
- torch.mv

ISSUE related:
https://github.com/pytorch/pytorch/issues/138399
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148174
Approved by: https://github.com/jansel
2025-04-08 17:00:28 +00:00
173f126068 [invoke_subgraph] Preserve node meta (#150782)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150782
Approved by: https://github.com/bdhirsh
ghstack dependencies: #150666
2025-04-08 16:57:39 +00:00
4447352e64 Revert "[CUDA] Only use vec128 if CUDA version is newer than 12.8 (#150705)"
This reverts commit 5228986c395dc79f90d2a2b991deea1eef188260.

Reverted https://github.com/pytorch/pytorch/pull/150705 on behalf of https://github.com/atalman due to break periodic tests ([comment](https://github.com/pytorch/pytorch/pull/150705#issuecomment-2787017751))
2025-04-08 16:29:05 +00:00
97f34f0125 [ROCm][Windows] Include AOTriton dependent sources in Windows build (#150521)
Includes the hipified ATen native transformers sources in the ROCm+Windows build. They were removed because Triton is not available on Windows, but removing them causes further linker errors. Setting `USE_FLASH_ATTENTION=0` and `USE_MEM_EFF_ATTENTION=0` during the build mitigates the missing headers without causing any linker errors, so we will use this approach for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150521
Approved by: https://github.com/jeffdaily
2025-04-08 16:18:15 +00:00
1239260a0e [Accelerator][Chore] Use existing acc when raising an error (#150829)
As the title said, `acc` already exists so we just use it instead of calling `current_accelerator()` again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150829
Approved by: https://github.com/guangyey, https://github.com/Skylion007
2025-04-08 16:05:06 +00:00
ec5f2e3028 [Build] Fix fbgemm build with gcc-12+ (#150847)
By suppressing more warnings

TODO: fbgemm pin really needs to get updated

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150847
Approved by: https://github.com/atalman, https://github.com/Skylion007
2025-04-08 16:03:40 +00:00
52d172eafd Facilitate at::_weight_int4pack_mm_with_scale_and_zeros related registration (#147962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147962
Approved by: https://github.com/jerryzh168, https://github.com/guangyey, https://github.com/EikanWang
ghstack dependencies: #137566

Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
2025-04-08 15:36:07 +00:00
da7322548b [Intel GPU] int4 WOQ gemm XPU Support (#137566)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137566
Approved by: https://github.com/liangan1, https://github.com/guangyey, https://github.com/EikanWang

Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
2025-04-08 15:36:06 +00:00
05365e380d Remove torch functions that do not support device arguments from _device_constructor (#150290)
As the title stated

In Addition:
- I have checked all the functions in _device_constructor and found that ``torch.vander`` also doesn't support device arguments
- Remove duplicated functions such as torch.ones and torch.asarray

Related issue:https://github.com/pytorch/pytorch/issues/150284
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150290
Approved by: https://github.com/albanD
2025-04-08 15:13:55 +00:00
a402c2f203 Remove redundant code in cuda/__init__.py (#150529)
As the title stated.

Follow: https://github.com/pytorch/pytorch/pull/147078
Fix issue: https://github.com/pytorch/pytorch/issues/150519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150529
Approved by: https://github.com/eqy
2025-04-08 15:03:21 +00:00
ad516180e0 Update CPython tests for ctx manager to use unittest (#146501)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146501
Approved by: https://github.com/zou3519
ghstack dependencies: #146500
2025-04-08 14:55:17 +00:00
f3b2fb6c66 Allow trace through unittest (#146500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146500
Approved by: https://github.com/anijain2305
2025-04-08 14:55:17 +00:00
1791b4150b Clarify behavior of TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK (#150682)
I still don't really understand the original purpose of that env var, but it appears that its usage is completely disconnected from MemPools and from `ncclMemAlloc`/`Free`. In fact, when that env var is set, we invoke `ncclCommRegister` for _all_ NCCL communicators for _all_ the memory segments managed by the allocator (both the global ones, allocated with `cudaMalloc`, and the ones in private MemPools), and we do that both for the segments that already exist when the PG is initialized and for all segments that will be allocated later.

I'm reworking the code a bit, by using a few helper functions, whose name should make this behavior clearer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150682
Approved by: https://github.com/kwen2501
ghstack dependencies: #150681
2025-04-08 13:00:59 +00:00
3649e2e7bd Safer bookkeeping of NCCL communicators (#150681)
This consists mainly of two changes:
- ensure we can reliably obtain the device from a `NCCLComm` object (there was one constructor which didn't set the device)
- use a RAII pattern for acquiring the lock to the global dictionary of `NCCLComms` (which ensures the lock is released in case of exceptions)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150681
Approved by: https://github.com/kwen2501
2025-04-08 11:12:37 +00:00
3da14d38bd Fix the Problems About Defining Static Variable in Inline Function (#147095)
Refer to https://github.com/pytorch/pytorch/issues/125465 for more information

- Remove unused header files
- Move the inline function that defines the static variable to .cc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147095
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-04-08 10:23:02 +00:00
881d99495d Add more check for torch.ormqr (#150759)
As the title stated.

Please refer to https://github.com/pytorch/pytorch/issues/150674 for more info.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150759
Approved by: https://github.com/lezcano
2025-04-08 08:26:05 +00:00
a106842ea8 [XPU] Fix XPU unit test on Windows (#150520)
This PR is to resolve issue reported in https://github.com/intel/torch-xpu-ops/issues/1478

There are two cases failing in our Windows CI enabling.

- **test_xpu.py::TestXpuXPU::test_lazy_init_xpu** needs an `if __name__ == '__main__':` guard on Windows when using multiprocessing. Refer to https://stackoverflow.com/a/18205006
```
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
Traceback (most recent call last):
  File "C:\Users\sdp\lufengqing\torch-xpu-ops\test\xpu\xpu_test_utils.py", line 24, in <module>
    test_multi_process(model, input)
  File "C:\Users\sdp\lufengqing\torch-xpu-ops\test\xpu\xpu_test_utils.py", line 16, in test_multi_process
    assert p.exitcode == 0
AssertionError
```

- **test_xpu.py::TestXpuXPU::test_wrong_xpu_fork_xpu** is a Linux-only test case, so we should skip it on Windows. Refer to 248487f455/test/test_multiprocessing.py (L609)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150520
Approved by: https://github.com/guangyey, https://github.com/EikanWang
2025-04-08 07:02:40 +00:00
58ede0cca3 [Inductor XPU] Refine test_mkldnn_pattern_matcher.py to be reusable for XPU. (#150286)
This PR extracts some test cases from TestPatternMatcher into a newly created TestPatternMatcherGeneric, and uses instantiate_device_type_tests to make them reusable across multiple devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150286
Approved by: https://github.com/jansel
2025-04-08 05:42:44 +00:00
f8aa6404ac Refactor: add initialization of math.lcm into torch_c_binding_in_graph_functions (#150766)
As the title stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150766
Approved by: https://github.com/aorenste, https://github.com/jansel
2025-04-08 04:12:26 +00:00
c9c0f8eae3 Add plot for torch.nn.Threshold and torch.nn.GLU (#150171)
Fixes #150170

## Changes

- Add plot for `torch.nn.Threshold` and `torch.nn.GLU`
- Add example output so users can see the expected results more easily

## Test Result

![image](https://github.com/user-attachments/assets/f6c5bc46-f9b7-4db7-9797-e08d8423d1b3)

![image](https://github.com/user-attachments/assets/ad4e6c84-7b29-44f1-b7bd-9c81e4a92ef8)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150171
Approved by: https://github.com/albanD
2025-04-08 03:55:37 +00:00
7e11089fe5 Optimize dataloader Self typing (#146816)
Optimize `dataloader.py` method return type with Self typing
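
As an illustration of what `Self` typing buys, here is a minimal sketch with hypothetical classes (`CountingIter`, `LoggingIter`), not the actual `dataloader.py` code. Annotating `__iter__` with `Self` lets a type checker infer the subclass type for `iter(subclass_instance)`.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by type checkers; `typing.Self` needs Python 3.11+
    # (older versions can import it from typing_extensions).
    from typing import Self


class CountingIter:
    """Hypothetical iterator, standing in for a DataLoader-style class."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.i = 0

    def __iter__(self) -> "Self":
        # "Self" means "the class of this instance", so a subclass is typed
        # as returning the subclass rather than CountingIter.
        return self

    def __next__(self) -> int:
        if self.i >= self.n:
            raise StopIteration
        self.i += 1
        return self.i - 1


class LoggingIter(CountingIter):
    pass


it = iter(LoggingIter(3))
print(type(it).__name__, list(it))
```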

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146816
Approved by: https://github.com/albanD
2025-04-08 03:52:23 +00:00
836955bdbd [Manylinux 2.28] Correct Linux aarch64 cuda binaries wheel name (#150786)
Related to: https://github.com/pytorch/pytorch/issues/149044#issuecomment-2784044555
For CPU binaries we run auditwheel; however, for CUDA binaries auditwheel produces invalid results, so we need to rename the file instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150786
Approved by: https://github.com/malfet
2025-04-08 02:58:28 +00:00
73b4938f7c [cuda] Add new faster gammabeta backward kernel (#148605) (Reapply with launch bounds) (#150625)
# Changes over the previous PR

This reverts commit 61a1f09 and adds `__launch_bounds__` to the kernel.

Previously I merged 114d404 that did not work on Blackwell because it consumed too many registers. It got reverted in 61a1f09. For more context see: https://github.com/pytorch/pytorch/issues/150266.

This PR reverts the revert (i.e. reapplies the original diff), with one additional line with `__launch_bounds__` added:

```
git diff HEAD^
diff --git a/aten/src/ATen/native/cuda/layer_norm_kernel.cu b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
index 0d63a2f979c..3ce2c24c18e 100644
--- a/aten/src/ATen/native/cuda/layer_norm_kernel.cu
+++ b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
@@ -657,6 +657,7 @@ bool aligned_grid
 >
 __global__
 void
+__launch_bounds__(block_dim_x * block_dim_y)
  GammaBetaBackwardCUDAKernelTemplate(
     int64_t M,
     int64_t N,
```

I managed to get a Blackwell machine and verified that the fix works. The fix was verified using this repro that I got from @drisspg

<details>
<summary> Repro script that fails on Blackwell </summary>

```
import torch
from torch.nn import init
# from transformer_nuggets import init_logging
# from transformer_nuggets.utils.benchmark import profiler
# from pathlib import Path

# init_logging()

class PermuteModule(torch.nn.Module):
    def __init__(self, permutation):
        super(PermuteModule, self).__init__()
        self.permutation = permutation
    def forward(self, x:torch.Tensor) -> torch.Tensor:
        assert len(x.shape) == len(self.permutation), f"Dimension mismatch! Unable to permute {len(x.shape)} dim input with a {len(self.permutation)} dim permutation!"
        return x.permute(*self.permutation)

def test(n_layers:int, conv_stride:int):
    _sequence = []
    for _ in range(n_layers):
        # Conv1d inputs are (N x C x L), LayerNorm expects (* x C). Dims must be permuted between modules.
        _sequence += [
            PermuteModule((0,2,1)),
            torch.nn.Conv1d(in_channels=512, out_channels=512, groups=1, kernel_size=9, dilation=1, stride=conv_stride, padding=0, bias=False),
            PermuteModule((0,2,1)),
            torch.nn.LayerNorm(512),
            torch.nn.ReLU()
        ]
    model = torch.nn.Sequential(*_sequence).to(device="cuda")
    data = torch.randn((100,2048,512), device="cuda")
    out = model(data)
    loss = torch.nn.functional.mse_loss(out, torch.rand_like(out))
    loss.backward()

torch.autograd.set_detect_anomaly(True)
print(f"Torch version: {torch.__version__}")

# with profiler(Path("conv")):
#     # print(f"layers=1, stride=1")
#     # test(n_layers=1, conv_stride=1)
#     # print(f"layers=2, stride=1")
#     # test(n_layers=2, conv_stride=1)
#     # print(f"layers=1, stride=2")
#     # test(n_layers=1, conv_stride=2)
#     print(f"layers=2, stride=2")
#     test(n_layers=2, conv_stride=2)

print(f"layers=2, stride=2")
test(n_layers=2, conv_stride=2)
# we will not reach this print statement.
print("DONE.")
```

</details>

I also re-ran my performance benchmark and found no regressions over the previous PR.

# Full description of the old PR

Original PR: https://github.com/pytorch/pytorch/pull/148605

This PR adds a new kernel for producing gamma and beta values for the backward pass in a performant way.

To test the performance against the baseline, I measured the backward pass of layernorm while sweeping over the following variables:

1. dtype in {half, float}
2. M in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
3. N in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
4. Whether we flush the L2 cache before running the backward pass
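
A sweep like the one listed above could be generated roughly as follows. This is a sketch, not the benchmark used for the PR: `sweep_sizes` and `sweep_configs` are hypothetical helpers, and the range of `k` is a placeholder because the original range is elided.

```python
from itertools import product


def sweep_sizes(k_min: int, k_max: int) -> list:
    """Return the sorted, de-duplicated values 2**k - 1, 2**k, 2**k + 1."""
    sizes = set()
    for k in range(k_min, k_max):
        sizes.update((2**k - 1, 2**k, 2**k + 1))
    return sorted(s for s in sizes if s > 0)


def sweep_configs(k_min: int, k_max: int):
    """Yield every (dtype, M, N, flush_l2_cache) combination."""
    sizes = sweep_sizes(k_min, k_max)
    return product(("half", "float"), sizes, sizes, (False, True))


# Placeholder range; the real sweep's bounds are not given in the text.
print(sweep_sizes(1, 4))
print(sum(1 for _ in sweep_configs(1, 4)), "configurations")
```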

Summary: The new code performs better than the old code, especially for powers of 2. For M >> N case, it performs very well (kernel itself can be 30x faster and the overall backward pass can be 5-10x faster).

In order to visualize results of the kernel when choosing different values of M, N and dtype, I wrote some code to generate a heatmap. The heatmap has N on the x-axis, M on the y-axis and color-coded points where green shows performance improvement and red shows regressions. For example, `m=32 n=2048 1.42x` in the heatmap would indicate the normalized shape had 32 elements. The leading dimensions' product was 2048 elements and the new kernel resulted in the *backward pass* being 1.42x faster than the old *backward pass*.

Important note: This heatmap shows the total backward pass time as seen by the user. The kernel time difference can be sometimes very large while the total backward pass time is not that high. For example, for dtype=torch.half, M=32 N=2048, flush_l2_cache=True case, the heatmap shows a speedup of 1.42x, while ncu tells me the new kernel is 2.5x faster than the old:

M=32 N=2048 dtype=half flush_l2=True Old Kernel NCU summary:
```
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.59
    SM Frequency                    Ghz         1.35
    Elapsed Cycles                cycle       27,526
    Memory Throughput                 %         2.21
    DRAM Throughput                   %         0.54
    Duration                         us        20.42
    L1/TEX Cache Throughput           %         4.31
    L2 Cache Throughput               %         2.62
    SM Active Cycles              cycle     1,475.02
    Compute (SM) Throughput           %         0.29
    ----------------------- ----------- ------------
```

M=32 N=2048 dtype=half flush_l2=True New Kernel NCU summary:
```
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.59
    SM Frequency                    Ghz         1.34
    Elapsed Cycles                cycle       10,920
    Memory Throughput                 %         5.64
    DRAM Throughput                   %         1.35
    Duration                         us         8.13
    L1/TEX Cache Throughput           %         1.92
    L2 Cache Throughput               %         6.89
    SM Active Cycles              cycle     3,554.41
    Compute (SM) Throughput           %         0.67
    ----------------------- ----------- ------------
```
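
The kernel-level figure quoted above can be checked directly from the two summaries. The snippet below only redoes that arithmetic with the "Duration" and "Elapsed Cycles" rows; the variable names are chosen here for illustration.

```python
# Values copied from the old-kernel and new-kernel NCU summaries above.
old_us, new_us = 20.42, 8.13             # "Duration" rows, in microseconds
old_cycles, new_cycles = 27_526, 10_920  # "Elapsed Cycles" rows

duration_ratio = old_us / new_us
cycle_ratio = old_cycles / new_cycles

print(f"duration ratio: {duration_ratio:.2f}x")
print(f"cycle ratio:    {cycle_ratio:.2f}x")
```

Both ratios come out at roughly 2.5x, which is the kernel-only speedup, as opposed to the 1.42x seen for the whole backward pass.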

Let's look at some rows from the heatmap. For dtype=float16 flush_l2_cache=True and when input shapes are powers of 2, we get the following:

<img width="1508" alt="image" src="https://github.com/user-attachments/assets/06179599-b2f0-4a45-8664-247a1067950b" />

There are 3 columns -- the first shows all data points, the second shows speedups only and the 3rd column shows regressions only. We can see that there are dramatic speedups for M >> N cases and the regressions are not that high (less than 1%, which could just be measurement noise). Here is a small guide I made:

![image](https://github.com/user-attachments/assets/90c26f7c-e3ad-46d2-a6ce-fe4b5fb3d738)

For dtype=float32, we get a similar chart:

<img width="1499" alt="image" src="https://github.com/user-attachments/assets/c4d31a76-03b0-426c-9114-e1bfad29b530" />

The new code performs especially well for m >> n cases, and also where m and n are small. The m >> n case is special because we run 2 reduction kernels back to back and parallelize in the "M" dimension (the older kernel only parallelized in the "N" dimension).

The new code can sometimes have regressions for non-powers of 2. That is because the old code was using block sizes of {16, 32} while we have `threads.x = 32`. For example when N=33, the old code would have 3 blocks and we will have 2 blocks. I wrote some code to specialize for this case, but I think it will add complexity and @ngimel mentioned that non-powers of 2 are rare enough.
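
The block counts in the N=33 example follow from a ceiling division. The sketch below assumes the old code's 3 blocks correspond to a block width of 16 and the new code's 2 blocks to `threads.x = 32`. `ceil_div` and `launch_shape` are hypothetical helpers, and the idle-lane count is one plausible way to read the regression rather than a measured cause.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def launch_shape(n: int, block_width: int):
    """Return (number of blocks, idle lanes in the last block) along N."""
    blocks = ceil_div(n, block_width)
    idle = blocks * block_width - n
    return blocks, idle


for label, width in (("old, width 16", 16), ("new, width 32", 32)):
    blocks, idle = launch_shape(33, width)
    print(f"{label}: {blocks} blocks, {idle} idle lanes")
```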

I am including the regressions here for completeness' sake:

<img width="1500" alt="image" src="https://github.com/user-attachments/assets/31c17cfb-ed9b-4106-b9c8-5c359751f530" />

To see this better:

1. Click the image
2. Right click the expanded image and open in a new tab
3. Go to that tab and left click once to zoom in

If you want to see the full data, here it is:

![image](https://github.com/user-attachments/assets/54fb60c9-8c0c-4530-a1dd-79ecda1a69a1)

I also measured binary size and compile time since those are important for developers:

Binary size comparison

![image](https://github.com/user-attachments/assets/ceef5073-1036-47f6-b9dc-cea088beda51)

```
# Original
-rwxr-xr-x 1 ahmads users 307193112 Mar  6 08:46 ./torch/lib/libtorch_cuda.so

# This PR
-rwxr-xr-x 1 ahmads users 307193112 Mar  6 08:46 ./torch/lib/libtorch_cuda.so
```

The diff in bytes is 302kB which is about a 0.1% increase.

Compile time difference:

```
# Original

real    0m10.931s
user    0m9.676s
sys     0m1.004s

# this PR

real    0m16.720s
user    0m15.514s
sys     0m1.066s

# Command I ran
time /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASH_NAMESPACE=pytorch_flash -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUNFUSE_FMA -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/third_party/flash-attention/csrc/flash_attn/src -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. 
-I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda 
-DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/layer_norm_kernel.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o

```

So this PR adds about 6 seconds of compile time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150625
Approved by: https://github.com/ngimel, https://github.com/atalman
2025-04-08 02:39:41 +00:00
c0991b0316 README: anaconda license violation / no longer recommend anaconda since it's no longer free to use (#150619)
hello,

I was going over the documentation to build pytorch from source.
Unfortunately, the first thing that comes up is that you strongly recommend using anaconda, which shouldn't be used because it's no longer free to use.
Could you please remove that from the doc?

I don't know if you are aware but anaconda is no longer free.
They changed their terms of service in 2020 to restrict commercial usage.
They changed their terms of service in 2024 to forbid downloading anaconda and forbid education and non-profit usage too.
The download is open and doesn't require any registration, but if you download anaconda they will sue you ^^

They started raining lawsuits against users since last year. You may have heard about anaconda vs intel in the news. They started another 5 or so in the last few months.
https://www.reuters.com/legal/litigation/intel-sued-copyright-infringement-over-ai-software-2024-08-09/

You may need to adjust more doc and adjust your build system. The free to use alternatives are miniforge with the conda-forge channel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150619
Approved by: https://github.com/seemethere
2025-04-08 02:10:31 +00:00
d7f3cd0ac3 Add Half support for weight_norm on CPU (#148878)
Fixes #148867.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148878
Approved by: https://github.com/leslie-fang-intel, https://github.com/cyyever, https://github.com/albanD
2025-04-08 01:12:29 +00:00
5228986c39 [CUDA] Only use vec128 if CUDA version is newer than 12.8 (#150705)
Addresses feedback requested at https://github.com/pytorch/pytorch/pull/145746
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150705
Approved by: https://github.com/atalman
2025-04-08 00:46:13 +00:00
e9e5682a4a [ROCm] Build Pytorch extensions with amdclang++ (#150451)
The following modifications were made to cpp_extension.py:
1) Changed the compiler flag to use --version.
2) Added a feature to convert the alphanumeric version string returned by the compiler to a numeric string. This was the source of the error, as the parser was failing on the alphanumeric version string.
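
The second change can be pictured with a small standalone helper. This is a hypothetical sketch, not the code added to cpp_extension.py: `numeric_version` and the sample version strings are invented for illustration.

```python
import re


def numeric_version(version: str) -> tuple:
    """Keep only the leading digits of each dot-separated component.

    Parsing stops at the first component that has no leading digits.
    """
    parts = []
    for component in version.split("."):
        match = re.match(r"\d+", component)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


# Made-up examples of alphanumeric version strings.
for raw in ("17.0.0git", "6.3.1-rc2", "12.2"):
    print(raw, "->", numeric_version(raw))
```

Returning a tuple of integers keeps comparisons numeric, so `(17, 0, 0) > (9, 0, 0)` holds where a string comparison would not.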

Built with the following PyTorch extensions: Apex, TorchVision, TorchAudio & DeepSpeed.
Unit tested with the following PyTorch extensions: Apex, TorchVision.

(cherry picked from commit c873aeac35851a7d5000eb7f24561d3f56c2ffbd)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150451
Approved by: https://github.com/jeffdaily
2025-04-07 23:31:29 +00:00
91173ff89a Fixing NCCL abort hang issue when a ProcessGroupNCCL manages multiple ncclComms (#150690)
Detail of the issue:

If PyTorch issues send/recv to each 2-rank comm, and these comms are managed by a single ProcessGroupNCCL instance, then the comms need to abort either in sequence or in a group.

I.e., the following sequential abort will cause a hang in NCCL:
recv(..., comm0, stream);
send(..., comm1, stream);
abort(comm1);
abort(comm0);

Fixes #119797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150690
Approved by: https://github.com/kwen2501
2025-04-07 23:20:49 +00:00
6ea5514e04 [invoke_subgraph] Lazy backward (#150666)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150666
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2025-04-07 22:44:43 +00:00
78fe079c97 Support having no metadata file for HuggingFaceStorageReader (#150701)
Summary: If there is only one safetensors file, we don't need users to have a metadata file and we can just construct it from the keys of that file. This is a use-case for some HuggingFace models, so adding support for it
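
To make "construct it from the keys of that file" concrete: a safetensors file begins with an 8-byte little-endian header length followed by a JSON header that maps tensor names to their dtype, shape, and data offsets. The sketch below reads those names with the standard library only. It is not the `HuggingFaceStorageReader` implementation, and `safetensors_keys` is a hypothetical helper.

```python
import io
import json
import struct


def safetensors_keys(stream) -> list:
    """Return the tensor names listed in a safetensors header."""
    (header_len,) = struct.unpack("<Q", stream.read(8))  # little-endian u64
    header = json.loads(stream.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return sorted(k for k in header if k != "__metadata__")


# Build a tiny in-memory file holding one float32 tensor of two elements.
header = json.dumps(
    {"weight": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]}}
).encode("utf-8")
blob = struct.pack("<Q", len(header)) + header + bytes(8)

print(safetensors_keys(io.BytesIO(blob)))
```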

Test Plan:
ensure existing tests pass
tested e2e in a notebook

Differential Revision: D72472490

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150701
Approved by: https://github.com/joecummings
2025-04-07 22:10:39 +00:00
fbccbfedaf [BE] Fix Amp.metal compilation warning (#150783)
Deleting unused `uint tid` fixes
```
[114/1416] Compiling /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Amp.metal to Amp_30.air
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Amp.metal:70:10: warning: unused parameter 'tid' [-Wunused-parameter]
    uint tid [[thread_position_in_grid]]) {
         ^
1 warning generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150783
Approved by: https://github.com/wdvr, https://github.com/atalman
2025-04-07 22:05:00 +00:00
eba05e2d3e [AO] Refactor convert and add QuantAffinePlaceholderObserver (#150644)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150644
Approved by: https://github.com/jerryzh168
ghstack dependencies: #150642, #150643
2025-04-07 20:52:45 +00:00
5653fb3525 [AO] Add Moving Average Affine Observer (#150643)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150643
Approved by: https://github.com/jerryzh168
ghstack dependencies: #150642
2025-04-07 20:52:45 +00:00
ed0dea3e24 [AO] update port_metadata_pass to support quant_affine ops (#150642)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150642
Approved by: https://github.com/jerryzh168
2025-04-07 20:52:44 +00:00
bf1132c196 Revert "Generalize poison fork logic for each device backend (#144664)"
This reverts commit d86c14156d875b782b82dda96842a1f77910f010.

Reverted https://github.com/pytorch/pytorch/pull/144664 on behalf of https://github.com/atalman due to failing periodic test: python test/test_cpp_extensions_mtia_backend.py TestCppExtensionMTIABackend.test_device_context ([comment](https://github.com/pytorch/pytorch/pull/144664#issuecomment-2784506104))
2025-04-07 20:09:53 +00:00
f8b53f4a75 [export] raise when Dim.DYNAMIC 0/1 specializes (#150716)
Previously we didn't catch this; mark_dynamic() simply doesn't allocate a symbol for it.

Differential Revision: D72486930

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150716
Approved by: https://github.com/angelayi
2025-04-07 18:58:42 +00:00
2a1e2b88ed [logging] Add pgo remote get/put timings to dynamo_compile (#150322)
Test Plan: https://fburl.com/scuba/dynamo_compile/sandbox/xf950tw8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150322
Approved by: https://github.com/ppanchalia
2025-04-07 18:08:26 +00:00
6fcffd8cd1 Optimize SVE embedding performance (#150176)
Change the loop unrolling strategy. Previously, the script only unrolled the inner loop over block_size when the block size was a multiple of the vector length. This version instead unrolls the outer loop, which reduces the number of loads/stores for accumulation into the output array and improves performance when the block size is not a multiple of the vector length.
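
The effect of unrolling the outer loop can be shown with a scalar Python sketch. It is only an illustration of the access pattern, not the generated SVE kernel: `bag_sum_unrolled` is a hypothetical function and the unroll factor of 2 is chosen for brevity. Each output element is read and written once per pair of rows instead of once per row.

```python
def bag_sum_unrolled(weights, indices):
    """Sum the rows `weights[i]` for i in `indices`, two rows at a time."""
    dim = len(weights[0])
    out = [0.0] * dim
    n = len(indices)
    i = 0
    # Outer loop (over rows of the bag) unrolled by 2: one read-modify-write
    # of out[j] covers two input rows, whatever the block size `dim` is.
    while i + 1 < n:
        row0 = weights[indices[i]]
        row1 = weights[indices[i + 1]]
        for j in range(dim):
            out[j] += row0[j] + row1[j]
        i += 2
    # Remainder when the bag holds an odd number of rows.
    if i < n:
        row = weights[indices[i]]
        for j in range(dim):
            out[j] += row[j]
    return out


table = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(bag_sum_unrolled(table, [0, 2, 1]))
```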

Benchmarking script:
```python
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
import torch.nn as nn
import numpy as np
import time
import sys

np.random.seed(0)
torch.manual_seed(0)

num_embeddings = 400000
embedding_dim = int(sys.argv[1])
multi_hot = 100
batch_size = 400
nrun = 1000

class SimpleEmbeddingBagModel(nn.Module):
    def __init__(self, num_embeddings, embedding_dim):
        super(SimpleEmbeddingBagModel, self).__init__()

        weights = torch.from_numpy((np.random.random_sample((num_embeddings, embedding_dim)) + 1).astype(np.float32)).to(torch.float16)

        # Defining the EmbeddingBag layer
        self.embedding_bag = torch.nn.EmbeddingBag(num_embeddings, embedding_dim, _weight=weights,
                                                   mode='sum', include_last_offset=True, dtype=torch.float32)

    def forward(self, input, offsets):
        # Forward pass through the EmbeddingBag layer
        result32 = self.embedding_bag(input, offsets, per_sample_weights=None)
        return result32

# Instantiate the model
model = SimpleEmbeddingBagModel(num_embeddings=num_embeddings, embedding_dim=embedding_dim)
model.eval()

# Example input
input_tensor = torch.randint(0, num_embeddings, (batch_size * multi_hot,), dtype=torch.long)

offsets = torch.tensor(range(0, batch_size * multi_hot + 1, multi_hot))

with torch.no_grad():
    # warm up
    output32 = model(input_tensor, offsets)

    ti = time.time_ns()
    for i in range(nrun):
        _ = model(input_tensor, offsets)
    tf = time.time_ns()
    print("{:3d} {:.3E}".format(embedding_dim, (tf-ti)/nrun/1.e6))
```
Speedup on NEOVERSEV1 with 1 thread
![embedding](https://github.com/user-attachments/assets/16e567ed-b9a5-4db3-90b8-dec66d5414a7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150176
Approved by: https://github.com/digantdesai, https://github.com/malfet
2025-04-07 18:01:54 +00:00
7d2411d30e [DCP][OSS] Introduce barrier util in the DistWrapper for rank local checkpointing (#150748)
Summary: Introduce barrier util in the DistWrapper for rank local checkpointing. This barrier will be used at the end of the rank local checkpointing to ensure all ranks synchronize.

Test Plan: UTs

Differential Revision: D72541431

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150748
Approved by: https://github.com/MeetVadakkanchery
2025-04-07 17:33:07 +00:00
957faaadca Avoid overflow in vector_norm for scalar input (#144073)
Fixes https://github.com/pytorch/pytorch/issues/143960, where torch.dist gave different results from eager because vector_norm overflowed. Eager mode avoids the overflow for single-element reductions by not computing the power and then the root.
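The overflow can be reproduced with ordinary Python floats. The sketch below is not the PyTorch implementation; `naive_norm` and `safe_norm` are hypothetical helpers that only illustrate why skipping the power and root for a single element avoids the problem.

```python
import math


def naive_norm(values, p=2.0):
    """p-norm computed as (sum |v|^p)^(1/p); overflows for large inputs."""
    total = 0.0
    for v in values:
        try:
            total += abs(v) ** p
        except OverflowError:
            # Python raises instead of returning inf for float power overflow.
            return math.inf
    return total ** (1.0 / p)


def safe_norm(values, p=2.0):
    """Same norm, but a single element is returned as |v| directly."""
    if len(values) == 1:
        return abs(values[0])
    return naive_norm(values, p)


print(naive_norm([1e200]))  # the intermediate 1e400 is not representable
print(safe_norm([1e200]))
print(safe_norm([3.0, 4.0]))
```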
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144073
Approved by: https://github.com/eellison, https://github.com/laithsakka
2025-04-07 17:10:10 +00:00
06e9deabb6 [c10d][fr] Improve FR dump robustness with all watchdog broadcast wait and more frequent store check (#150652)
When debugging missing FR dumps and missing dump logs, I had a couple of initial findings:
1. On the same rank, if a second watchdog timeout triggers on a different PG (or subPG), that watchdog thread will immediately throw an exception instead of sleeping. We want to fix that by still making the watchdog thread wait for 1 min.
2. The FR dump takes about 900ms to 1200ms, so we are not checking the store frequently enough. But instead of changing the frequency from 1sec to 300ms, we finally decided to let all ranks sleep for 1 min universally rather than using a promise.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150652
Approved by: https://github.com/kwen2501
2025-04-07 16:33:27 +00:00
56ab71de98 [ROCm] Expand workspace size for gfx95 (#150632)
Use same workspace size for gfx95* as gfx94*

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150632
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-04-07 16:05:56 +00:00
0ad2c5d7e2 Add RECORD_FUNCTION for AOTI (#150150)
Only add RECORD_FUNCTION for shim_fn for now.
The next step is to add RECORD_FUNCTION for all the aoti_torch_* functions.

Fixes https://github.com/pytorch/pytorch/issues/148650

Some code gen by aoti
```c++
    AtenTensorHandle buf1_handle;
    AtenTensorHandle buf2_handle;
    AtenTensorHandle buf3_handle;
    AtenTensorHandle buf4_handle;
    {RECORD_FUNCTION("aoti_torch_cpu__embedding_bag", c10::ArrayRef<c10::IValue>());AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_cpu__embedding_bag(L__self___sparse_arch_embedding_bag_collection_embedding_bags_t_cat_0_weight, arg80_1, arg81_1, 0, 0L, 0, nullptr, 1, -1L, &buf1_handle, &buf2_handle, &buf3_handle, &buf4_handle));}
    RAIIAtenTensorHandle buf1(buf1_handle);
    RAIIAtenTensorHandle buf2(buf2_handle);
    RAIIAtenTensorHandle buf3(buf3_handle);
    RAIIAtenTensorHandle buf4(buf4_handle);
    arg80_1.reset();
    arg81_1.reset();
```

On trace
```
{
  "name": "aoti_torch_cpu__embedding_bag",
  "ph": "X",
  "ts": 68874.450000,
  "dur": 361.291000,
  "tid": 2,
  "pid": "CPU Functions",
  "args": {}
},
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150150
Approved by: https://github.com/desertfire, https://github.com/EikanWang
2025-04-07 15:12:29 +00:00
f813d64f54 cpp_wrapper: Fix even more tests (#147225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147225
Approved by: https://github.com/desertfire
ghstack dependencies: #150671, #150672
2025-04-07 14:20:06 +00:00
f0abbabac1 AOTI fallback ops: sort alphabetically (#150672)
This is just a housekeeping task that makes the listed fallback op order match what's in the generated C shim files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150672
Approved by: https://github.com/desertfire
ghstack dependencies: #150671
2025-04-07 14:20:06 +00:00
5e3c8214b5 cpp_wrapper: Re-enable code disabled for forward compatibility (#150671)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150671
Approved by: https://github.com/desertfire
2025-04-07 14:20:06 +00:00
99c9a31386 [submodule] [Snapshot/Profiler] Memory Snapshot On Demand (#150559)
Summary:
Profiler side of memory snapshot.

1. Add API to actually do snapshot when client interface is called
2. Add ifdefs to builds so that kineto hooks snapshot correctly.

Design Philosophy: There is one interesting part of this implementation, and it is during export. For export we are calling the Python impl of the export rather than the CPP one even though we are already in CPP. This is because it is better to simply have one export path rather than two. Personally, I want there to be parity between auto-trace and on-demand, so if we can limit the side paths then we will have an easier time maintaining this relationship.

Test Plan: {F1976563426}

Reviewed By: sanrise

Differential Revision: D70733247

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150559
Approved by: https://github.com/sanrise
2025-04-07 13:04:38 +00:00
e209625334 [torchrec] update local_shards_wrapper to latest version (#150469)
Summary: Adding new ops, support for empty shards, and fixed initializations for downstream checkpointing.

Test Plan: buck2 run 'fbcode//mode/dev-nosan' fbcode//torchrec/distributed/tests:test_shards_wrapper

Differential Revision: D72271275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150469
Approved by: https://github.com/XilunWu
2025-04-07 13:00:52 +00:00
cdf3b63e32 Update slow tests (#150283)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150283
Approved by: https://github.com/pytorchbot
2025-04-07 11:49:59 +00:00
25662d38d5 [xla hash update] update the pinned xla hash (#132021)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132021
Approved by: https://github.com/pytorchbot
2025-04-07 11:35:56 +00:00
164d2c887b Add check in test_cow_input to ensure COW data is never changed (#150723)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150723
Approved by: https://github.com/Skylion007
2025-04-07 04:35:00 +00:00
24aadb40fb [precompile] Serialization for GlobalStateGuard (#150636)
Summary: To preserve global state guards we need to make the C++ type serializable. Using JSON because it's easier to do and we don't have a lot of data in global state.
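As a rough Python analogue of the round trip (the real type lives in C++), a small state record can be dumped to JSON and restored. The `GlobalState` class and its three fields are hypothetical and chosen only for illustration; they are not the actual members of GlobalStateGuard.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GlobalState:
    # Hypothetical fields standing in for the guarded global state.
    grad_mode: bool
    deterministic_algorithms: bool
    num_threads: int


def dumps(state: GlobalState) -> str:
    """Serialize the state to a JSON string with stable key order."""
    return json.dumps(asdict(state), sort_keys=True)


def loads(payload: str) -> GlobalState:
    """Rebuild the state from the JSON string produced by dumps()."""
    return GlobalState(**json.loads(payload))


state = GlobalState(grad_mode=True, deterministic_algorithms=False, num_threads=8)
payload = dumps(state)
restored = loads(payload)
print(payload)
print(restored == state)
```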

Test Plan: test_dynamo -k test_global_state_guard_serialization

Differential Revision: D72410611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150636
Approved by: https://github.com/williamwen42
2025-04-07 03:10:03 +00:00
b6929aef08 Fix conv2d strided prologue (#150697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150697
Approved by: https://github.com/drisspg
2025-04-07 02:26:58 +00:00
d86c14156d Generalize poison fork logic for each device backend (#144664)
# Motivation
Generalize the poison_fork code to make it reusable across different devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144664
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-04-07 02:06:21 +00:00
d98575806b Generalize compile collective to avoid cuda-bias (#150405)
Fixes https://github.com/intel/torch-xpu-ops/issues/1527
Let the combination of `compile` and `collective` support more devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150405
Approved by: https://github.com/guangyey, https://github.com/jansel

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-04-07 01:54:20 +00:00
d8d306cbc6 Suppress -Wunused-function for DSA (#150735)
Test Plan: Sandcastle

Reviewed By: dtolnay

Differential Revision: D72458590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150735
Approved by: https://github.com/eqy, https://github.com/cyyever
2025-04-07 01:47:35 +00:00
370ba6b96f [codemod] Fix -Wambiguous-reversed-operator in aten/src/ATen/cuda/tunable/Tunable.h (#150744)
Summary:
`-Wambiguous-reversed-operator` warns about ambiguous reversed operators, e.g. `a < b` and `b > a` are both valid. Such operators are disallowed in C++20. This codemod fixes the warnings.

#buildsonlynotests - If this diff compiles, it works.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Differential Revision: D72535527

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150744
Approved by: https://github.com/drisspg
2025-04-07 01:45:03 +00:00
47b494ef69 Add type hints to _tensor_docs.add_docstr_all (#150715)
There is some sort of bug in `pytype` where if this function doesn't have type hints, `pytype` will spend 10 minutes inferring the types. Not that this matters much for a project not using `pytype`, but it led me to realize that this function could easily be type hinted and is not, so here is a PR adding some type hints.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150715
Approved by: https://github.com/Skylion007
2025-04-06 22:25:34 +00:00
0aaf35310a Overload unary - operator on at::vec::Vectorized to call neg() (#150568)
Makes Vectorized look even more like a scalar type, getting me closer to being able to use the same generic code with scalars and Vectorized (e.g., for sigmoid, which needs `exp(-x)`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150568
Approved by: https://github.com/Skylion007
ghstack dependencies: #150380
2025-04-06 21:12:27 +00:00
912102b4ec Make at::vec::Vectorized ops work with scalars (#150380)
I noticed that I couldn't use `vec::Vectorized` operations with scalars, even though there is an implicit conversion from `T` to `vec::Vectorized<T>`, so I made it work.

Test Plan: Added tests. Reverted vec_base.h, left the new tests in place, and confirmed that new tests don't compile in that state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150380
Approved by: https://github.com/Skylion007
2025-04-06 21:12:27 +00:00
8adfcd35c3 [cuDNN][SDPA] Loosen constraints for GQA for cuDNN Attention (#150337)
cuDNN attention doesn't require key and value tensors to have the same number of heads

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150337
Approved by: https://github.com/drisspg
2025-04-06 20:31:11 +00:00
6a8ab902a2 [AOTI][dashboard] Fix mis-calculated memory compression ratio (#150695)
Summary: https://github.com/pytorch/pytorch/pull/149817 introduced an extra warmup run to compute the AOTI memory compression ratio, but since weights are only loaded once in the AOTI run, the peak memory seen in the extra warmup won't include the weights, which causes an artificially high memory compression ratio. This PR removes that extra warmup run, and calls reset_peak_memory_stats in the proper place instead.
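A small arithmetic sketch shows how a peak that misses the weights inflates the ratio. It assumes the ratio is eager peak memory divided by compiled peak memory; the byte counts and the `compression_ratio` helper are made up for illustration and are not taken from the dashboard code.

```python
GIB = 1024 ** 3


def compression_ratio(eager_peak_bytes, compiled_peak_bytes):
    """Assumed definition: eager peak memory divided by compiled peak memory."""
    return eager_peak_bytes / compiled_peak_bytes


# Hypothetical model: 4 GiB of weights and 1 GiB of activations at peak.
weights = 4 * GIB
activations = 1 * GIB

eager_peak = weights + activations
aoti_peak_with_weights = weights + activations  # peak measured over weight loading
aoti_peak_missing_weights = activations         # peak measured after weights were loaded

correct = compression_ratio(eager_peak, aoti_peak_with_weights)
inflated = compression_ratio(eager_peak, aoti_peak_missing_weights)
print(correct, inflated)
```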

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150695
Approved by: https://github.com/yushangdi
2025-04-06 19:51:22 +00:00
6c38b9be73 [typing] Add type hints to __init__ methods in torch.distributions. (#144197)
Fixes #144196
Extends #144106 and #144110

## Open Problems:

- [ ] Annotating with `numbers.Number` is a bad idea, should consider using `float`, `SupportsFloat` or some `Protocol`. https://github.com/pytorch/pytorch/pull/144197#discussion_r1903324769

# Notes

- `beta.py`: needed to add `type: ignore` since `broadcast_all` is untyped.
- `categorical.py`: converted `else` branches of mutually exclusive arguments to `if` branch[^2].
- ~~`dirichlet.py`: replaced `axis` with `dim` arguments.~~ #144402
- `geometric.py`: converted `else` branches of mutually exclusive arguments to `if` branch[^2].
- ~~`independent.py`: fixed bug in `Independent.__init__` where `tuple[int, ...]` could be passed to `Distribution.__init__` instead of `torch.Size`.~~ **EDIT:** turns out the bug is related to typing of `torch.Size`. #144218
- `independent.py`: made `Independent` a generic class of its base distribution.
- `multivariate_normal.py`: converted `else` branches of mutually exclusive arguments to `if` branch[^2].
- `relaxed_bernoulli.py`: added class-level type hint for `base_dist`.
- `relaxed_categorical.py`: added class-level type hint for `base_dist`.
- ~~`transforms.py`: Added missing argument to docstring of `ReshapeTransform`~~ #144401
- ~~`transforms.py`: Fixed bug in `AffineTransform.sign` (could return `Tensor` instead of `int`).~~ #144400
- `transforms.py`: Added `type: ignore` comments to `AffineTransform.log_abs_det_jacobian`[^1]; replaced `torch.abs(scale)` with `scale.abs()`.
- `transforms.py`: Added `type: ignore` comments to `AffineTransform.__eq__`[^1].
- `transforms.py`: Fixed type hint on `CumulativeDistributionTransform.domain`. Note that this is still an LSP violation, because `Transform.domain` is defined as `Constraint`, but `Distribution.domain` is defined as `Optional[Constraint]`.
- skipped: `constraints.py`, `constraints_registry.py`, `kl.py`, `utils.py`, `exp_family.py`, `__init__.py`.

## Remark

`TransformedDistribution`: `__init__` uses the check `if reinterpreted_batch_ndims > 0:`, which can lead to the creation of `Independent` distributions with only 1 component. This results in awkward code like `base_dist.base_dist` in `LogisticNormal`.

```python
import torch
from torch.distributions import *
b1 = Normal(torch.tensor([0.0]), torch.tensor([1.0]))
b2 = MultivariateNormal(torch.tensor([0.0]), torch.eye(1))
t = StickBreakingTransform()
d1 = TransformedDistribution(b1, t)
d2 = TransformedDistribution(b2, t)
print(d1.base_dist)  # Independent with 1 dimension
print(d2.base_dist)  # MultivariateNormal
```

One could consider changing this to `if reinterpreted_batch_ndims > 1:`.

[^1]: Usage of `isinstance(value, numbers.Real)` leads to problems with static typing, as the `numbers` module is not supported by `mypy` (see <https://github.com/python/mypy/issues/3186>). This results in us having to add type-ignore comments in several places
[^2]: Otherwise, we would have to add a bunch of `type: ignore` comments to make `mypy` happy, as it isn't able to perform the type narrowing. Ideally, such code should be replaced with structural pattern matching once support for Python 3.9 is dropped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144197
Approved by: https://github.com/malfet

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-06 17:50:35 +00:00
49f6cce736 [MPS] grad scaler (#150255)
Fixes #142397

Basic implementation is done. What's left:
- [x] Different dtype/device tensors in the TensorList
- [x] fast path for grouping the foreach kernel
- [x] Tests

Regarding tests, I found some tests in `test/test_torch.py` for GradScaler but I couldn't figure out what is the best way to enable the test for MPS device.

By removing `@onlyNativeDeviceTypes`, one enables the tests for MPS but also enables tests for all other devices which are not included in the native device types. If I put:
`instantiate_device_type_tests(TestTorchDeviceType, globals(), allow_mps=True)`

This enables lots of tests in that class for MPS which were not(?) being tested before? This part needs some clarification

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150255
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-06 17:06:55 +00:00
55e62ff74a bf16 grouped gemm (#150374)
Enabled bf16 grouped gemm with an API similar to _scaled_group_gemm, except without scale and fast accum arguments. All transpose variants are enabled, unlike scaled gemm. Ideally we'd factor out a lot more code from scaled gemm; currently there's a lot of repetition between scaled and non-scaled versions. I factored out only a helper kernel that prepares arguments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150374
Approved by: https://github.com/drisspg
2025-04-06 04:53:24 +00:00
caf8d9bc17 Revert "Fix conv2d strided prologue (#150697)"
This reverts commit 2e4ae2ab41dbe1939bd1ffb427af8e5ea8eaff41.

Reverted https://github.com/pytorch/pytorch/pull/150697 on behalf of https://github.com/ngimel due to breaks rocm build ([comment](https://github.com/pytorch/pytorch/pull/150697#issuecomment-2781218658))
2025-04-06 04:50:15 +00:00
2d98a1caf5 [MTIA] Map names to operand indices when folding submodules (#150692)
When replacing placeholders with getattrs during constant folding, we can have an argument and parameter name mismatch. In fact, there is no guarantee that the parameter name is equivalent to the argument name used in the module call.
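A minimal sketch of the idea, assuming the placeholders of the folded submodule appear in the same order as the operands at the call site. The helper names and the example node names are hypothetical and do not come from the MTIA code.

```python
def operand_index_by_name(placeholder_names):
    """Map each placeholder (parameter) name to its positional operand index."""
    return {name: idx for idx, name in enumerate(placeholder_names)}


def resolve_operands(placeholder_names, call_operands):
    """Pair each placeholder with the call-site operand at the same position.

    Matching by position keeps working when the call site uses different
    names than the submodule's parameters.
    """
    index_of = operand_index_by_name(placeholder_names)
    return {name: call_operands[index_of[name]] for name in placeholder_names}


# The submodule calls its inputs "weight" and "bias"; the caller passes nodes
# with unrelated names, so a lookup by name would find nothing.
placeholders = ["weight", "bias"]
operands = ["linear_w", "linear_b"]
mapping = resolve_operands(placeholders, operands)
print(mapping)
```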

Differential Revision: D72415970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150692
Approved by: https://github.com/jfix71
2025-04-06 03:11:14 +00:00
15768cc34b add unit test for preferred_blas_library settings (#150581)
Follow up to #150212 that was committed without a unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150581
Approved by: https://github.com/atalman, https://github.com/malfet

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-06 01:44:07 +00:00
83b870a28a Fix missing braces for clang CUDA (#150736)
Test Plan: Sandcastle

Differential Revision: D72469764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150736
Approved by: https://github.com/Skylion007
2025-04-06 01:29:59 +00:00
c830c12a87 [MPSInductor] Fix tiled reduction logic (#150737)
In case of tiles, the index must include both reduction dimensions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150737
Approved by: https://github.com/dcci
2025-04-06 00:20:41 +00:00
cfea55dbec [MPS] fix inverse bug for N>1024 (#146754)
Fixes #138200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146754
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-05 21:49:21 +00:00
60a45eb862 [AOTInductor] Introduce MaybeOwningAtenTensorHandle for ConstantMap (#150275)
Summary:
We used RAIIAtenTensorHandle for ConstantMap, where RAIIAtenTensorHandle
is a unique_ptr, indicating that all memory handling is by the
AOTInductor internally.

In this PR, we introduce ConstantAtenTensorHandle which replaces
RAIIAtenTensorHandle. This class holds a raw AtenTensorHandle, and also
owns a RAIIAtenTensorHandle if the user decides to delegate memory
management to AOTInductor.

This is a prerequisite for user-managed buffers. This PR, however, only
introduces this class and makes sure it works with existing AOTInductor
and has default behavior identical to using RAIIAtenTensorHandle.

Test Plan:
Existing tests. No change should be introduced within this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150275
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2025-04-05 06:00:35 +00:00
7ac8186851 [MPSInductor] Speedup sum/prod reductions (#150566)
By using cooperative `simd_sum`/`simd_product` instead of a C-style for loop for threadgroup reductions. This also significantly reduces the amount of shared memory needed to perform those reductions.

Using such reduction increases the `torch.compile` performance for gpt-fast using `stories110M` from 29 tokens/sec to 630 tokens/sec on M4 and changes perf of torch.rand as follows:
|size| before | after |
|------------------------|------------|-------------|
| 512x512         | 202.1       | 131.8       |
| 1024x1024   |   780.6    | 176.9       |
| 2048x2048    |   1423.4       | 339.9      |
| 4096x4097    |    2982.2 | 1047.2      |

Unfortunately, none of the SIMDgroup operations are available for 64-bit integers, but one can simulate the behavior using `simd_shuffle_down` of 64-bit values represented as `int2` types, which yields a reduction in $log_2(threadgroup\\_size)$ steps. [`mlx/kernels/reduction/ops.h`](86389bf970/mlx/backend/metal/kernels/reduction/ops.h (L15-L18)) contains an implementation of such an algorithm, but alas it yields wrong results on M1/M2 (and maybe M3) machines if not all threads in the simdgroup are active, which can be observed by running
```python
import torch
lib=torch.mps.compile_shader("""
kernel void do_sum(device int* out, constant int* in, uint idx [[thread_position_in_grid]]) {
  out[idx] = metal::simd_shuffle_down(in[idx], 8);
}
""")
x=torch.arange(22, device='mps', dtype=torch.int32)
y=torch.empty_like(x)
lib.do_sum(y, x)
print(y)
```
that returns following on M4
```
tensor([ 8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,  0,  0,  0,  0, 0,  0,  0,  0], device='mps:0', dtype=torch.int32)
```
but same kernel running on M1 returns
```
tensor([ 8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 14, 15, 16, 17, 18, 19, 20, 21], device='mps:0', dtype=torch.int32)
```
This discrepancy in behavior can be addressed by using `simd_shuffle_and_fill_down`, but any kernel using `simd_shuffle_and_fill_down` causes an internal compiler error on macOS 13.2. Considering that this OS version is to be EOL soon, skip the offending tests.
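The shuffle-down reduction discussed above can be simulated on the CPU in plain Python. This is a sketch and not Metal code: it assumes a SIMD-group width of 32 lanes and a fill value of 0 for lanes that would read past the end, and the helper names are hypothetical.

```python
def shuffle_and_fill_down(lanes, delta, fill):
    """Lane i reads lane i + delta; out-of-range reads return `fill`."""
    n = len(lanes)
    return [lanes[i + delta] if i + delta < n else fill for i in range(n)]


def simulated_simd_sum(values, width=32):
    """Tree reduction in log2(width) shuffle steps; lane 0 ends up with the sum."""
    values = list(values)
    if len(values) > width:
        raise ValueError("more values than lanes")
    # Inactive lanes are modelled as holding the additive identity.
    lanes = values + [0] * (width - len(values))
    delta = width // 2
    steps = 0
    while delta >= 1:
        shifted = shuffle_and_fill_down(lanes, delta, 0)
        lanes = [a + b for a, b in zip(lanes, shifted)]
        delta //= 2
        steps += 1
    return lanes[0], steps


total, steps = simulated_simd_sum(range(22))
print(total, steps)
```

With well-defined fill values for inactive lanes the result matches an ordinary sum, which is the behavior the M1 output above does not provide for plain `simd_shuffle_down`.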

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150566
Approved by: https://github.com/manuelcandales
ghstack dependencies: #150452, #150457
2025-04-05 02:47:27 +00:00
c14977e91c Use 'rocm' naming for rocm-related workflows/jobs (#150555)
Reduces the number of places in the workflow files that need updating for a ROCm version update

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150555
Approved by: https://github.com/jeffdaily
2025-04-05 02:09:11 +00:00
3320efef6b Refresh expected results. (#150264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150264
Approved by: https://github.com/bobrenjc93
2025-04-05 01:11:19 +00:00
2e4ae2ab41 Fix conv2d strided prologue (#150697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150697
Approved by: https://github.com/drisspg
2025-04-05 00:28:56 +00:00
d6887f444f [Inductor] Fallback embedding when sparse is True (#150659)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/150656, fallback `embedding` when sparse is True.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_torchinductor.py -k test_embedding_sparse
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150659
Approved by: https://github.com/jansel
2025-04-04 23:59:38 +00:00
2e23768d25 Expose symbols on macos in the xplat pytorch stack (#150487)
Summary:
X-link: https://github.com/pytorch/executorch/pull/9819

Had to revert D71321310 because it affected way too many targets and build sizes.

These changes should expose just enough symbols to be buildable in arvr mode on macOS. Could potentially narrow it down even more by avoiding e.g. `get_pt_compiler_flags`

Differential Revision: D72255474

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150487
Approved by: https://github.com/drisspg
2025-04-04 23:03:16 +00:00
2a2ddff214 [Inductor] Fix consolidating _scaled_mm into mm template TMA error (#150686)
Summary: The previous diff broke a few tests that didn't run on internal or GH CI: T220169086. This fixes that issue. Based on other examples, the {% if } block is only supposed to support autotuned parameters (constexpr), and should not be used for locals.

Test Plan: buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:fp8 -- --exact 'caffe2/test/inductor:fp8 - test_tensorwise_scaling_bfloat16_shape_16,32,32_has_bias_False_use_fast_accum_True_persistent_matmul_True (caffe2.test.inductor.test_fp8.TestFP8Lowering)'

Reviewed By: NikhilAPatel

Differential Revision: D72460516

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150686
Approved by: https://github.com/eellison, https://github.com/NikhilAPatel
2025-04-04 22:49:22 +00:00
861d2cc02c Add a param for save format in Storage Writer (#150025)
Summary: Add a param to tell the storage writer how to save tensors. Right now the only options are safetensors and torch.save.

Test Plan:
(lintrunner) [ankitageorge@devgpu003.cco3 /data/users/ankitageorge/fbsource/fbcode/caffe2 (1d57cb27b)]$ buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/distributed/checkpoint:test_hf_storage
File changed: fbcode//caffe2/torch/distributed/checkpoint/filesystem.py
Buck UI: https://www.internalfb.com/buck2/e80cc963-e34a-4876-b6f4-7ce2794e48dd
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3659174965882569
Network: Up: 32KiB  Down: 1.9KiB  (reSessionID-ef9fa764-a40a-451b-ab58-08eabe7a9422)
Executing actions. Remaining     0/4                                                                                             3.4s exec time total
Command: test.     Finished 2 local
Time elapsed: 19.6s
Tests finished: Pass 4. Fail 0. Fatal 0. Skip 0. Build failure 0

Reviewed By: saumishr

Differential Revision: D70271943

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150025
Approved by: https://github.com/saumishr
2025-04-04 17:52:53 +00:00
c53bc616d5 caffe2: Fix lint errors in native/xnnpack/Linear.cpp (#150508)
Summary: See title

Test Plan: Sandcastle

Differential Revision: D72275403

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150508
Approved by: https://github.com/malfet, https://github.com/Skylion007, https://github.com/cyyever
2025-04-04 17:14:43 +00:00
c93e34d7b5 Revert "bound sympy accuracy (#150383)"
This reverts commit 1bc2b2b12ae1ddd27b0401a1baac3b8099b6fc50.

Reverted https://github.com/pytorch/pytorch/pull/150383 on behalf of https://github.com/laithsakka due to big regression ([comment](https://github.com/pytorch/pytorch/pull/150383#issuecomment-2779227548))
2025-04-04 16:26:00 +00:00
f443035f10 Revert "[cuda] Add new faster gammabeta backward kernel (#148605) (Reapply with launch bounds) (#150625)"
This reverts commit c6defa9443d241dd7a0baac4e708b6e906bd012c.

Reverted https://github.com/pytorch/pytorch/pull/150625 on behalf of https://github.com/atalman due to failing internal build ([comment](https://github.com/pytorch/pytorch/pull/150625#issuecomment-2779183414))
2025-04-04 16:05:18 +00:00
07d439e782 [aoti] Split ConstantType definition out of model.h (#150545)
Summary:
Splitting the type definition of ConstantType into a separate header because it's needed by Sigmoid OSS, but including the entire model.h header causes the following compilation error:
```
2025-04-01T18:12:42.0391272Z FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/nativert/kernels/AOTICallDelegateKernel.cpp.o
2025-04-01T18:12:42.0417705Z /opt/cache/bin/sccache /opt/cache/bin/clang++ -DAT_PER_OPERATOR_HEADERS -DBUILD_ONEDNN_GRAPH -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_ENABLE_LLVM -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_RPC -DUSE_TENSORPIPE -DXNN_LOG_LEVEL=0 -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/var/lib/jenkins/workspace/build/aten/src -I/var/lib/jenkins/workspace/aten/src -I/var/lib/jenkins/workspace/build -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/cmake/../third_party/benchmark/include -I/opt/llvm/include -I/var/lib/jenkins/workspace/third_party/onnx -I/var/lib/jenkins/workspace/build/third_party/onnx -I/var/lib/jenkins/workspace/nlohmann -I/var/lib/jenkins/workspace/torch/csrc/api -I/var/lib/jenkins/workspace/torch/csrc/api/include -I/var/lib/jenkins/workspace/caffe2/aten/src/TH -I/var/lib/jenkins/workspace/build/caffe2/aten/src/TH -I/var/lib/jenkins/workspace/build/caffe2/aten/src -I/var/lib/jenkins/workspace/build/caffe2/../aten/src -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/third_party/miniz-3.0.2 -I/var/lib/jenkins/workspace/third_party/kineto/libkineto/include -I/var/lib/jenkins/workspace/third_party/kineto/libkineto/src -I/var/lib/jenkins/workspace/third_party/cpp-httplib -I/var/lib/jenkins/workspace/aten/src/ATen/.. -I/var/lib/jenkins/workspace/third_party/FXdiv/include -I/var/lib/jenkins/workspace/c10/.. 
-I/var/lib/jenkins/workspace/third_party/pthreadpool/include -I/var/lib/jenkins/workspace/third_party/cpuinfo/include -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/include -I/var/lib/jenkins/workspace/third_party/NNPACK/include -I/var/lib/jenkins/workspace/third_party/fbgemm/include -I/
2025-04-01T18:12:42.0444143Z In file included from /var/lib/jenkins/workspace/torch/csrc/nativert/kernels/AOTICallDelegateKernel.cpp:5:
2025-04-01T18:12:42.0445081Z In file included from /var/lib/jenkins/workspace/torch/csrc/nativert/executor/AOTIDelegateExecutor.h:6:
2025-04-01T18:12:42.0446002Z In file included from /var/lib/jenkins/workspace/torch/csrc/nativert/executor/AOTInductorModelImpl.h:5:
2025-04-01T18:12:42.0447549Z /var/lib/jenkins/workspace/torch/csrc/inductor/aoti_runtime/model.h:78:13: error: function 'RAII_cpuMalloc' is not needed and will not be emitted [-Werror,-Wunneeded-internal-declaration]
2025-04-01T18:12:42.0448656Z RAIIDataPtr RAII_cpuMalloc(size_t num_bytes) {
```

model.h defines the RAII_malloc functions directly in an anonymous namespace, which seems pretty sad. We should do something about it, but maybe not in the current diff.

Test Plan: CI

Differential Revision: D72320413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150545
Approved by: https://github.com/desertfire
2025-04-04 15:48:45 +00:00
1b0a023dde [Dynamo][Misc] Apply typing hints for codegen (#150289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150289
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-04-04 14:26:22 +00:00
295b7e21eb [MPS/inductor] Add support for hermite_polynomial_h. (#150664)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150664
Approved by: https://github.com/malfet
2025-04-04 13:14:52 +00:00
09c4da9325 [CUDA][avgpool2d] Fix backward launch bounds again for sm100, sm120 (#150640)
`__CUDA_ARCH__` is not visible in host code, which causes incorrect launch bounds and `too many resources requested for launch` on Blackwell.

CC @atalman @malfet as we would want this in 2.7 @nWEIdia

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150640
Approved by: https://github.com/malfet, https://github.com/drisspg, https://github.com/atalman
2025-04-04 13:05:40 +00:00
73358d37da Fix codegen, change str comparison operator to == for proper equality … (#150611)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150611
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-04-04 09:59:59 +00:00
4854926aeb Revert "Add torch._scaled_mm for CPU (#150410)"
This reverts commit 3b02f795c5ad2339794b15b370c0e4a235d36adf.

Reverted https://github.com/pytorch/pytorch/pull/150410 on behalf of https://github.com/malfet due to It breaks ROCM tests ([comment](https://github.com/pytorch/pytorch/pull/150410#issuecomment-2777704212))
2025-04-04 06:52:54 +00:00
f3cb3557d6 [executorch hash update] update the pinned executorch hash (#149817)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149817
Approved by: https://github.com/pytorchbot
2025-04-04 05:21:44 +00:00
98d06b401b [Dynamo] Fix dict.items() return type (#150112)
Fixes #150110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150112
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-04-04 04:32:13 +00:00
e6e1f8c272 [audio hash update] update the pinned audio hash (#150589)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150589
Approved by: https://github.com/pytorchbot
2025-04-04 04:29:45 +00:00
c6d79c163c [dynamic shapes] allow duck typing for 0/1 (#150222)
Fixes #150184

e.g. for config.backed_size_oblivious=True and compile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150222
Approved by: https://github.com/laithsakka
2025-04-04 03:24:46 +00:00
7df6f930e8 Adapt test_misc.py for HPUs (#149499)
This PR is related to https://github.com/pytorch/pytorch/pull/145476. That PR had two files (test_functions.py and test_misc.py). test_functions.py was causing CI/rebase/merge issues and has been removed for now, so this PR contains only test_misc.py.

This is a continuation of https://github.com/pytorch/pytorch/pull/144387.

## MOTIVATION
We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.

Other accelerators can extend this functionality by adding their device to the devices list (e.g. xpu).

## CHANGES
- Create a separate class for test functions running on CUDA devices
- Extend the functionality of these tests to include HPUs
- Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances within the new classes
- Apply the skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices

PS: Most of these changes were initially part of https://github.com/pytorch/pytorch/pull/147609, but that PR was closed due to merge conflicts. Its review comments have been addressed in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149499
Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/cyyever
2025-04-04 02:47:43 +00:00
ed0fd2fa7a clang-format aten/src/ATen/cpu/vec/*.h (#150426)
I got a complaint about indentation on #150380. Make the machines fix it for us.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150426
Approved by: https://github.com/aditew01, https://github.com/cyyever, https://github.com/frost-intel, https://github.com/Skylion007
2025-04-04 02:41:11 +00:00
bd9c42ebfb [c10d] Surface error type when we unlink and create named pipe for DumpPipe (#150648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150648
Approved by: https://github.com/fegin, https://github.com/kwen2501
2025-04-04 02:12:32 +00:00
a9e2f22405 [Bugfix] Fix compile error with torch.Tensor.unsqueeze_ and inplace views called from Tensor Class (#150573)
Fixes #129673

### Summary:
Modifying a tensor by reshaping it in place (such as with `unsqueeze_`) should cause a graph break; however, when the method was accessed through the `torch.Tensor` API rather than as an attribute of the tensor itself, the code crashed with an error (see the attached issue).

The two paths differ when traced because of the variable popped from the stack:
* `self.unsqueeze_` pops a `LazyVariableTracker`, which gets resolved to a `TensorVariable`, so looking up the method triggers the call to `var_getattr` in `_dynamo/variables/tensor.py`; since this is an in-place view (metadata mutation) on a graph input, it is not well supported and should fall back (see [L446](1017927c83/torch/_dynamo/variables/tensor.py (L446)) in that file)
* `torch.Tensor.unsqueeze_` pops a `UserDefinedClassVariable`, so looking up the method triggers the call to `var_getattr` in `_dynamo/variables/user_defined.py` on [L273](a8f6b40e36/torch/_dynamo/variables/user_defined.py (L273)). This path tries to build a variable tracker from the popped object, which resolves to a trace rule and, as a Tensor method, to `TorchInGraphFunctionVariable` on [L3767](a8f6b40e36/torch/_dynamo/trace_rules.py (L3767))

So, one straightforward option is to check in `torch.py`, when resolving `call_function` for the `TorchInGraphFunctionVariable`, whether the fn is an in-place view on an input tensor. This fixes the bug by producing a graph break instead of a crash.

### Test
```
pytest test/dynamo/test_functions.py::FunctionTests::test_unsqueeze_inplace
```

Results in
```
Running 1 items in this shard

test/dynamo/test_functions.py .                                                                                                                                                                    [100%]

=========================================================================================== 1 passed in 9.16s ==========================================================================================
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150573
Approved by: https://github.com/anijain2305
2025-04-04 01:58:34 +00:00
1979a409e9 Make CompileEventLogger more defensive w.r.t. AOTAutogradCache and FXGraphCache (#150423)
This PR makes it so that we don't crash due to logging if we invoke AOTAutogradCache/FXGraphCache without using dynamo. This is preparation for supporting certain vLLM use cases where they store graph modules and have special handling in conjunction with the caches.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150423
Approved by: https://github.com/oulgen
2025-04-04 01:55:13 +00:00
f9f6c080d8 support guard or false/true in user code and add tests (#150178)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150178
Approved by: https://github.com/pianpwk
2025-04-04 01:19:14 +00:00
d0026fa138 [ROCm][TunableOp] Fix UT race condition and reduce UT duration. (#150463)
This PR fixes two race conditions that occur when the unit tests (UTs) are run:
- In a particular order within a single shard.
- Concurrently in multiple shards. Each test now gets a unique filename that depends on the test name.
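
A minimal sketch of the per-test filename idea, assuming the filename is derived from the test ID. The helper name `tunableop_results_path`, the filename pattern, and the test IDs are hypothetical and not taken from the PR:

```python
import os
import re
import tempfile
from typing import Optional


def tunableop_results_path(test_id: str, directory: Optional[str] = None) -> str:
    """Return a results filename that is unique to one test.

    Hypothetical helper: two tests never share a file, so they cannot
    clobber each other's results whether they run back to back in one
    shard or concurrently in several shards.
    """
    # Replace characters that are awkward in filenames.
    safe_name = re.sub(r"[^A-Za-z0-9_.-]", "_", test_id)
    base = directory if directory is not None else tempfile.gettempdir()
    return os.path.join(base, f"tunableop_results_{safe_name}.csv")


# Illustrative test IDs only.
path_a = tunableop_results_path("TestLinalg.test_bmm_tunableop_rocm")
path_b = tunableop_results_path("TestLinalg.test_matmul_offline_mgpu")
print(path_a)
print(path_b)
```

Because the name depends only on the test ID, a test can recompute the same path when it cleans up after itself.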

There were two other minor improvements to the UTs:
- matmul_offline_mgpu could occasionally fail if run on 8 GPUs. The pass criteria were relaxed.
- bmm_tunableop_rocm checks that the rotating buffer is not zero. Otherwise, the test is not useful.

Additionally, several UTs took over 1 minute to run. Their duration was reduced by a combination of setting max tuning iterations to one, setting the rotating buffer size to zero, and/or reducing the matrix dimensions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150463
Approved by: https://github.com/jeffdaily
2025-04-04 01:12:03 +00:00
1bc2b2b12a bound sympy accuracy (#150383)
Differential Revision: D72215735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150383
Approved by: https://github.com/pianpwk
2025-04-04 00:15:32 +00:00
b0e28f60df Revert "add unit test for preferred_blas_library settings (#150581)"
This reverts commit 781d28e2655f88ae2fef827ed110f22ed553a0ab.

Reverted https://github.com/pytorch/pytorch/pull/150581 on behalf of https://github.com/clee2000 due to new test broken internally D72395624 ([comment](https://github.com/pytorch/pytorch/pull/150581#issuecomment-2777228731))
2025-04-03 23:51:49 +00:00
1ab6c4ff04 [Codemod][AddExplicitStrictExportForTrainingInferenceArg] caffe2/ (#149595)
internal diff: D71497480

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149595
Approved by: https://github.com/Skylion007
2025-04-03 23:50:13 +00:00
8878289f89 [aten] 8 bytes aligned vector loads for bf16 and fp16 dtypes in torch.cat (#150233)
Enable aligned vector loading for 2-byte data types in torch.cat. Specifically:
1. reduce the vector length to 8 bytes for 2-byte types (fp16, bf16, etc.)
2. enable this through a conditional template

The reason 8-byte vector loading was chosen for fp16 and bf16:
A 16-byte load results in heavier register overhead (i.e. 4 registers per load for fp32 -> 8 registers per load for fp16). Therefore, to keep the benefits of vectorized loading, we reduced ALIGNED_VEC_LOAD_BYTES to 8 for fp16 and bf16.
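
The arithmetic behind that choice can be sketched as follows. This is an illustrative calculation only; the helper name `elements_per_load` is hypothetical, and it simply counts how many elements one aligned vector load carries:

```python
def elements_per_load(vec_bytes: int, elem_bytes: int) -> int:
    """Number of elements carried by one aligned vector load."""
    if vec_bytes % elem_bytes != 0:
        raise ValueError("vector width must be a multiple of the element size")
    return vec_bytes // elem_bytes


fp32_with_16_bytes = elements_per_load(16, 4)  # fp32 with the 16-byte width
fp16_with_16_bytes = elements_per_load(16, 2)  # fp16/bf16 with the 16-byte width
fp16_with_8_bytes = elements_per_load(8, 2)    # fp16/bf16 with the reduced width

print(fp32_with_16_bytes, fp16_with_16_bytes, fp16_with_8_bytes)
```

With the reduced width, a 2-byte type carries the same number of elements per load as fp32 does at 16 bytes.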

### perf testing:

before:
```
torch-cat-D1-30108-D2-624-D3-772-dtype-torch.float32:
         B  pt_eager      copy
0    100.0  0.022621  0.036162
1   1000.0  0.133616  0.207051
2  10000.0  1.326848  1.848768
3  20000.0  2.744544  3.692128
torch-cat-D1-30108-D2-624-D3-772-dtype-torch.bfloat16:
         B  pt_eager      copy
0    100.0  0.022434  0.035477
1   1000.0  0.140608  0.144518
2  10000.0  1.303792  1.229584
3  20000.0  2.668288  2.436160
```

after:
```
torch-cat-D1-30108-D2-624-D3-772-dtype-torch.float32:
         B  pt_eager      copy
0    100.0  0.022608  0.036328
1   1000.0  0.133861  0.207399
2  10000.0  1.325120  1.847136
3  20000.0  2.726528  3.693184
torch-cat-D1-30108-D2-624-D3-772-dtype-torch.bfloat16:
         B  pt_eager      copy
0    100.0  0.019942  0.035482
1   1000.0  0.084858  0.144544
2  10000.0  0.924384  1.230672
3  20000.0  1.944448  2.436480

```

### bw analysis:
Bandwidth for fp16/bf16 increased by 40%-50% for large tensors.

before:
```
Bandwidth (GB/s) for ((16384, 16384), 1) int8;fp16;fp32;int32;fp64;long|869.87|1382.74|1956.46|1952.73|1969.03|1963.66
Bandwidth (GB/s) for ((4194304,), 0) int8;fp16;fp32;int32;fp64;long|568.43|926.53|1589.20|1567.52|1771.54|1783.68
Bandwidth (GB/s) for ((16777216,), 0) int8;fp16;fp32;int32;fp64;long|752.07|1269.50|1894.86|1900.85|1954.10|1955.08
Bandwidth (GB/s) for ((33554432,), 0) int8;fp16;fp32;int32;fp64;long|807.08|1354.69|1960.48|1962.45|1972.73|1973.85
Bandwidth (GB/s) for ((134217728,), 0) int8;fp16;fp32;int32;fp64;long|864.02|1398.02|1963.43|1955.32|1963.37|1969.96
```

after:
```
Bandwidth (GB/s) for ((16384, 16384), 1) int8;fp16;fp32;int32;fp64;long|873.08|1892.16|1954.35|1962.51|1962.03|1965.98
Bandwidth (GB/s) for ((4194304,), 0) int8;fp16;fp32;int32;fp64;long|575.13|1242.45|1576.37|1571.30|1769.94|1790.22
Bandwidth (GB/s) for ((16777216,), 0) int8;fp16;fp32;int32;fp64;long|742.92|1734.57|1887.99|1897.62|1940.99|1959.25
Bandwidth (GB/s) for ((33554432,), 0) int8;fp16;fp32;int32;fp64;long|802.60|1865.45|1952.64|1947.53|1974.47|1973.48
Bandwidth (GB/s) for ((134217728,), 0) int8;fp16;fp32;int32;fp64;long|865.32|1939.07|1965.72|1963.25|1969.06|1968.72
```
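
To read the fp16 gain off the two tables directly, the ratio can be computed per shape. This is a small sketch that copies the fp16 column from the before/after listings above; the dictionary and function names are hypothetical:

```python
# fp16 bandwidth in GB/s, copied from the "before" and "after" tables.
FP16_BEFORE = {
    "((16384, 16384), 1)": 1382.74,
    "((4194304,), 0)": 926.53,
    "((16777216,), 0)": 1269.50,
    "((33554432,), 0)": 1354.69,
    "((134217728,), 0)": 1398.02,
}
FP16_AFTER = {
    "((16384, 16384), 1)": 1892.16,
    "((4194304,), 0)": 1242.45,
    "((16777216,), 0)": 1734.57,
    "((33554432,), 0)": 1865.45,
    "((134217728,), 0)": 1939.07,
}


def bandwidth_gain(before, after):
    """Return after/before for every shape present in both tables."""
    return {shape: after[shape] / before[shape] for shape in before}


gains = bandwidth_gain(FP16_BEFORE, FP16_AFTER)
for shape, ratio in gains.items():
    print(f"{shape}: {ratio:.3f}x")
```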

### Perf testing code:

```
# pyre-strict
from typing import List, Optional, Tuple

import click
import pandas as pd

import torch

# @manual=//triton:triton
import triton

# CUDA_VISIBLE_DEVICES=7 buck2 run @mode/opt //scripts/zhaozhu:cat_bench

@click.command()
@click.option("--data-type", type=str, default="bf16")
@click.option("--return-result", type=bool, default=False)
def main(
    data_type: str,
    return_result: bool,
) -> Optional[Tuple[List[triton.testing.Benchmark], List[pd.DataFrame]]]:
    torch.backends.cudnn.allow_tf32 = True
    torch.backends.cuda.matmul.allow_tf32 = True
    if data_type == "fp32":
        dtype = torch.float32
    elif data_type == "fp16":
        dtype = torch.float16
    elif data_type == "bf16":
        dtype = torch.bfloat16
    else:
        raise ValueError(f"Unsupported data type: {data_type}.")

    D1 = int(torch.randint(low=10000, high=50000, size=(1,)).item())
    D2 = int(torch.randint(low=100, high=1000, size=(1,)).item())
    D3 = int(torch.randint(low=500, high=1000, size=(1,)).item())

    configs: List[triton.testing.Benchmark] = [
        triton.testing.Benchmark(
            x_names=["B"],
            x_vals=[100, 1000, 10000, 20000],
            line_arg="provider",
            line_vals=["pt_eager", "copy"],
            line_names=["pt_eager", "copy"],
            styles=[("blue", "-"), ("green", "-"), ("red", "-")],
            ylabel="ms",
            plot_name=f"torch-cat-D1-{D1}-D2-{D2}-D3-{D3}-dtype-{dtype}",
            args={
                "D1": D1,
                "D2": D2,
                "D3": D3,
                "dtype": dtype,
            },
        )
    ]

    @triton.testing.perf_report(configs)
    def bench_cat(
        B: int,
        D1: int,
        D2: int,
        D3: int,
        dtype: torch.dtype,
        provider: str,
    ) -> float:
        warmup = 10
        rep = 3

        tensors = []

        a = torch.empty(
            # (B, 30108),
            (B, D1),
            dtype=dtype,
            device=torch.device("cuda"),
        ).uniform_(-1.0, 1.0)
        b = torch.empty(
            # (B, 624),
            (B, D2),
            dtype=dtype,
            device=torch.device("cuda"),
        ).uniform_(-1.0, 1.0)
        c = torch.empty(
            # (B, 772),
            (B, D3),
            dtype=dtype,
            device=torch.device("cuda"),
        ).uniform_(-1.0, 1.0)

        tensors = [a, b, c]

        total_cols: int = int(a.shape[1] + b.shape[1] + c.shape[1])

        def torch_copy(
            tensors: List[torch.Tensor], is_inplace: bool = True
        ) -> torch.Tensor:
            f = torch.zeros([B, total_cols], dtype=dtype, device=torch.device("cuda"))
            col_idx = 0
            for t in tensors:
                temp = f[:, col_idx : col_idx + t.shape[1]]
                if is_inplace:
                    temp.copy_(t)
                else:
                    f[:, col_idx : col_idx + t.shape[1]] = t
                col_idx += t.shape[1]
            return f

        def torch_cat(tensors: List[torch.Tensor]) -> torch.Tensor:
            return torch.cat(tensors, dim=1)

        ref = torch_cat(tensors)
        real = torch_copy(tensors, is_inplace=False)

        torch.testing.assert_allclose(ref, real)

        if provider == "pt_eager":
            fn = lambda: torch_cat(tensors)  # noqa E731
            ms = triton.testing.do_bench(fn, warmup=warmup, rep=rep)
            return ms
        elif provider == "stack":

            def torch_stack(tensors: List[torch.Tensor]) -> torch.Tensor:
                return torch.stack(tensors, dim=1).view(-1, total_cols)

            fn = lambda: torch_stack(tensors)
            ms = triton.testing.do_bench(fn, warmup=warmup, rep=rep)
            return ms
        elif provider == "copy":
            fn = lambda: torch_copy(tensors)
            ms = triton.testing.do_bench(fn, warmup=warmup, rep=rep)
            return ms
        else:
            raise ValueError(f"unsupported provider: {provider}")

    df = bench_cat.run(print_data=True, return_df=return_result)

    if return_result:
        return configs, df

if __name__ == "__main__":
    main()
```

The bandwidth analysis code is from: https://github.com/pytorch/pytorch/pull/102815#issue-1737409146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150233
Approved by: https://github.com/ngimel
2025-04-03 23:40:18 +00:00
5cf3029503 Remove unused rand call if not fallback to eager for rand (#147790)
Fixes #147171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147790
Approved by: https://github.com/eellison
2025-04-03 23:27:03 +00:00
118e3862bc [dynamo] disable new test_assert_failure_in_generic_ctx_mgr internally (#150631)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150631
Approved by: https://github.com/clee2000
ghstack dependencies: #150471
2025-04-03 23:08:25 +00:00
a2dce42654 Split up cub-RadixSortPairs.cu to parallelize compilation (#148936)
Summary: `cub-RadixSortPairs.cu` has slow compilation times, especially on Windows. These changes split up the file into smaller components to allow each component to compile in parallel. On Windows, I observed a compile time drop from about 20 minutes to 6 minutes.

Differential Revision: D70539649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148936
Approved by: https://github.com/suo, https://github.com/eqy, https://github.com/malfet
2025-04-03 23:04:21 +00:00
c0618a3957 Update commitlist.py instructions for the GitHub repo regime (#149535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149535
Approved by: https://github.com/albanD
2025-04-03 22:43:00 +00:00
76994d48f4 [pytorch] add experimental TORCH_LIBRARY_THREAD_UNSAFE_LAZY_INIT (#150537)
Summary: Add an experimental feature to defer PyTorch library initialization cost until after startup. As noted, this feature is not thread safe: the client must maintain thread safety at library load time.

Reviewed By: zou3519

Differential Revision: D71917841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150537
Approved by: https://github.com/zou3519
2025-04-03 22:36:17 +00:00
9e55dae2a6 CUDA CachingHostAllocator tracks registrations to call correct free (#146520)
Allocations made with cudaHostRegister must be released with the corresponding cudaHostUnregister, and likewise cudaHostAlloc must be paired with cudaFreeHost.  In test_cuda.py, the allocator config changes from test to test, but the cache is not emptied before the config changes.  This results in the wrong free being called later.  Unit test sharding avoids this issue, but running test_cuda.py in a single shard fails.

The following reproducer demonstrates the problem.

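```C++
#include <cassert>
#include <cstdlib>

#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    void *ptr;
    // Pinned memory obtained from cudaHostAlloc must be released with cudaFreeHost.
    assert(cudaSuccess == cudaHostAlloc(&ptr, 1024, cudaHostAllocDefault));
    // Wrong pairing: ptr was never registered with cudaHostRegister, so this fails.
    assert(cudaSuccess == cudaHostUnregister(ptr));
    std::free(ptr);
    return 0;
}
```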

The above code results in the following failure because the ptr is an invalid argument to cudaHostUnregister.

```
a.out: test.cpp:53: int main(int, char**): Assertion `cudaSuccess == cudaHostUnregister(ptr)' failed.
```
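
The fix described in the title can be modelled in a few lines: remember how each block was obtained and pick the release routine from that record rather than from the current config. This is a plain-Python model under that assumption, not the PyTorch implementation; the class and member names are hypothetical, and the CUDA calls appear only as strings:

```python
from enum import Enum


class HostAllocKind(Enum):
    HOST_ALLOC = "cudaHostAlloc"        # must be released with cudaFreeHost
    HOST_REGISTER = "cudaHostRegister"  # must be released with cudaHostUnregister


class TrackingHostAllocator:
    """Toy allocator that records which API produced each pointer."""

    def __init__(self):
        self._kind_by_ptr = {}

    def record(self, ptr: int, kind: HostAllocKind) -> None:
        self._kind_by_ptr[ptr] = kind

    def release(self, ptr: int) -> str:
        # The release routine depends on how the block was created,
        # not on whatever the allocator config says right now.
        kind = self._kind_by_ptr.pop(ptr)
        if kind is HostAllocKind.HOST_ALLOC:
            return "cudaFreeHost"
        return "cudaHostUnregister"


allocator = TrackingHostAllocator()
allocator.record(0x1000, HostAllocKind.HOST_ALLOC)
# Imagine the allocator config switching to registration here; the block
# recorded above still remembers that it came from cudaHostAlloc.
allocator.record(0x2000, HostAllocKind.HOST_REGISTER)

first_release = allocator.release(0x1000)
second_release = allocator.release(0x2000)
print(first_release, second_release)
```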

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146520
Approved by: https://github.com/ngimel
2025-04-03 22:33:48 +00:00
c6defa9443 [cuda] Add new faster gammabeta backward kernel (#148605) (Reapply with launch bounds) (#150625)
# Changes over the previous PR

This reverts commit 61a1f09 and adds `__launch_bounds__` to the kernel.

Previously I merged 114d404 that did not work on Blackwell because it consumed too many registers. It got reverted in 61a1f09. For more context see: https://github.com/pytorch/pytorch/issues/150266.

This PR reverts the revert (i.e. reapplies the original diff), with one additional line with `__launch_bounds__` added:

```
git diff HEAD^
diff --git a/aten/src/ATen/native/cuda/layer_norm_kernel.cu b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
index 0d63a2f979c..3ce2c24c18e 100644
--- a/aten/src/ATen/native/cuda/layer_norm_kernel.cu
+++ b/aten/src/ATen/native/cuda/layer_norm_kernel.cu
@@ -657,6 +657,7 @@ bool aligned_grid
 >
 __global__
 void
+__launch_bounds__(block_dim_x * block_dim_y)
  GammaBetaBackwardCUDAKernelTemplate(
     int64_t M,
     int64_t N,
```
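
A rough way to see why the bound matters. A launch fails with `too many resources requested for launch` when registers per thread times threads per block exceeds the registers available to a block, and `__launch_bounds__` tells the compiler which block size the kernel must fit. The sketch below assumes 65,536 32-bit registers per block and a cap of 255 registers per thread (typical figures for recent NVIDIA GPUs, not taken from this PR), ignores register allocation granularity, and uses hypothetical block dimensions:

```python
REGISTERS_PER_BLOCK = 65536      # assumed limit, not taken from the PR
MAX_REGISTERS_PER_THREAD = 255   # assumed hardware cap


def register_budget(threads_per_block: int) -> int:
    """Most registers each thread may use if the whole block is to fit."""
    return min(MAX_REGISTERS_PER_THREAD, REGISTERS_PER_BLOCK // threads_per_block)


def launch_fits(registers_per_thread: int, threads_per_block: int) -> bool:
    """True if the block's total register demand fits in the assumed limit."""
    return registers_per_thread * threads_per_block <= REGISTERS_PER_BLOCK


threads = 32 * 32  # hypothetical block_dim_x * block_dim_y
budget = register_budget(threads)
print(f"budget per thread for {threads} threads: {budget}")
print("96 registers per thread fits:", launch_fits(96, threads))
print("64 registers per thread fits:", launch_fits(64, threads))
```

Without the bound, the compiler is free to pick a per-thread register count that only works for smaller blocks.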

I managed to get a Blackwell machine and verified that the fix works. The fix was verified using this repro that I got from @drisspg

<details>
<summary> Repro script that fails on Blackwell </summary>

```
import torch
from torch.nn import init
# from transformer_nuggets import init_logging
# from transformer_nuggets.utils.benchmark import profiler
# from pathlib import Path

# init_logging()

class PermuteModule(torch.nn.Module):
    def __init__(self, permutation):
        super(PermuteModule, self).__init__()
        self.permutation = permutation
    def forward(self, x:torch.Tensor) -> torch.Tensor:
        assert len(x.shape) == len(self.permutation), f"Dimension mismatch! Unable to permute {len(x.shape)} dim input with a {len(self.permutation)} dim permutation!"
        return x.permute(*self.permutation)

def test(n_layers:int, conv_stride:int):
    _sequence = []
    for _ in range(n_layers):
        # Conv1d inputs are (N x C x L), LayerNorm expects (* x C). Dims must be permuted between modules.
        _sequence += [
            PermuteModule((0,2,1)),
            torch.nn.Conv1d(in_channels=512, out_channels=512, groups=1, kernel_size=9, dilation=1, stride=conv_stride, padding=0, bias=False),
            PermuteModule((0,2,1)),
            torch.nn.LayerNorm(512),
            torch.nn.ReLU()
        ]
    model = torch.nn.Sequential(*_sequence).to(device="cuda")
    data = torch.randn((100,2048,512), device="cuda")
    out = model(data)
    loss = torch.nn.functional.mse_loss(out, torch.rand_like(out))
    loss.backward()

torch.autograd.set_detect_anomaly(True)
print(f"Torch version: {torch.__version__}")

# with profiler(Path("conv")):
#     # print(f"layers=1, stride=1")
#     # test(n_layers=1, conv_stride=1)
#     # print(f"layers=2, stride=1")
#     # test(n_layers=2, conv_stride=1)
#     # print(f"layers=1, stride=2")
#     # test(n_layers=1, conv_stride=2)
#     print(f"layers=2, stride=2")
#     test(n_layers=2, conv_stride=2)

print(f"layers=2, stride=2")
test(n_layers=2, conv_stride=2)
# we will not reach this print statement.
print("DONE.")
```

</details>

I also re-ran my performance benchmark and found no regressions over the previous PR.

# Full description of the old PR

Original PR: https://github.com/pytorch/pytorch/pull/148605

This PR adds a new kernel for producing gamma and beta values for the backward pass in a performant way.

To test the performance against the baseline, I measured the backward pass of layernorm while sweeping over the following variables:

1. dtype in {half, float}
2. M in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
3. N in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
4. Whether we flush the L2 cache before running the backward pass
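
The sweep above can be sketched as a Cartesian product. The exponent range is elided in the description, so `EXPONENTS` below is a hypothetical stand-in, and the function name is illustrative:

```python
from itertools import product


def sweep_sizes(exponents):
    """Return 2**k - 1, 2**k and 2**k + 1 for each k, deduplicated and sorted."""
    sizes = set()
    for k in exponents:
        sizes.update((2**k - 1, 2**k, 2**k + 1))
    return sorted(size for size in sizes if size > 0)


EXPONENTS = range(1, 4)  # hypothetical; the real range is not given
sizes = sweep_sizes(EXPONENTS)

# One configuration per (dtype, M, N, flush_l2_cache) combination.
configs = list(product(("half", "float"), sizes, sizes, (False, True)))
print(sizes)
print(len(configs))
```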

Summary: The new code performs better than the old code, especially for powers of 2. For the M >> N case, it performs very well (the kernel itself can be 30x faster and the overall backward pass can be 5-10x faster).

To visualize the results of the kernel for different values of M, N and dtype, I wrote some code to generate a heatmap. The heatmap has N on the x-axis and M on the y-axis, with color-coded points where green shows performance improvements and red shows regressions. For example, `m=32 n=2048 1.42x` in the heatmap would indicate that the normalized shape had 32 elements, the leading dimensions' product was 2048 elements, and the new kernel resulted in the *backward pass* being 1.42x faster than the old *backward pass*.

Important note: This heatmap shows the total backward pass time as seen by the user. The difference in kernel time can sometimes be very large even when the difference in total backward pass time is modest. For example, for the dtype=torch.half, M=32, N=2048, flush_l2_cache=True case, the heatmap shows a speedup of 1.42x, while ncu reports that the new kernel is 2.5x faster than the old one:

M=32 N=2048 dtype=half flush_l2=True Old Kernel NCU summary:
```
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.59
    SM Frequency                    Ghz         1.35
    Elapsed Cycles                cycle       27,526
    Memory Throughput                 %         2.21
    DRAM Throughput                   %         0.54
    Duration                         us        20.42
    L1/TEX Cache Throughput           %         4.31
    L2 Cache Throughput               %         2.62
    SM Active Cycles              cycle     1,475.02
    Compute (SM) Throughput           %         0.29
    ----------------------- ----------- ------------
```

M=32 N=2048 dtype=half flush_l2=True New Kernel NCU summary:
```
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.59
    SM Frequency                    Ghz         1.34
    Elapsed Cycles                cycle       10,920
    Memory Throughput                 %         5.64
    DRAM Throughput                   %         1.35
    Duration                         us         8.13
    L1/TEX Cache Throughput           %         1.92
    L2 Cache Throughput               %         6.89
    SM Active Cycles              cycle     3,554.41
    Compute (SM) Throughput           %         0.67
    ----------------------- ----------- ------------
```
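
The kernel-level ratio can be checked from the two NCU summaries. This is a small sketch using only the Duration and Elapsed Cycles rows quoted above; the variable names are illustrative:

```python
# Values copied from the old and new kernel NCU summaries.
old_duration_us = 20.42
new_duration_us = 8.13
old_elapsed_cycles = 27526
new_elapsed_cycles = 10920

duration_speedup = old_duration_us / new_duration_us
cycle_speedup = old_elapsed_cycles / new_elapsed_cycles

print(f"speedup by duration: {duration_speedup:.2f}x")
print(f"speedup by elapsed cycles: {cycle_speedup:.2f}x")
```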

Let's look at some rows from the heatmap. For dtype=float16 flush_l2_cache=True and when input shapes are powers of 2, we get the following:

<img width="1508" alt="image" src="https://github.com/user-attachments/assets/06179599-b2f0-4a45-8664-247a1067950b" />

There are 3 columns -- the first shows all data points, the second shows speedups only and the 3rd column shows regressions only. We can see that there are dramatic speedups for M >> N cases and the regressions are not that high (less than 1%, which could just be measurement noise). Here is a small guide I made:

![image](https://github.com/user-attachments/assets/90c26f7c-e3ad-46d2-a6ce-fe4b5fb3d738)

For dtype=float32, we get a similar chart:

<img width="1499" alt="image" src="https://github.com/user-attachments/assets/c4d31a76-03b0-426c-9114-e1bfad29b530" />

The new code performs especially well for m >> n cases, and also where m and n are small. The m >> n case is special because we run 2 reduction kernels back to back and parallelize in the "M" dimension (the older kernel only parallelized in the "N" dimension).

The new code can sometimes have regressions for non-powers of 2. That is because the old code was using block sizes of {16, 32} while we have `threads.x = 32`. For example when N=33, the old code would have 3 blocks and we will have 2 blocks. I wrote some code to specialize for this case, but I think it will add complexity and @ngimel mentioned that non-powers of 2 are rare enough.
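
The block counts in the N=33 example follow from a ceiling division. This is a minimal sketch, assuming the old code picks the 16-wide block for this size; the function name is illustrative:

```python
def num_blocks(n: int, block_size: int) -> int:
    """Blocks needed to cover n columns: ceil(n / block_size)."""
    return (n + block_size - 1) // block_size


n = 33
old_blocks = num_blocks(n, 16)  # old code with a 16-wide block
new_blocks = num_blocks(n, 32)  # new code with threads.x = 32

# Threads launched along N that have no column to work on.
old_idle = old_blocks * 16 - n
new_idle = new_blocks * 32 - n
print(old_blocks, new_blocks, old_idle, new_idle)
```

For sizes just above a multiple of 32, the wider block leaves more threads without work, which is consistent with the regressions mentioned above.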

I am including the regressions here for completeness' sake:

<img width="1500" alt="image" src="https://github.com/user-attachments/assets/31c17cfb-ed9b-4106-b9c8-5c359751f530" />

To see this better:

1. Click the image
2. Right click the expanded image and open in a new tab
3. Go to that tab and left click once to zoom in

If you want to see the full data, here it is:

![image](https://github.com/user-attachments/assets/54fb60c9-8c0c-4530-a1dd-79ecda1a69a1)

I also measured binary size and compile time since those are important for developers:

Binary size comparison

![image](https://github.com/user-attachments/assets/ceef5073-1036-47f6-b9dc-cea088beda51)

```
# Original
-rwxr-xr-x 1 ahmads users 307193112 Mar  6 08:46 ./torch/lib/libtorch_cuda.so

# This PR
-rwxr-xr-x 1 ahmads users 307193112 Mar  6 08:46 ./torch/lib/libtorch_cuda.so
```

The diff in bytes is 302kB which is about a 0.1% increase.

Compile time difference:

```
# Original

real    0m10.931s
user    0m9.676s
sys     0m1.004s

# this PR

real    0m16.720s
user    0m15.514s
sys     0m1.066s

# Command I ran
time /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASH_NAMESPACE=pytorch_flash -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUNFUSE_FMA -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/third_party/flash-attention/csrc/flash_attn/src -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. 
-I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda 
-DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/layer_norm_kernel.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o

```

So the new PR increases compile time by 6 seconds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150625
Approved by: https://github.com/ngimel
2025-04-03 22:07:43 +00:00
2abd81402f [validations] Run nccl version check on Linux only (#150635)
Follow-up to https://github.com/pytorch/pytorch/pull/150194 to disable the nccl version print on OSes other than Linux.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150635
Approved by: https://github.com/clee2000
2025-04-03 22:06:58 +00:00
941090a791 Make sure torch.compiler._is_compiling_flag=True in aoti (#150588)
Summary: See internal Diff summary

Differential Revision: D72355449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150588
Approved by: https://github.com/angelayi
2025-04-03 22:02:29 +00:00
5a654deb40 Revert "Enable C++ dynamic shape guards by default (#140756)"
This reverts commit c1d503529d23f33bc0819286df8d0ecbe31b559f.

Reverted https://github.com/pytorch/pytorch/pull/140756 on behalf of https://github.com/isuruf due to new test test_runtime_checks_large hangs on CI ([comment](https://github.com/pytorch/pytorch/pull/140756#issuecomment-2776979814))
2025-04-03 21:44:41 +00:00
d41c22b578 Revert "[fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)" (#150542)
Reverts #148261 due to possible memory leak

This reverts commit 5d4e7d58b42623a9024a84f0050967ff0318dcdb.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150542
Approved by: https://github.com/clee2000
2025-04-03 21:15:38 +00:00
277369ac16 Move formulas on separate line in loss.py (#150565)
Move formulas on separate line in loss.py for better readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150565
Approved by: https://github.com/mikaylagawarecki
2025-04-03 20:47:35 +00:00
a3f9e04656 [export] Make aoti_call_delegate hop traceable (#148804)
Summary: The `aoti_call_delegate` hop now uses a stateless `original_gm` for tracing with fake tensors and the OSS AOTI Runner for running with real tensors

Differential Revision: D70738393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148804
Approved by: https://github.com/SherlockNoMad
2025-04-03 20:44:31 +00:00
51da241c0a [aoti] Fix cannot determine truth value of Relation error when propagating unbacked symint in lowering (#150570)
Summary: Fix  cannot determine truth value of Relation error when propagating unbacked symint in lowering

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts
```

Differential Revision: D72331070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150570
Approved by: https://github.com/angelayi, https://github.com/henryoier
2025-04-03 20:06:15 +00:00
c1d503529d Enable C++ dynamic shape guards by default (#140756)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140756
Approved by: https://github.com/anijain2305
ghstack dependencies: #149149, #149197, #149211
2025-04-03 20:03:52 +00:00
1843ad458d [Inductor] Cache CUDA compilation errors (#149716)
Summary: Add support for caching of CUDA (nvcc) compilation errors to codecache.py
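
A minimal sketch of the idea, not the actual codecache.py implementation: a compile cache that remembers failures as well as successes, so a source that failed once is not recompiled. `CachedCompiler`, `CompileError`, and `compile_fn` are hypothetical names used only for illustration.

```python
import hashlib


class CompileError(Exception):
    """Hypothetical stand-in for an nvcc compilation failure."""


class CachedCompiler:
    """Caches both successful results and compilation errors, keyed by source hash."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._cache = {}  # key -> ("ok", result) or ("error", message)
        self.compile_calls = 0  # number of times the real compiler ran

    def compile(self, source):
        key = hashlib.sha256(source.encode()).hexdigest()
        if key not in self._cache:
            self.compile_calls += 1
            try:
                self._cache[key] = ("ok", self._compile_fn(source))
            except CompileError as exc:
                # Remember the failure so the same source is not recompiled.
                self._cache[key] = ("error", str(exc))
        status, value = self._cache[key]
        if status == "error":
            raise CompileError(value)
        return value
```

With this shape, a second request for a source that already failed re-raises the stored error without invoking the compiler again.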

Test Plan: CI ( for example Cutlass backend unit tests )

Reviewed By: ColinPeppler

Differential Revision: D71562040

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149716
Approved by: https://github.com/ColinPeppler
2025-04-03 19:47:27 +00:00
3b02f795c5 Add torch._scaled_mm for CPU (#150410)
This PR is a duplicate of https://github.com/pytorch/pytorch/pull/139975.

This PR adds torch._scaled_mm for the CPU backend.

_scaled_mm_out_cpu and _scaled_mm_cpu are newly added and included in the torch._scaled_mm CPU dispatch. We also add _scaled_mm_out_cpu_emulated as a fallback function for platforms that cannot run FP8 matmul using oneDNN. This PR also updates the various FP8-related UTs to support CPU tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150410
Approved by: https://github.com/atalman
2025-04-03 19:43:45 +00:00
96f35f55e2 update get start xpu document for v2.7 (#150397)
Update the XPU getting-started document for v2.7.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150397
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/atalman

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-03 18:17:08 +00:00
78d1165d76 [DTensor][tp] fix errors in FSDP+TP checkpointing test (#150354)
## Summary
Remove the `tp_parallelize_plan` assignment that accidentally overwrites the previous assignments in `test_fsdp_dsd.py`.

## Test
`pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150354
Approved by: https://github.com/wconstab
2025-04-03 17:41:46 +00:00
5d36253a7d Refactoring: fix the python constant check (#150608)
As the title stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150608
Approved by: https://github.com/Skylion007
2025-04-03 17:33:45 +00:00
fa0fdc0cca if blaslt fails, fall back to blas (#150147)
Fixes #150016.

This is implemented for both cublaslt and hipblaslt. On failure, gemm_and_bias falls back to the unfused path, and lt gemm falls back to gemm even if the gemm preference is set to lt.
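
The fallback pattern can be sketched in a few lines. This is only an illustration in Python: the actual change lives in the C++ BLAS wrappers, and `run_with_fallback`, `failing_lt_gemm`, and `plain_gemm` are hypothetical names. The sketch also assumes failures surface as `RuntimeError`.

```python
def run_with_fallback(primary, fallback, *args):
    """Try the preferred (e.g. Lt) path first; on failure, use the fallback path."""
    try:
        return primary(*args)
    except RuntimeError:
        # Assumption of this sketch: a failed Lt call raises RuntimeError.
        return fallback(*args)


def plain_gemm(a, b):
    """Reference matrix multiply on nested lists, standing in for the non-Lt path."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def failing_lt_gemm(a, b):
    """Stand-in for an Lt call that cannot find a usable algorithm."""
    raise RuntimeError("Lt path failed")


result = run_with_fallback(failing_lt_gemm, plain_gemm, [[1, 2], [3, 4]], [[5, 6], [7, 8]])
```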

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150147
Approved by: https://github.com/malfet
2025-04-03 16:18:59 +00:00
5be5cfe4cb [inductor][autotune cache] add torch_key() to configs hash (#150494)
Summary:
**Context**: https://github.com/pytorch/pytorch/pull/150122 (D71982587 - let's call this "the WS diff") introduces "bc/fc-breaking" cache changes.

In particular, it introduces `num_consumer_groups` and adds it to the cached config. In versions of torch that include the WS diff, `num_consumer_groups` is treated as a class variable on a triton.Config object (i.e. `triton.Config({..kwargs..}, num_consumer_groups=num_consumer_groups, ...`). And in versions of torch that don't include the WS diff, you generally don't expect to see this kwarg.

But if a program is run with WS-torch (i.e. torch w/ the WS diff), and then later you run the same program with non-WS-torch, then non-WS-torch is going to find this autotune cache entry and interpret `num_consumer_groups` as a kwarg, because there's no special handling for num_consumer_groups in this version of torch. Then the program crashes with a triton failure message.

**The fix**: add the torch version / torch key into the hash, so that any changes to inductor will invalidate the cache (ensuring that other changes to triton_heuristics won't cause these bc/fc issues).
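
A minimal sketch of why mixing a version key into the hash invalidates stale entries. It is not the inductor implementation: `configs_hash` and the byte-string keys are hypothetical, and the real torch key is computed differently.

```python
import hashlib
import json


def configs_hash(configs, version_key):
    """Hash autotune configs together with a build/version key (hypothetical helper)."""
    h = hashlib.sha256()
    h.update(version_key)  # any change to the build changes the digest
    h.update(json.dumps(configs, sort_keys=True).encode())
    return h.hexdigest()


configs = [{"BLOCK_M": 64, "BLOCK_N": 64, "num_warps": 4}]
hash_old_build = configs_hash(configs, b"build-A")
hash_new_build = configs_hash(configs, b"build-B")
```

Because the two digests differ, a cache entry written by one build is never looked up by the other, which sidesteps the bc/fc problem described above.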

Test Plan: D72285868 (or https://gist.github.com/davidberard98/2ea697eb550c94d0d1948fedb5c5c7d8, but this doesn't repro in OSS because this version of warp specialization is not available in oss triton) can repro the failure, and the failure is fixed after this PR is patched.

Also, added a test in test/inductor/test_codecache.py which verifies that there's no cache hit if the torch_key changes (and verified that without the functional changes in this PR, the test fails).

Differential Revision: D72285303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150494
Approved by: https://github.com/oulgen
2025-04-03 16:01:57 +00:00
440c07e56a Fix detection of GPU multicast (#150563)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150563
Approved by: https://github.com/kwen2501
2025-04-03 15:31:15 +00:00
5314a6fe82 [export] Fix deserialization issue (#150515)
An internal model was serialized in 2023, and is now breaking while loading with the following error:
```
  File "<eval_with_key>.1675", line 4
    def forward(self, arg1163_1, arg1164_1, , arg1166_1, , arg1168_1, arg1169_1, arg1170_1, , arg1172_1, arg1173_1, arg1174_1, arg1175_1, arg1176_1, arg1177_1, arg1178_1, arg1179_1, arg1180_1, arg1181_1, arg1182_1, arg1183_1, arg1184_1, arg1185_1, arg1186_1, arg1187_1, arg1188_1, arg1189_1, arg1190_1, arg1191_1, arg1192_1, arg1193_1, arg1194_1, arg1195_1, arg1196_1, arg1197_1, arg1198_1, arg1199_1, arg1200_1, arg1201_1, arg1202_1, arg1203_1, arg1204_1, arg1205_1, arg1206_1, arg1207_1, arg1208_1, arg1209_1, arg1210_1, arg1211_1, arg1212_1, arg1213_1, arg1214_1, arg1215_1, arg1216_1, , arg1218_1, arg1219_1, arg1220_1, arg1221_1, arg1222_1, arg1223_1, arg1224_1, , arg1226_1, arg1227_1, arg1228_1, , arg1230_1, , , , , , , , , , , , , , , ):
                                            ^
SyntaxError: invalid syntax
```

The syntax errors are due to inputs that are `None` when exporting. Prior to changes in https://github.com/pytorch/pytorch/pull/123590 (landed 4/2024), input specs for none inputs look like `InputSpec(userInput=UserInputSpec(arg=Argument(asNone=True)))`, and during deserialization when creating a node, we would just use a dummy name `arg`. After those changes, the input specs for none inputs look like `InputSpec(constantInput=InputToConstantInputSpec(name='y', value=ConstantValue(asNone=True)))`, and when creating a node we would use the name `y` as the name. However, the PR didn't handle the case of loading an old package which doesn't have this name, so it ended up putting empty names in the placeholder nodes.

This error was uncovered after https://github.com/pytorch/pytorch/pull/149717, where we now use the GraphModule's python codegen to run the UnflattenedModule instead of going through the interpreter path. The placeholder nodes having empty names caused the python codegen to fail.
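
The failure mode is easy to reproduce in isolation. The sketch below is an illustration only and not necessarily how this PR fixes it: `build_signature` and `fill_missing_names` are hypothetical helpers, and the `arg_N` dummy naming is an assumption.

```python
import ast


def build_signature(names):
    """Render a forward() definition the way a naive codegen would."""
    return f"def forward(self, {', '.join(names)}):\n    pass\n"


def fill_missing_names(names, prefix="arg"):
    """Replace empty placeholder names with unique dummy names (hypothetical repair)."""
    used = {n for n in names if n}
    fixed, counter = [], 0
    for name in names:
        if not name:
            while f"{prefix}_{counter}" in used:
                counter += 1
            name = f"{prefix}_{counter}"
            used.add(name)
        fixed.append(name)
    return fixed


old_names = ["arg1163_1", "", "arg1166_1", ""]
try:
    ast.parse(build_signature(old_names))
    broken = False
except SyntaxError:
    broken = True  # empty names yield ", ," in the signature

repaired = fill_missing_names(old_names)
ast.parse(build_signature(repaired))  # parses once every placeholder has a name
```
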
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150515
Approved by: https://github.com/yushangdi
2025-04-03 15:27:45 +00:00
a72b4eb806 Support windows in C++ shape guards (#149211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149211
Approved by: https://github.com/anijain2305
ghstack dependencies: #149149, #149197
2025-04-03 14:42:08 +00:00
f9a7eac718 use python fallback if there are overflows (#149197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149197
Approved by: https://github.com/anijain2305
ghstack dependencies: #149149
2025-04-03 14:39:03 +00:00
ff783f062a Fix shape guard failure to be valid python (#149149)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149149
Approved by: https://github.com/anijain2305
2025-04-03 14:36:17 +00:00
70b34a42c1 Add new dependences for gen_pyi.py (#150391)
As the title stated.

When we update some functions in _torch_docs.py or _tensor_docs.py and execute a command (like ``python setup.py develop``) to install the latest version, the description of the function we just changed is not updated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150391
Approved by: https://github.com/Skylion007, https://github.com/peterbell10
2025-04-03 14:18:18 +00:00
781d28e265 add unit test for preferred_blas_library settings (#150581)
Follow up to #150212 that was committed without a unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150581
Approved by: https://github.com/atalman
2025-04-03 13:27:50 +00:00
cbc901fac3 Implement raise ... from ... (#148766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148766
Approved by: https://github.com/zou3519
2025-04-03 13:15:31 +00:00
e0d19cf6cc Enable weekly test for operator benchmark (#150502)
To regularly track the performance of the operator benchmark, enable the weekly test.

Hi, @huydhn, as you mentioned in https://github.com/pytorch/pytorch/pull/143733#issuecomment-2578317520, we could integrate the performance data from the weekly test into the OSS benchmark database for the dashboard.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150502
Approved by: https://github.com/huydhn
2025-04-03 12:17:19 +00:00
5d9c7f78e7 [fbcode]Removing @NoIntBaseDeprecated annotation in evaluation.thrift file (#150271)
Summary: #buildall

Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/training_toolkit/applications/bulk_eval/tests:evaluator_test -- --exact 'caffe2/torch/fb/training_toolkit/applications/bulk_eval/tests:evaluator_test - test_setup_evaluation_utils (caffe2.torch.fb.training_toolkit.applications.bulk_eval.tests.evaluator_test.EvaluatorTest)'
```

Differential Revision: D72028940

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150271
Approved by: https://github.com/huydhn
2025-04-03 12:01:59 +00:00
d4c30b4599 [AOTI][dashboard] Update how peak memory is measured (#150534)
Summary: In the dashboard measurement script, AOTI needs to run Eager first to register the output pytree, so the peak memory compression ratio on the dashboard is always close to 1. Update AOTI run to use an extra warmup run, so the peak memory compression ratio measures the result at the run time instead of the compile time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150534
Approved by: https://github.com/yushangdi
2025-04-03 12:01:43 +00:00
6fa1b17195 ROCm: Add trailing comma for consistency in gfx architecture list (#150250)
Adding trailing comma for consistency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150250
Approved by: https://github.com/petrex, https://github.com/jeffdaily, https://github.com/cyyever
2025-04-03 10:58:48 +00:00
e6e07ec1cf [ROCm] code cleanup of architecture checks (#150473)
This PR replaces several calls to `at::cuda::getCurrentDeviceProperties()->gcnArchName` and `at::cuda::getDeviceProperties(device_index)->gcnArchName` when checking to see if the GPU architecture is in a certain list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150473
Approved by: https://github.com/jeffdaily, https://github.com/cyyever
2025-04-03 09:51:06 +00:00
9e106019f6 [XPU] Add an implict conversion from XPUStream to sycl::queue* (#148646)
# Motivation

Currently, in Pytorch XPU, `cudaStream_t` is mapped to `sycl::queue&`, so an implicit cast from `XPUStream` to `sycl::queue&` is provided just like `CUDAStream` has an implicit cast to `cudaStream_t`.

But on the SYCLomatic side, we migrate `cudaStream_t` to `sycl::queue*` rather than `sycl::queue&`. (One reason is that `cudaStream_t` is actually a pointer, so users can do anything with that integer. Another reason is that the early `sycl::queue` was not implemented by a pointer, so copying by value is not desirable.)

Without this PR:
```
cudaStream_t a = getCurrentCUDAStream();
cudaStream_t b = getCurrentCUDAStream().stream();
```
needs to be migrated to:
```
queue_ptr a = &(sycl::queue&)getCurrentXPUStream();
queue_ptr b = &(getCurrentXPUStream().queue());
```
With this PR:
```
queue_ptr a = getCurrentXPUStream();
queue_ptr b = &(getCurrentXPUStream().queue());
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148646
Approved by: https://github.com/guangyey, https://github.com/EikanWang
2025-04-03 08:12:38 +00:00
c067127d47 Ensure cuda_dlink_post_cflags are quoted as well (#150151)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150151
Approved by: https://github.com/janeyx99
2025-04-03 06:50:22 +00:00
fc674b45d4 [c10d] Add logging for desync debug report (#150513)
Summary: We want to add logging to first understand the distribution of desync debug reports.

Test Plan: Test with logger staging

Differential Revision: D72249281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150513
Approved by: https://github.com/kwen2501
2025-04-03 06:42:06 +00:00
90ddb33141 [export] specialize for aten.to (#149235)
Changes decomposition behavior of `aten.to` to respect the aliasing/non-aliasing behavior in eager, and to specialize to the input/conversion dtype & device.

Before change: we always decompose `aten.to` into `_to_copy`, regardless of aliasing behavior. This leads us to ban mutations on the result of `_to_copy` when aliased, since we can't guarantee correct program semantics. This meant users had to explicitly call `.clone()` before mutating. In the special cases where we don’t ban mutations (e.g. dtype conversion), we add runtime assertions on the input & conversion dtype/devices in the decomposed program (see https://github.com/pytorch/pytorch/pull/142420).

After change: we decompose to the aliasing/non-aliasing behavior that matches eager, allowing mutations in all cases. We also add dtype/device assertions for all `aten.to` ops, starting in the pre-dispatch graph, basically specializing the program to the dtype/devices.

Differential Revision: D71229547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149235
Approved by: https://github.com/tugsbayasgalan
2025-04-03 05:20:10 +00:00
2e5d95a082 [FlexAttention] Remove dead code (#150575)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150575
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng
2025-04-03 01:46:19 +00:00
77dca3947e [aoti] make a check function for each input (#150553)
Summary: make a check function for each input to avoid the "too large to optimize" error on `__check_inputs_outputs`

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r runtime_checks
```

Differential Revision: D72286280

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150553
Approved by: https://github.com/desertfire
2025-04-03 00:55:35 +00:00
13f48197d2 Add Chillee as core reviewer (#150579)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150579
Approved by: https://github.com/albanD, https://github.com/drisspg, https://github.com/malfet
2025-04-03 00:40:06 +00:00
f363fe616d [AOTInductor] Fix autotuning code's codegen (#150522)
Summary:
Codegen used to generate tmp_arg_{index} as temporary args, where index is the position in the caller.
We changed the codegen logic so that we can reuse previously generated samples and only delete an arg once it is no longer used. In this case, we need to make {index} unique, since different functions could reuse the same "tmp_arg_{index}" name string while referring to different args.
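
A small sketch of the naming problem, under the assumption that names were previously derived from the argument position alone. `names_by_position` and `TmpArgNamer` are hypothetical and only illustrate the collision and one way to avoid it.

```python
import itertools


def names_by_position(num_args):
    """Old-style naming: the index is just the position within one call."""
    return [f"tmp_arg_{i}" for i in range(num_args)]


class TmpArgNamer:
    """Hands out globally unique temporary names across all calls."""

    def __init__(self):
        self._counter = itertools.count()

    def fresh_names(self, num_args):
        return [f"tmp_arg_{next(self._counter)}" for _ in range(num_args)]


# Two different kernels, each with two args, collide under positional naming.
collisions = set(names_by_position(2)) & set(names_by_position(2))

namer = TmpArgNamer()
first_call = namer.fresh_names(2)
second_call = namer.fresh_names(2)
```

With a shared counter, a name that is still alive from an earlier call can never be shadowed by a later one.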

Test Plan: `python test/inductor/test_aot_inductor.py -k test_autotuning_args_reuse`

Differential Revision: D72297084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150522
Approved by: https://github.com/desertfire, https://github.com/22quinn
2025-04-03 00:08:19 +00:00
24f50653c8 fix bug in logging code (#150518)
Fixes https://github.com/pytorch/pytorch/issues/150379

```python
>>> key = "aten._int_mm_1_2_3"
>>> m, n, k = key.split("_")[-3:]
>>> m, n, k
('1', '2', '3')
>>> name = "_".join(key.split("_")[:-3])
>>> name
'aten._int_mm'
```
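
The same split can also be written with `str.rsplit`, which keeps underscores inside the op name intact. This is only a sketch; `split_autotune_key` is a hypothetical helper, not a function from the PR.

```python
def split_autotune_key(key):
    """Split '<op name>_<m>_<n>_<k>' from the right so the op name may contain '_'."""
    name, m, n, k = key.rsplit("_", 3)
    return name, int(m), int(n), int(k)


parsed = split_autotune_key("aten._int_mm_1_2_3")
```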

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150518
Approved by: https://github.com/xmfan
2025-04-02 23:39:06 +00:00
61a1f09b5b Revert "[cuda] Add new faster gammabeta backward kernel (#148605)"
This reverts commit 114d404b0720e8073748690faeb96449e5c0b229.

Reverted https://github.com/pytorch/pytorch/pull/148605 on behalf of https://github.com/drisspg due to See https://github.com/pytorch/pytorch/issues/150266#issuecomment-2773907902 for more details ([comment](https://github.com/pytorch/pytorch/pull/148605#issuecomment-2773928838))
2025-04-02 23:14:11 +00:00
de15ef0ee8 [invoke_subgraph] Force grad_outs to be contiguous at tracing time (#150561)
I am unable to come up with a testcase. It passes many end-to-end tests that fail with ReshapeError at https://ossci-raw-job-status.s3.amazonaws.com/log/39717218372

![image](https://github.com/user-attachments/assets/8509b485-3897-4538-968b-bbe05af63a59)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150561
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
ghstack dependencies: #150082, #150450, #150486, #150556
2025-04-02 22:59:08 +00:00
0198e44f37 Update torch-xpu-ops commit pin to 98c808d (#150554)
Update the torch-xpu-ops commit to [98c808dea6de7330c415aa777d6921944cf79887](98c808dea6), which includes:

- Fixes #150001 by removing pre-CXX11 ABI logic from build script for XPU
- Fixes #150430
- Fixes XCCL build issue caused by PR #150398

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150554
Approved by: https://github.com/EikanWang, https://github.com/malfet
2025-04-02 22:42:18 +00:00
8667a00979 Add stride + dtype to autotune results (#150419)
Add stride/dtype info to autotune gemm results. New output header:

`AUTOTUNE mm(1024x1024, 1024x7680)`
`strides: [1, 1024], [7680, 1]`
`dtypes: torch.bfloat16, torch.bfloat16`
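
For illustration, a header of this shape could be assembled as follows. This is a sketch only: `format_autotune_header` is a hypothetical helper, and dtypes are passed as plain strings to keep the example free of a torch dependency.

```python
def format_autotune_header(op, shapes, strides, dtypes):
    """Build the three-line autotune header from per-input shape, stride, and dtype."""
    sizes = ", ".join("x".join(str(d) for d in shape) for shape in shapes)
    stride_str = ", ".join(str(list(s)) for s in strides)
    dtype_str = ", ".join(dtypes)
    return "\n".join(
        [f"AUTOTUNE {op}({sizes})", f"strides: {stride_str}", f"dtypes: {dtype_str}"]
    )


header = format_autotune_header(
    "mm",
    shapes=[(1024, 1024), (1024, 7680)],
    strides=[(1, 1024), (7680, 1)],
    dtypes=["torch.bfloat16", "torch.bfloat16"],
)
```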

Differential Revision: [D72253313](https://our.internmc.facebook.com/intern/diff/D72253313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150419
Approved by: https://github.com/eellison
2025-04-02 22:36:38 +00:00
0bacb90a9c [invoke_subgraph][min-cut partitioner] Fix bug to use the correct root module (#150556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150556
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
ghstack dependencies: #150082, #150450, #150486
2025-04-02 22:35:00 +00:00
a677b491c9 [Profiler] Fix Empty C Call Queue (#150370)
Summary:
My commandeer of https://github.com/pytorch/pytorch/pull/150102

Based on the description of that PR, it seems that we need to add C calls for each starting Python event with a callable, so that when tracing exits we have a matching enter for any given exit. At worst this adds some unnecessary events, but it prevents segfaults/failures. My PR just cleans up some of the refcount implementation and logging.

Contributors: @arjun-choudhry

Test Plan: Ran resnet test internally. Will check CI and ask reviewers to make sure it resolves their issues.

Differential Revision: D72207570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150370
Approved by: https://github.com/aaronenyeshi
2025-04-02 22:25:46 +00:00
74aa9f571c ci: Use cache / progress when local docker build (#150551)
It's a bit annoying to try and work on these locally when the cache /
progress isn't being used so let's just set it so that those flags are
only valid when in CI directly.

`${CI}` is a default environment variable that's defined by actions
itself.

See https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/store-information-in-variables#default-environment-variables
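
The gating logic can be sketched as below. The real change is in a shell build script; this Python version is only an illustration, `docker_build_args` is a hypothetical helper, and the choice of `--no-cache` and `--progress=plain` as the CI-only flags is an assumption.

```python
import os


def docker_build_args(env=None):
    """Return docker build arguments, adding CI-only flags when CI is set."""
    env = os.environ if env is None else env
    args = ["docker", "build"]
    if env.get("CI"):
        # Assumed CI-only flags: no layer cache, plain (non-interactive) progress.
        args += ["--no-cache", "--progress=plain"]
    return args


ci_args = docker_build_args({"CI": "true"})
local_args = docker_build_args({})
```

Locally, `CI` is normally unset, so the build keeps the layer cache and the default progress output.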

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150551
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi, https://github.com/atalman
2025-04-02 22:08:57 +00:00
1017927c83 multidimensional slicing (#150104)
Differential Revision: D71962884

Fixes #150057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150104
Approved by: https://github.com/angelayi
2025-04-02 20:57:16 +00:00
bb98749230 [dynamo] Always trace into tensor subclass __torch_function__ (#149792)
This patch effectively ignores traceable_tensor_subclasses, allowing
Dynamo to always try tracing into the `__torch_function__` of tensor
subclass. This helps us with 2 things:
1. allowing users to directly benefit from better compilation of tensor
   subclass, by just upgrading pytorch, without having to change legacy
   library code (see earlier patches in the stack for examples).
2. potentially exposing more issues in compiling tensor subclass, so we
   can get signals and improve them.

As a consequence, it exposed 2 subtle bugs, which this patch also fixes:
1. In `build_torch_function_fn`, we could get
   `torch._C._disabled_torch_function_impl` because we have a
   `Parameter` subclass without `__torch_function__` override or if we
   have a tensor subclass with `__torch_dispatch__` override. We graph
   break on this for now, and plan to add support -- the logic for
   simulating `torch._C._disabled_torch_function_impl` is already in
   `SuperVariable`, we just need to reuse it.
2. Sometimes we create `SyntheticLocalSource` and need to remove all the
   guards installed on it, but we only removed the ones whose source
   _is_ the created synthetic source `s`, and forgot about chained
   sources like `s.foo`; this showed up as
   `SYNTHETIC_LOCAL['tmp_0'].__torch_function__.__func__`.
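
The second bug can be illustrated with a toy model of chained sources. This is a sketch under stated assumptions, not Dynamo's data structures: `Source`, `is_rooted_at`, and `drop_guards` are hypothetical, and each guard is represented directly by its source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    """Toy source: a name plus an optional base it is chained from."""
    name: str
    base: Optional["Source"] = None


def is_rooted_at(source, root):
    """True if `source` is `root` or is chained (at any depth) from `root`."""
    while source is not None:
        if source == root:
            return True
        source = source.base
    return False


def drop_guards(guard_sources, root):
    """Remove guards on `root` and on every source chained from it."""
    return [s for s in guard_sources if not is_rooted_at(s, root)]


synthetic = Source("SYNTHETIC_LOCAL['tmp_0']")
chained = Source("__func__", Source("__torch_function__", synthetic))
unrelated = Source("L['x']")

# Comparing only against `synthetic` itself would leave `chained` behind.
remaining = drop_guards([synthetic, chained, unrelated], synthetic)
```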

Differential Revision: [D71906141](https://our.internmc.facebook.com/intern/diff/D71906141)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149792
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #149482, #149483, #149484
2025-04-02 20:57:00 +00:00
3463ea1059 [dynamo] Support tensor subclass with overriden tensor methods and properties (#149484)
This fixes most of the "torch.compile X tensor-subclass" issues
encountered in https://github.com/city96/ComfyUI-GGUF/issues/118. The
relevant tensor subclass definition is here:
298192ed60/ops.py (L18-L65).

A few things to note about the tensor subclass:
1. it overrides a lot of the `torch.Tensor` methods (e.g., `to`,
   `clone`), so this patch updates `TensorWithTFOverrideVariable.var_getattr`
   to support that.
2. it overrides the `shape` property, so this patch updates
   `TensorWithTFOverrideVariable.var_getattr` to support property as well.
3. it has calls to `torch.Tensor.size`, which returns `torch.Size`,
   which gets reconstructed in `torch.Tensor.__torch_function__`, so
   this patch adds support for calling `torch.Size(...)` on non-constant
   inputs.

Differential Revision: [D71906137](https://our.internmc.facebook.com/intern/diff/D71906137)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149484
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #149482, #149483
2025-04-02 20:57:00 +00:00
0d4dbfd9ed [dynamo] Support torch.Tensor._make_subclass and tracing through tensor subclass __new__ (#149483)
This builds off the previous patch in the stack, and fully fixes
https://github.com/huggingface/diffusers/issues/10795.

Essentially, the tensor subclass in the issue uses
`torch.Tensor._make_subclass`, which has a pretty simple shallow-copy
plus type change semantics, as far as Dynamo is concerned. So this patch
adds a polyfill for it.

As a result, this allows us to trace through many user-defined `__new__`
in tensor subclass (it's similar to how we trace through user-defined
`__new__` for `UserDefinedClassVariable`), so this patch also faithfully
traces through these `__new__` methods.

Differential Revision: [D71906139](https://our.internmc.facebook.com/intern/diff/D71906139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149483
Approved by: https://github.com/zou3519, https://github.com/mlazos
ghstack dependencies: #149482
2025-04-02 20:56:52 +00:00
33535b3eee [dynamo] Support Tensor subclass that has dynamic attributes or calls Parameter.__torch_function__ (#149482)
This fixes most of https://github.com/huggingface/diffusers/issues/10795,
except for `torch.Tensor._make_subclass`, which will be fixed in a
subsequent patch.

The relevant tensor subclass from the aforementioned issue is defined
here: fbf6b856cc/src/diffusers/quantizers/gguf/utils.py (L398-L435).

There are three things to note about the tensor subclass:
1. it calls `super().__torch_function__`, which is
   `torch._C._disabled_torch_function_impl`, so this patch updates
   `SuperVariable.call_method` to handle it (we can't do a simpler
   polyfill due to some bug with `var_getattr` raising
   `NotImplementedError`, which forgot to restore symbolic context).
2. it sets and reads attributes (`quant_type`), and
   defines new methods (`as_data`), so this patch adds support for those.
3. it has a `__init__`, which Dynamo needs to trace through in
   `TensorSubclassVariable.call_function`.

Differential Revision: [D71906140](https://our.internmc.facebook.com/intern/diff/D71906140)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149482
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-04-02 20:56:43 +00:00
85df0dc246 [dynamo] emit only 1 graph break message on unrecoverable data-dependent assert fail (#150471)
Addresses https://fb.workplace.com/groups/1075192433118967/permalink/1625299684774903/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150471
Approved by: https://github.com/jansel
2025-04-02 20:42:43 +00:00
a8f6b40e36 [inductor] skip non-trivial tiling if unbacked symints are present (#150225)
Take two of https://github.com/pytorch/pytorch/pull/149994.

This time we just skip `convert_tiling_to_3d` and `candidate_tilings` if unbacked symints are present.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150225
Approved by: https://github.com/eellison
2025-04-02 20:36:02 +00:00
03c879d59b Revert "[dynamo] Support Tensor subclass that has dynamic attributes or calls Parameter.__torch_function__ (#149482)"
This reverts commit 98453c135a7778d12ff881d8b0a717257be9fc38.

Reverted https://github.com/pytorch/pytorch/pull/149482 on behalf of https://github.com/malfet due to Broke trunk, see b03c42109c/1 ([comment](https://github.com/pytorch/pytorch/pull/149482#issuecomment-2773650522))
2025-04-02 20:30:33 +00:00
18908c8ced Revert "[dynamo] Support torch.Tensor._make_subclass and tracing through tensor subclass __new__ (#149483)"
This reverts commit 203e1d681d1a4eb7794dfaeaebfa497242dde17d.

Reverted https://github.com/pytorch/pytorch/pull/149483 on behalf of https://github.com/malfet due to Broke trunk, see b03c42109c/1 ([comment](https://github.com/pytorch/pytorch/pull/149482#issuecomment-2773650522))
2025-04-02 20:30:33 +00:00
01411c739f Revert "[dynamo] Support tensor subclass with overriden tensor methods and properties (#149484)"
This reverts commit 7e53c58687482d58461e1dd8e09f59a9daf8f7b3.

Reverted https://github.com/pytorch/pytorch/pull/149484 on behalf of https://github.com/malfet due to Broke trunk, see b03c42109c/1 ([comment](https://github.com/pytorch/pytorch/pull/149482#issuecomment-2773650522))
2025-04-02 20:30:33 +00:00
e545567340 Revert "[dynamo] Always trace into tensor subclass __torch_function__ (#149792)"
This reverts commit 238109ad3245c5485f9e83b4b02d258b09329042.

Reverted https://github.com/pytorch/pytorch/pull/149792 on behalf of https://github.com/malfet due to Broke trunk, see b03c42109c/1 ([comment](https://github.com/pytorch/pytorch/pull/149482#issuecomment-2773650522))
2025-04-02 20:30:32 +00:00
af5c1b96e2 ci: Set minimum cmake version for halide build (#150560)
This was failing due to pybind being strict about their cmake version
requirements.

This resolves errors like:
```
652.1   Compatibility with CMake < 3.5 has been removed from CMake.
652.1
652.1   Update the VERSION argument <min> value.  Or, use the <min>...<max> syntax
652.1   to tell CMake that the project requires at least <min> but has been updated
652.1   to work with policies introduced by <max> or earlier.
652.1
652.1   Or, add -DCMAKE_POLICY_VERSION_MINIMUM=3.5 to try configuring anyway.
652.1
652.1
652.1 -- Configuring incomplete, errors occurred!
```

Tested this locally with the following command:

```
./build.sh pytorch-linux-jammy-py3.12-halide -t 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-jammy-py3.12-halide:8a8989876ff1aa1d5b0e465177afebbc7a9da921
```

Closes https://github.com/pytorch/pytorch/issues/150420

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150560
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/malfet
2025-04-02 20:27:24 +00:00
b03c42109c Proactively remove CompiledTritonKernels before loading from cache/starting inductor compile (#150453)
We're still running into this issue intermittently and it's hard to debug, so I thought a more aggressive cache-clearing strategy may fix it as a stopgap until we can statically launch CUDA kernels and avoid some of this stuff

Differential Revision: [D72257973](https://our.internmc.facebook.com/intern/diff/D72257973/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150453
Approved by: https://github.com/oulgen
2025-04-02 20:08:32 +00:00
22030efb64 expect fail scan test in sigmoid (#150475)
Summary: as titled.

Test Plan: see modified test.

Differential Revision: D72271976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150475
Approved by: https://github.com/zhxchen17
2025-04-02 19:56:50 +00:00
d4298f2136 [CI] Use system nccl in build (#150226)
Install nccl in the docker image (which is already being done in some docker images), and use USE_SYSTEM_NCCL=1 in CI builds

It takes some time to build nccl and the build doesn't happen in parallel, so there's less benefit in switching to a bigger runner and using more processes

The other changes in this PR are because there is an install_cuda script and an install_cuda_aarch64 script and they both build nccl from source and define their own pins for the nccl version.  There is also a .ci/docker/nccl-cu11.txt and cu12.txt that define the pins, and this is an attempt to unify them.  Unfortunately this leads to a lot of files needing to be copied to the docker build

Generally seems to increase docker pull times by <1 min, P1768456379, but it's hard to tell what the real increase is
15761 mib -> 16221 [linux-focal-cuda11.8-py3.10-gcc9 / test (distributed](https://github.com/pytorch/pytorch/actions/runs/14114171729/job/39545500161#logs)
`jq '[.layers[].size, .config.size] | add / 1024 / 1024'`
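
For readers without `jq`, the same size computation can be sketched in Python. This is only an illustration: the manifest dictionary below is made-up sample data, and the function name `image_size_mib` is hypothetical. The only assumption taken from the expression above is that the manifest has a `layers` list and a `config` entry, each with a `size` in bytes.

```python
def image_size_mib(manifest: dict) -> float:
    """Sum all layer sizes plus the config size, then convert bytes to MiB."""
    total_bytes = sum(layer["size"] for layer in manifest["layers"])
    total_bytes += manifest["config"]["size"]
    return total_bytes / 1024 / 1024


# Made-up sample manifest, not taken from a real image.
sample_manifest = {
    "config": {"size": 2 * 1024 * 1024},
    "layers": [{"size": 10 * 1024 * 1024}, {"size": 4 * 1024 * 1024}],
}
print(image_size_mib(sample_manifest))
```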

Example 6eb3c2e282 (39520169577-box)
![image](https://github.com/user-attachments/assets/d44ef415-6e48-41ef-ac83-f19bab47560c)

TODO:
* Figure out a way to verify that nccl was built + works properly when it is expected (this time I just checked torch.distributed.is_nccl_available)
* Merge the cusparse installation scripts
* Merge the cuda installation scripts
* Either always split the nccl, cuda, and cusparse installations, or always keep them together in one bash script

distributed/test_distributed_spawn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150226
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-04-02 19:42:43 +00:00
cb4cd6166e Address Cmake update issue in windows magma builds (#150549)
1. Fixes Cmake update error: https://github.com/pytorch/pytorch/actions/runs/14223930697/job/39858632864
```
CMake Error at CMakeLists.txt:1 (cmake_minimum_required):
  Compatibility with CMake < 3.5 has been removed from CMake.

  Update the VERSION argument <min> value.  Or, use the <min>...<max> syntax
  to tell CMake that the project requires at least <min> but has been updated
  to work with policies introduced by <max> or earlier.

  Or, add -DCMAKE_POLICY_VERSION_MINIMUM=3.5 to try configuring anyway.
```
2.  Removes deprecated CUDA 12.4 build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150549
Approved by: https://github.com/clee2000
2025-04-02 19:13:44 +00:00
e62d958f02 [Inductor] Reland Merge Triton ScaledMM as epilogue to MM template #150045 (#150441)
Merges https://github.com/pytorch/pytorch/pull/150438 and https://github.com/pytorch/pytorch/pull/150045. https://github.com/pytorch/pytorch/pull/150045 was already landed, but did not include a change that makes it unable to land internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150441
Approved by: https://github.com/clee2000
2025-04-02 17:49:32 +00:00
238109ad32 [dynamo] Always trace into tensor subclass __torch_function__ (#149792)
This patch effectively ignores traceable_tensor_subclasses, allowing
Dynamo to always try tracing into the `__torch_function__` of tensor
subclass. This helps us with 2 things:
1. allowing users to directly benefit from better compilation of tensor
   subclass, by just upgrading pytorch, without having to change legacy
   library code (see earlier patches in the stack for examples).
2. potentially exposing more issues in compiling tensor subclass, so we
   can get signals and improve them.

As a consequence, this patch exposes and fixes 2 subtle bugs:
1. In `build_torch_function_fn`, we could get
   `torch._C._disabled_torch_function_impl` because we have a
   `Parameter` subclass without `__torch_function__` override or if we
   have a tensor subclass with `__torch_dispatch__` override. We graph
   break on this for now, and plan to add support -- the logic for
   simulating `torch._C._disabled_torch_function_impl` is already in
   `SuperVariable`, we just need to reuse it.
2. Sometimes we create `SyntheticLocalSource` and need to remove all the
   guards installed on it. We only removed the ones whose source
   _is_ the created synthetic source `s` and forgot about chained
   sources like `s.foo`; this showed up as
   `SYNTHETIC_LOCAL['tmp_0'].__torch_function__.__func__`.
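
The chained-source problem in item 2 can be illustrated with a small standalone sketch. This is not Dynamo's code: `Source`, `derives_from`, and `remove_guards_on` are hypothetical stand-ins, and guards are reduced to the source they are installed on. It only shows why matching on identity misses `s.foo`, while walking the chain does not.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    """Hypothetical stand-in for a source; `base` links a chained source to its parent."""

    name: str
    base: Optional["Source"] = None


def derives_from(source: Source, root: Source) -> bool:
    """Return True if `source` is `root` itself or is chained off it (e.g. `s.foo.bar`)."""
    current: Optional[Source] = source
    while current is not None:
        if current == root:
            return True
        current = current.base
    return False


def remove_guards_on(guards: list, root: Source) -> list:
    """Drop every guard whose source is `root` or is chained from `root`."""
    return [g for g in guards if not derives_from(g, root)]


s = Source("SYNTHETIC_LOCAL['tmp_0']")
chained = Source("__func__", base=Source("__torch_function__", base=s))
other = Source("L['x']")
remaining = remove_guards_on([s, chained, other], s)
print([g.name for g in remaining])
```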

Differential Revision: [D71906141](https://our.internmc.facebook.com/intern/diff/D71906141)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149792
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #149482, #149483, #149484
2025-04-02 17:05:25 +00:00
7e53c58687 [dynamo] Support tensor subclass with overriden tensor methods and properties (#149484)
This fixes most of the "torch.compile X tensor-subclass" issues
encountered in https://github.com/city96/ComfyUI-GGUF/issues/118. The
relevant tensor subclass definition is here:
298192ed60/ops.py (L18-L65).

A few things to note about the tensor subclass:
1. it overrides a lot of the `torch.Tensor` methods (e.g., `to`,
   `clone`), so this patch updates `TensorWithTFOverrideVariable.var_getattr`
   to support that.
2. it overrides the `shape` property, so this patch updates
   `TensorWithTFOverrideVariable.var_getattr` to support property as well.
3. it has calls to `torch.Tensor.size`, which returns `torch.Size`,
   which gets reconstructed in `torch.Tensor.__torch_function__`, so
   this patch adds support for calling `torch.Size(...)` on non-constant
   inputs.
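
To make items 1 and 2 concrete, here is a plain-Python sketch of resolving an overridden method versus an overridden property on a subclass. It is an assumption-laden illustration, not the logic of `TensorWithTFOverrideVariable.var_getattr`: `Base`, `Sub`, and `lookup_override` are hypothetical names, and `Base` merely stands in for `torch.Tensor`.

```python
import inspect


class Base:
    """Stand-in for the base tensor class."""

    @property
    def shape(self):
        return (2, 3)

    def clone(self):
        return "base clone"


class Sub(Base):
    """Stand-in for a subclass that overrides both a property and a method."""

    @property
    def shape(self):
        return (6,)

    def clone(self):
        return "sub clone"


def lookup_override(obj, name, base=Base):
    """Return the overridden attribute value, or None if the subclass does not override it."""
    static_attr = inspect.getattr_static(type(obj), name)
    if static_attr is inspect.getattr_static(base, name):
        return None  # inherited unchanged from the base class
    if isinstance(static_attr, property):
        return static_attr.fget(obj)  # properties are evaluated on access
    return static_attr.__get__(obj, type(obj))  # plain functions become bound methods


sub = Sub()
print(lookup_override(sub, "shape"), lookup_override(sub, "clone")())
```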

Differential Revision: [D71906137](https://our.internmc.facebook.com/intern/diff/D71906137)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149484
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #149482, #149483
2025-04-02 17:05:25 +00:00
203e1d681d [dynamo] Support torch.Tensor._make_subclass and tracing through tensor subclass __new__ (#149483)
This builds off the previous patch in the stack, and fully fixes
https://github.com/huggingface/diffusers/issues/10795.

Essentially, tensor subclass in the issue uses
`torch.Tensor._make_subclass`, which has a pretty simple shallow-copy
plus type change semantics, as far as Dynamo is concerned. So this patch
adds a polyfill for it.

As a result, this allows us to trace through many user-defined `__new__`
methods in tensor subclasses (it's similar to how we trace through user-defined
`__new__` for `UserDefinedClassVariable`), so this patch also faithfully
traces through these `__new__` methods.

Differential Revision: [D71906139](https://our.internmc.facebook.com/intern/diff/D71906139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149483
Approved by: https://github.com/zou3519, https://github.com/mlazos
ghstack dependencies: #149482
2025-04-02 17:05:19 +00:00
98453c135a [dynamo] Support Tensor subclass that has dynamic attributes or calls Parameter.__torch_function__ (#149482)
This fixes most of https://github.com/huggingface/diffusers/issues/10795,
except for `torch.Tensor._make_subclass`, which will be fixed in a
subsequent patch.

The relevant tensor subclass from the aforementioned issue is defined
here: fbf6b856cc/src/diffusers/quantizers/gguf/utils.py (L398-L435).

There are three things to note about the tensor subclass:
1. it calls `super().__torch_function__`, which is
   `torch._C._disabled_torch_function_impl`, so this patch updates
   `SuperVariable.call_method` to handle it (we can't do a simpler
   polyfill due to some bug with `var_getattr` raising
   `NotImplementedError`, which forgot to restore symbolic context).
2. it sets and reads attributes (`quant_type`), and
   defines new methods (`as_data`), so this patch adds support for those.
3. it has an `__init__`, which Dynamo needs to trace through in
   `TensorSubclassVariable.call_function`.

Differential Revision: [D71906140](https://our.internmc.facebook.com/intern/diff/D71906140)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149482
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-04-02 17:05:12 +00:00
532530be34 Revert "[Profiler] Fix Empty C Call Queue (#150370)"
This reverts commit 5734909f343ab1de44ed5ab23311d43a9c6afaed.

Reverted https://github.com/pytorch/pytorch/pull/150370 on behalf of https://github.com/clee2000 due to broke some profiler tests when building with debug asserts profiler/test_memory_profiler.py::TestMemoryProfiler::test_config_check [GH job link](https://github.com/pytorch/pytorch/actions/runs/14211763078/job/39822158330) [HUD commit link](3ac5a499dd) ([comment](https://github.com/pytorch/pytorch/pull/150370#issuecomment-2773146070))
2025-04-02 16:40:54 +00:00
f38566dfe4 [MPSInductor] Disable mm/bmm decompositions (#150541)
Disables mm/bmm decompositions.
torch.compile on MPS was speeding up stories15M (~4x) but it was making stories110M much slower.

Self-contained reproducer to demonstrate the difference (before the change; after it, the two timings should be identical)
```python
import torch
import timeit

def bench_mm(f, x, y):
    from torch.utils.benchmark import Timer
    return Timer(stmt="f(x, y); torch.mps.synchronize()",
                 globals={"x": x, "y": y, "f": f},
                 language="python", timer=timeit.default_timer).blocked_autorange()

x = torch.rand(1024, 512, device='mps')
y = torch.rand(512, 1, device='mps')

mm_c = torch.compile(torch.mm, options={"coordinate_descent_tuning": False})
mm_c_cdt = torch.compile(torch.mm, options={"coordinate_descent_tuning": True})

print(f"Compiled torch.mm perf (with cdt disabled) for 1024x512 and  512x1 matrices are {bench_mm(mm_c, x, y).median}")
print(f"Compiled torch.mm perf (with cdt enabled) for 1024x512 and  512x1 matrices are {bench_mm(mm_c_cdt, x, y).median}")
```

Disabling the inductor mm decomposition speeds up stories15M further (~6x) and speeds up stories110M (~7x).
The table below shows average tokens/sec across 5 runs on M1 Pro for stories15M and stories110M:

|                        | stories15M | stories110M |
|------------------------|------------|-------------|
| without compile         | 99.40      | 53.11       |
| compile before change   | 367.68     | 19.43       |
| compile after change    | 582.96     | 355.07      |
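
The approximate speedups quoted above can be rechecked from the table with a few lines of arithmetic. The numbers are copied from the table; the dictionary layout and the name `speedup_over_eager` are just for illustration.

```python
# Average tokens/sec copied from the table above.
tokens_per_sec = {
    "stories15M": {"eager": 99.40, "before": 367.68, "after": 582.96},
    "stories110M": {"eager": 53.11, "before": 19.43, "after": 355.07},
}


def speedup_over_eager(row: dict) -> dict:
    """Ratio of compiled throughput to eager throughput, before and after the change."""
    return {key: row[key] / row["eager"] for key in ("before", "after")}


for model, row in tokens_per_sec.items():
    ratios = speedup_over_eager(row)
    print(f"{model}: before {ratios['before']:.2f}x, after {ratios['after']:.2f}x")
```

A ratio below 1 for stories110M before the change corresponds to the slowdown described above.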

stories110M (without compile)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps
[...]
Average tokens/sec: 53.11
```

stories110M (compile before change)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile
[...]
Average tokens/sec: 19.43
```

stories110M (compile after change)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile
[...]
Average tokens/sec: 355.07
```

stories15M (without compile)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps
[...]
Average tokens/sec: 99.40
```

stories15M (compile before change)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile
[...]
Average tokens/sec: 367.68
```

stories15M (compile after change)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile
[...]
Average tokens/sec: 582.96
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150541
Approved by: https://github.com/malfet
2025-04-02 16:07:18 +00:00
8102272d8c [BE] Fix triton windows build (#150512)
Fixes #150480
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150512
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-04-02 15:48:11 +00:00
42c7c7f15f [invoke_subgraph] Filter out grad_out where fw_out requires_grad is False (#150486)
I am not sure if this is the right way.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150486
Approved by: https://github.com/zou3519
ghstack dependencies: #150082, #150450
2025-04-02 14:40:08 +00:00
82ceebce58 [inductor] Lowerings for max_pool3d (#148210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148210
Approved by: https://github.com/eellison
2025-04-02 14:13:01 +00:00
5f62d07ec6 Fix log2, PowByNatural printing (#147592)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147592
Approved by: https://github.com/eellison
2025-04-02 14:12:15 +00:00
aae36929ed Rename node.meta["arg_kwarg_vals"] to node.meta["eager_input_vals"] (#148092)
And added a comment about it. Otherwise it might be confusing

Test Plan:
- wait for CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148092
Approved by: https://github.com/eellison
ghstack dependencies: #148046, #148063, #148091
2025-04-02 13:18:04 +00:00
4d121d2b02 Implement needs_exact_strides for mutable custom operators (#148091)
Mutable custom operators get wrapped into an auto_functionalized HOP, so
we need to store the arg_kwarg_vals on the auto_functionalized HOP
itself.

When Inductor does the re-inplacing, it'll use the pattern matcher to
decompose the auto_functionalized HOP back into the original op (and
0+ other view or clone operations). The pattern matcher uses the
arg_kwarg_vals to trace the subgraph to do the decomposition, so it
ultimately sets arg_kwarg_vals on the original op's node correctly.

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148091
Approved by: https://github.com/eellison
ghstack dependencies: #148046, #148063
2025-04-02 13:18:04 +00:00
c69c3c885e Add needs_exact_strides operator tag for Inductor to force exact strides (#148063)
Inductor will force exact strides on a custom operator tagged with
needs_exact_strides. I'll make this the default in a follow-up PR.

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148063
Approved by: https://github.com/eellison
ghstack dependencies: #148046
2025-04-02 13:17:58 +00:00
c41fbb4f78 Change arg_kwarg_vals propagation strategy (#148046)
Instead of always propagating arg_kwarg_vals in _COPY_META_FIELDS, we
special-case the pattern matcher to propagate arg_kwarg_vals when
it sees triton_kernel_wrapper_functional.

The strategy is:
1) trace out the replacement graph with arg_kwarg_vals (which have accurate eager-mode metadata)
2) trace out the replacement graph with vals (which have the accurate Inductor metadata)
3) Propagate the arg_kwarg_vals from the first graph to the second.
4) Use the second graph as the replacement graph.
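
The four steps can be sketched on toy data. This is a heavily simplified illustration, not the pattern matcher's implementation: graphs are modeled as lists of dictionaries with hypothetical `target` and `meta` keys, and `propagate_eager_vals` is a made-up name.

```python
def propagate_eager_vals(eager_graph: list, inductor_graph: list) -> list:
    """Copy `arg_kwarg_vals` from the eager-traced nodes onto the matching Inductor-traced nodes."""
    if len(eager_graph) != len(inductor_graph):
        raise ValueError("both traces must produce the same sequence of nodes")
    for eager_node, inductor_node in zip(eager_graph, inductor_graph):
        if eager_node["target"] != inductor_node["target"]:
            raise ValueError("node mismatch between the two traces")
        if "arg_kwarg_vals" in eager_node["meta"]:
            inductor_node["meta"]["arg_kwarg_vals"] = eager_node["meta"]["arg_kwarg_vals"]
    return inductor_graph


# Step 1: replacement graph traced with eager-mode metadata (toy strides).
eager_trace = [{"target": "kernel", "meta": {"arg_kwarg_vals": {"stride": (1, 4)}}}]
# Step 2: the same graph traced with Inductor metadata.
inductor_trace = [{"target": "kernel", "meta": {"val": {"stride": (4, 1)}}}]
# Steps 3 and 4: propagate, then use the second graph as the replacement.
replacement = propagate_eager_vals(eager_trace, inductor_trace)
print(replacement[0]["meta"])
```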

The strategy is this because we want to extend this to handle
auto_functionalized later up in the stack.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148046
Approved by: https://github.com/eellison
2025-04-02 13:17:52 +00:00
03138733ba [AOTI] Emit Triton kernels as comment (#150188)
Summary: Emit the corresponding Triton kernel code as comment in each call_triton_ wrapper function, for easier debugging.

Differential Revision: [D72178907](https://our.internmc.facebook.com/intern/diff/D72178907)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150188
Approved by: https://github.com/yushangdi
2025-04-02 12:41:54 +00:00
75f38dfd4e cpp_wrapper: precompile a few more commonly used headers, and improve RAIIPyObject interface (#149350)
Add includes for torch.device, torch.dtype, torch.layout, and torch.memory_format to the cpp_wrapper common header, so that they get precompiled. Additionally, add move constructors and operator bool to RAIIPyObject.

Closes #142005.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149350
Approved by: https://github.com/desertfire
2025-04-02 09:54:27 +00:00
3f54b14c75 [CUDAGraph] support meta tensor (#150478)
Previously, cudagraph was skipped if the graph contained any meta tensor. However, we should not skip, since a meta tensor does not involve any actual computation. This PR fixes the issue.

### Example

```python
import torch

def foobar(x, y):
    return x * 2, y * 3

foo_c = torch.compile(mode="reduce-overhead")(foobar)
t = torch.empty((1, 16, 128, 128), device="meta")
y = torch.rand([64], device="cuda")

eager_out = foobar(t, y)

for _ in range(3):
    compiled_out = foo_c(t, y)
```

Prior to this PR, the above code led to
```
skipping cudagraphs due to multiple devices: device(type='cuda', index=0), device(type='meta')
```

With this PR, we don't skip.
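
The idea behind the change can be sketched without PyTorch. The function below is a hypothetical simplification, not Inductor's actual check: it assumes devices are given as plain strings and that the only rule is "more than one real device means skip".

```python
def should_skip_cudagraphs(devices: set) -> bool:
    """Skip only when more than one real device is involved; meta tensors hold no data."""
    real_devices = {d for d in devices if not d.startswith("meta")}
    return len(real_devices) > 1


print(should_skip_cudagraphs({"cuda:0", "meta"}))  # meta is ignored
print(should_skip_cudagraphs({"cuda:0", "cpu"}))   # two real devices
```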

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150478
Approved by: https://github.com/eellison
2025-04-02 07:21:50 +00:00
0da8127f77 Compare device name of profiler dynamically (#150396)
Compare self.use_device of torch.autograd.profiler.profiler with _get_privateuse1_backend_name(), since privateuse1 backend can be renamed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150396
Approved by: https://github.com/sraikund16
2025-04-02 06:06:06 +00:00
c65de03196 Add Any return annotation to __getattr__ methods that return a union of types. (#150204)
Adds an `Any` return type annotation to `__getattr__` methods in `torch/_ops.py` that return a union of types. Attribute access returning a union of types can cause issues downstream because consumers would need to handle all of the possible types to make the type checker happy. This doesn't seem to matter today for mypy, presumably because `Any` is always inferred when a return type annotation is missing, but it still makes explicit what mypy is already doing implicitly.
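
As a small illustration of the point about unions, consider a toy class (hypothetical, unrelated to `torch/_ops.py`) whose dynamic attributes can be either integers or strings. With `Any` as the return annotation, callers can use each attribute directly instead of narrowing a union first.

```python
from typing import Any


class Namespace:
    """Toy namespace whose dynamic attributes may be ints or strings."""

    def __init__(self) -> None:
        self._registry: dict = {"version": 3, "name": "demo"}

    def __getattr__(self, name: str) -> Any:
        # Only called when normal lookup fails. Annotating `int | str` here
        # would force every caller to narrow the type before using the value.
        try:
            return self._registry[name]
        except KeyError:
            raise AttributeError(name) from None


ns = Namespace()
print(ns.version + 1, ns.name.upper())
```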

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150204
Approved by: https://github.com/malfet
2025-04-02 05:25:07 +00:00
dee016ceb7 [MPSInductor] Add store_reduce method (#150457)
That restricts the store operation to the 0th thread, which should be much better, shouldn't it?
(Though I don't observe it in the benchmark)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150457
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #150452
2025-04-02 05:12:49 +00:00
3ac5a499dd [dynamo] add dynamo disable reasons to codebase (#150440)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150440
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #150341
2025-04-02 04:26:48 +00:00
25eff6e991 [dynamo] add reason field to torch.compiler.disable (#150341)
Implements https://github.com/pytorch/pytorch/issues/146445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150341
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-04-02 04:26:48 +00:00
063ea5d669 [AOTInductor] Modify test for Memory tracking for memory-related (#150269)
operations

Summary:
Fix the test for memory tracking. This PR does:
(1) Add tracking before and after for all memory-related operations.
Make sure the operation does indeed capture consumed memory both in CUDA and
in torch's CUDACachingAllocator.
(2) Keep track of memory being reserved by CUDACachingAllocator in
torch and its relationship with global CUDA memory consumption.

Test Plan:
This PR is adding tests.

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150269
Approved by: https://github.com/jingsh, https://github.com/chenyang78, https://github.com/desertfire
2025-04-02 04:18:18 +00:00
5734909f34 [Profiler] Fix Empty C Call Queue (#150370)
Summary:
My commandeer of https://github.com/pytorch/pytorch/pull/150102

Based on description of PR it seems that we need to add C calls for each starting python event with a callable such that when the tracing exits we will have a matching enter for any given exit. It adds some unnecessary events at worst but prevents segfaults/failures. My PR just cleans up some refcount impl and logging.

Test Plan: Ran resnet test internally. Will check CI and ask reviewers to make sure it resolves their issues.

Differential Revision: D72207570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150370
Approved by: https://github.com/aaronenyeshi
2025-04-02 02:44:50 +00:00
eqy
f09513e515 [CUDA]][SymmetricMemory] Interpret empty string as std::nullopt in rendezvous (#149793)
This is a "temporary" fix, as the current internal API requires strings at some interfaces instead of `std::optional`, and empty strings are presumably used in lieu of `nullopt`.
e.g.,
9d02b3993f/torch/csrc/distributed/c10d/intra_node_comm.cu (L49)
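
The convention described here can be shown with a Python analogue, where `None` plays the role of `std::nullopt`. The function name `empty_to_none` is hypothetical and the actual change is in C++; this only sketches the mapping.

```python
from typing import Optional


def empty_to_none(value: str) -> Optional[str]:
    """Treat an empty string as 'no value', mirroring the std::nullopt interpretation."""
    return value if value else None


print(empty_to_none(""), empty_to_none("0"))
```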

this currently breaks `test_intra_node_comm_all_reduce`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149793
Approved by: https://github.com/kwen2501, https://github.com/cyyever
2025-04-02 02:41:07 +00:00
61ebe999cc [invoke_subgraph] Do not cache fake tensors for AOTDispatcher first pass (#150450)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150450
Approved by: https://github.com/zou3519
ghstack dependencies: #150082
2025-04-02 02:31:54 +00:00
b060fedfa8 [invoke_subgraph] Support None in the fwd output (#150082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150082
Approved by: https://github.com/zou3519
2025-04-02 02:31:54 +00:00
0ae75ca2de assert on all_reduce_event only if it's not CPU device. (#150316)
Summary: For CPU based runs, `all_reduce_event` would be None since this is the result of the `all_reduce_stream.record_event()`, which does not do much other than returning None when device type is CPU.
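
A rough standalone sketch of the guarded assertion follows. `FakeStream` and `check_all_reduce_event` are hypothetical stand-ins; the only behavior taken from the summary is that recording an event yields `None` on CPU, so the assertion must be limited to non-CPU devices.

```python
from typing import Optional


class FakeStream:
    """Hypothetical stream: recording an event yields None on CPU."""

    def __init__(self, device_type: str) -> None:
        self.device_type = device_type

    def record_event(self) -> Optional[object]:
        return None if self.device_type == "cpu" else object()


def check_all_reduce_event(device_type: str, event: Optional[object]) -> None:
    """Assert that an event exists only when the device is not CPU."""
    if device_type != "cpu":
        assert event is not None, "expected an all-reduce event on non-CPU devices"


for device_type in ("cpu", "cuda"):
    event = FakeStream(device_type).record_event()
    check_all_reduce_event(device_type, event)  # passes for both device types
```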

Test Plan: CI

Differential Revision: D72176406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150316
Approved by: https://github.com/kwen2501, https://github.com/weifengpy, https://github.com/mori360
2025-04-02 01:54:35 +00:00
cyy
e872c38eb3 Remove cppcoreguidelines-pro-type-member-init_fix suppression (#148638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148638
Approved by: https://github.com/zou3519
2025-04-02 01:33:20 +00:00
c974b5322a enable torch.compile for torch._scaled_mm nvfp4 recipe (#150462)
Summary:

Updates the meta registration for `torch._scaled_mm` to work for the
nvfp4 recipe.

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k test_blockwise_nvfp4
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150462
Approved by: https://github.com/eellison
2025-04-02 01:08:40 +00:00
ee97299961 [MPS][Testing] Benchmark reduction ops (#150452)
That compares eager vs compile
On my M4Pro mini I'm getting the following now
```
[---------------------------------------------------------------------------------------------  --------------------------------------------------------------------------------------------]
                           |  eager-512x512  |  compile-512x512  |  eager-1024x1024  |  compile-1024x1024  |  eager-2048x2048  |  compile-2048x2048  |  eager-4096x4096  |  compile-4096x4096
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      sum (torch.float32)  |      121.0      |       201.5       |       130.3       |        772.3        |       179.4       |        1470.5       |        476.1      |        2980.0
      max (torch.float32)  |      154.1      |       165.9       |       198.7       |        211.6        |       344.2       |         386.9       |       1326.6      |        1345.6
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150452
Approved by: https://github.com/dcci, https://github.com/manuelcandales
2025-04-02 01:06:27 +00:00
db32093192 [ROCm][Windows] Fix torchvision build with ROCm 6.4 on windows (#150180)
Since the hipcc files and calls are restructured in HIP SDK 6.4, a case for calling hipcc.exe is added for building torchvision with HIP SDK 6.4 on Windows

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150180
Approved by: https://github.com/malfet, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-04-02 00:35:47 +00:00
d22e3d5efe [fr] Add logger config for flight record in PGNCCL (#150356)
Summary: We want to move from scuba-based direct logging to logger-config-based logging. Most of the changes are internal, but we need to change the exception to exception_msg.

Test Plan: Following https://www.internalfb.com/wiki/Server_Logging/Getting_Started_with_Logging/Onboarding_Existing_Scribe-Based_Logging_(Alpha)/ to test it.

Differential Revision: D72198171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150356
Approved by: https://github.com/fegin
2025-04-01 23:54:07 +00:00
6aea4d90fb gloo: use shared Stores (#150230)
Summary:
X-link: https://github.com/facebookincubator/gloo/pull/423

This modifies `connectFullMesh` to take in a shared_ptr<IStore> instead of a reference. This is an API breaking change but fairly easy to work around.

To have backwards compatibility in PyTorch during the commit phase we add a new ifdef `GLOO_SHARED_STORE` which can provide backwards compatibility until we update the pinned Gloo version in pytorch OSS repo.

This also adds a new `wait_get` method to `IStore` which will allow us to do a more efficient operation in PyTorch TCPStore. PyTorch's `Store::get` automatically waits so we want to make sure we can avoid waiting twice to reduce network traffic.

This change will land simultaneously in PyTorch and Gloo repos.

Test Plan:
```
buck2 test //gloo/... //caffe2/caffe2/contrib/gloo:
```

Differential Revision: D72084111

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150230
Approved by: https://github.com/fduwjj
2025-04-01 23:37:25 +00:00
4934a83347 [AMD] [TRITON] [INDUCTOR] Add tl.assume to enable bufferops on AMD (#150373)
Summary: Update the GEMM template to include the necessary `tl.assume` annotations to enable bufferops with AMD.

Test Plan: Tested manually with a simple matmul run with torch.compile(f, mode="max-autotune") and the environment variables TRITON_ALWAYS_COMPILE=1 AMDGCN_ENABLE_DUMP=1 AMDGCN_USE_BUFFER_OPS=1.
Inspecting the generated AMDGCN, all loads/stores use bufferops.
Note: Since inductor is loading constants for many of the shape values, assumes are generally not needed for the stride/shape information, but pid calculations are generally a gap in Triton's inference capability.

Differential Revision: D71922698

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150373
Approved by: https://github.com/eellison
2025-04-01 23:29:39 +00:00
60fe0922f6 [pytree] Register normal class to register_dataclass (#147752)
Fixes https://github.com/pytorch/pytorch/pull/147532#discussion_r1964365330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147752
Approved by: https://github.com/zou3519
2025-04-01 23:28:20 +00:00
203a27e0ce Revert "[cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)"
This reverts commit 8f7fbe3d7d2cd301df48fcbe8a14f8aa1a9c1e48.

Reverted https://github.com/pytorch/pytorch/pull/145130 on behalf of https://github.com/clee2000 due to reverted internally by D72140190 ([comment](https://github.com/pytorch/pytorch/pull/145130#issuecomment-2770874244))
2025-04-01 23:07:28 +00:00
80ab233786 [Inductor] Hide reinplace_fsdp_all_gather pass behind skip_fsdp_hooks config (#150436)
The `reinplace_fsdp_all_gather` pass is currently only for Traceable FSDP2 and doesn't work together with SimpleFSDP. We should hide the pass behind `skip_fsdp_hooks` config which makes it only apply to Traceable FSDP2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150436
Approved by: https://github.com/BoyuanFeng
2025-04-01 22:56:06 +00:00
9458460211 Revert "if blaslt fails, fall back to blas (#150147)"
This reverts commit 65139eb050817329ac8e541c377b2be3bb5ffe14.

Reverted https://github.com/pytorch/pytorch/pull/150147 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/150147#issuecomment-2770847320))
2025-04-01 22:52:22 +00:00
76e1b3ba4c Revert "[ROCm] use correct workspace for hipblaslt, silence warning (#150227)"
This reverts commit c158eac0de2afe38d68952ca401888ed5777f6b0.

Reverted https://github.com/pytorch/pytorch/pull/150227 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/150227#issuecomment-2770827563))
2025-04-01 22:31:13 +00:00
629c1bd2dd [ez][inductor][tests] Skip triton backend only for CPU tests (#150343)
Motivation: to unblock https://github.com/pytorch/pytorch/pull/148622

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150343
Approved by: https://github.com/chenyang78
2025-04-01 22:03:48 +00:00
b70d105c77 infer dynamic shapes through additional inputs (#150144)
Summary:
Instead of explicitly specifying dynamic shapes, it is possible to infer them from additional example inputs. Together with the example inputs provided to export, we can basically make any varying dim dynamic and keep any fixed dim static. This should be useful for prod scenarios that have access to tests and/or profiling data, yet are somewhat removed from the model authoring process.

However this alone is not satisfactory: the exported program by design has only one graph, representing one path through the model, and we cannot necessarily guarantee that this graph works for the additional example inputs because different guards might have been created if we had exported with them instead (corresponding to different traced paths). However, checking that the additional example inputs satisfy the guards created by the original export should be sufficient for generalization.

Now, while we don't preserve all guards in the exported program, we do check a subset of them as part of input matching. So we add a verification step at the end of export when such additional example inputs are provided. This should be enough for now.
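
The inference rule from the first paragraph ("make any varying dim dynamic and keep any fixed dim static") can be sketched in a few lines. This is a toy illustration on shape tuples, not the export API: `infer_dynamic_dims` is a hypothetical name, and the guard verification described above is not modeled.

```python
def infer_dynamic_dims(example_shapes: list) -> list:
    """Mark a dim dynamic if its size varies across the example inputs, static otherwise."""
    first = example_shapes[0]
    if any(len(shape) != len(first) for shape in example_shapes):
        raise ValueError("all example inputs must have the same rank")
    return [
        len({shape[dim] for shape in example_shapes}) > 1
        for dim in range(len(first))
    ]


# The original example input plus two additional ones: the batch dim varies, the feature dim does not.
print(infer_dynamic_dims([(4, 16), (8, 16), (2, 16)]))
```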

Test Plan: added test (positive and negative cases)

Differential Revision: D72001771

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150144
Approved by: https://github.com/bobrenjc93
2025-04-01 21:13:39 +00:00
0d44a8aea1 [Hierarchical Compile] Apply deduplication after output node creation (#150306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150306
Approved by: https://github.com/anijain2305
ghstack dependencies: #150303, #150304, #150305
2025-04-01 20:54:18 +00:00
8740ffa760 [Hierarchical Compile] Add cycle detection to graph region expansion (#150305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150305
Approved by: https://github.com/anijain2305
ghstack dependencies: #150303, #150304
2025-04-01 20:54:18 +00:00
a2300aff94 [Hierarchical Compile] Add cycle detection function for debug (#150304)
Remove print

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150304
Approved by: https://github.com/anijain2305
ghstack dependencies: #150303
2025-04-01 20:54:10 +00:00
99fd96c10b [Hierarchical Compile] Remove spammy debug log (#150303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150303
Approved by: https://github.com/williamwen42
2025-04-01 20:54:03 +00:00
295162ec3a Smoke Test - disable pypi package validation for binaries that package cuda libs (#150194)
Smoke Test - disable pypi package validation for binaries that package cuda libs. These binaries do not install packages via pypi.
Should Resolve this from `linux-binary-manywheel / manywheel-py3_11-cuda12_6-full-test / test`:
```
Traceback (most recent call last):
  File "/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 468, in <module>
    main()
  File "/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 462, in main
    smoke_test_cuda(
  File "/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 274, in smoke_test_cuda
    compare_pypi_to_torch_versions(
  File "/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 220, in compare_pypi_to_torch_versions
    raise RuntimeError(f"Can't find {package} in PyPI for Torch: {torch_version}")
RuntimeError: Can't find cudnn in PyPI for Torch: 9.5.1
```
Link: https://github.com/pytorch/pytorch/actions/runs/14101221665/job/39505479587#step:15:982
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150194
Approved by: https://github.com/ZainRizvi
2025-04-01 19:18:44 +00:00
d2ad9aa2f2 [dtensor][tp] add a ParallelStyle PrepareModuleInputOutput (#150372)
Needed this class because `parallelize_module` takes a dict, which doesn't allow `PrepareModuleInput` and `PrepareModuleOutput` to be applied at the same time.

The `PrepareModuleInputOutput` in this PR initializes two variables `prepare_module_input` and `prepare_module_output` and uses them to process module / inputs / outputs.

I had another implementation which put all code in `PrepareModuleInputOutput` and let `PrepareModuleInput` and `PrepareModuleOutput` inherit the monolithic `PrepareModuleInputOutput`. But it is
1. less clean
2. conceptually abusing inheritance because `PrepareModuleInput` shouldn't be able to access class methods of `PrepareModuleOutput` and vice versa
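
A rough sketch of the composition approach, using plain Python stand-ins. The classes `PrepareInput`, `PrepareOutput` and `PrepareInputOutput` below are hypothetical simplifications (they transform numbers, not DTensors) and only show why holding two members avoids the inheritance problem.

```python
class PrepareInput:
    """Hypothetical stand-in for an input-preparing style."""

    def __init__(self, scale):
        self.scale = scale

    def apply(self, x):
        return x * self.scale


class PrepareOutput:
    """Hypothetical stand-in for an output-preparing style."""

    def __init__(self, offset):
        self.offset = offset

    def apply(self, y):
        return y + self.offset


class PrepareInputOutput:
    """Composes the two styles instead of inheriting from either."""

    def __init__(self, scale, offset):
        # Two member objects, mirroring `prepare_module_input` and
        # `prepare_module_output` in the description above.
        self.prepare_module_input = PrepareInput(scale)
        self.prepare_module_output = PrepareOutput(offset)

    def wrap(self, fn):
        def wrapped(x):
            prepared = self.prepare_module_input.apply(x)
            return self.prepare_module_output.apply(fn(prepared))

        return wrapped


style = PrepareInputOutput(scale=2, offset=1)
square = style.wrap(lambda v: v * v)
print(square(3))  # (3 * 2) ** 2 + 1 = 37
```

With this layout neither single-purpose class can reach the other's methods, which is the inheritance concern raised in point 2.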

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150372
Approved by: https://github.com/wanchaol
2025-04-01 19:15:43 +00:00
5d6ac2dced [dtensor] add op support for select_backward and slice_backward (#150357)
Inheriting and rebasing @awgu 's PR https://github.com/pytorch/pytorch/pull/149071
- fixed an issue for `select_backward` and an issue for `slice_backward`
- removed `_experimental_ops.py` as it becomes empty

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150357
Approved by: https://github.com/awgu, https://github.com/XilunWu
2025-04-01 19:15:25 +00:00
a37afd23fa [custom_ops][perf] Move expensive pytree traversals of tensors to C++ (#148555)
(benchmark for 1 call)

Before:
```
└─ $ python ~/task_custom_ops_perf/test_custom_ops_perf_repro.py
DO_BENCH mutate: 77.72445678710938 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/mutate.json
DO_BENCH no_mutate: 64.61143493652344 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/no_mutate.json
DO_BENCH direct_mutate: 11.682510375976562 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_mutate.json
DO_BENCH direct_no_mutate: 18.596649169921875 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_no_mutate.json
```

After:
```
└─ $ python ~/task_custom_ops_perf/test_custom_ops_perf_repro.py
DO_BENCH mutate: 47.6837158203125 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/mutate.json
DO_BENCH no_mutate: 31.709671020507812 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/no_mutate.json
DO_BENCH direct_mutate: 10.967254638671875 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_mutate.json
DO_BENCH direct_no_mutate: 10.728836059570312 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_no_mutate.json
```
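
For a quick read of the numbers above, the per-case speedup can be computed as before/after. This is only arithmetic on the figures quoted in this message; the dictionary names are illustrative.

```python
# Timings in microseconds, copied from the benchmark output above.
before = {
    "mutate": 77.72445678710938,
    "no_mutate": 64.61143493652344,
    "direct_mutate": 11.682510375976562,
    "direct_no_mutate": 18.596649169921875,
}
after = {
    "mutate": 47.6837158203125,
    "no_mutate": 31.709671020507812,
    "direct_mutate": 10.967254638671875,
    "direct_no_mutate": 10.728836059570312,
}

# Speedup factor per benchmark case (values above 1.0 mean "after" is faster).
speedups = {name: before[name] / after[name] for name in before}
for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```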

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148555
Approved by: https://github.com/zou3519
2025-04-01 18:45:48 +00:00
78300c8205 [ROCm] update test buffer fudge factor for hipblaslt (#150348)
The default workspace for hipblaslt is larger than for cublas/cublaslt which requires a slight increase to the buffer needed.

Forward-fix for #150227 that broke ROCm distributed tests but wasn't part of initial CI signal.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150348
Approved by: https://github.com/jeffdaily
2025-04-01 18:31:25 +00:00
37ebb0b56a [inductor] Fix inductor windows linker error (#150256)
Fixes #149889

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150256
Approved by: https://github.com/anijain2305, https://github.com/eellison
2025-04-01 18:30:55 +00:00
15dbad2115 Update torch.compile issue template (#150192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150192
Approved by: https://github.com/malfet
ghstack dependencies: #149947
2025-04-01 18:16:16 +00:00
f04cf13bdd Revert "Merge Triton ScaledMM as epilogue to MM template (#150045)"
This reverts commit 981048854da154eae8ff0bd439e72e1256ae00da.

Reverted https://github.com/pytorch/pytorch/pull/150045 on behalf of https://github.com/PaulZhang12 due to Need to add PR 150415 fixes for internal merge ([comment](https://github.com/pytorch/pytorch/pull/150045#issuecomment-2770252452))
2025-04-01 17:54:28 +00:00
b0c560ef2a [dynamo][hooks] use wrap_top_frame config for functions (#150209)
When torch.compile is applied to a module via `mod.compile(...)`, it's equivalent to `torch.compile(mod._call_impl)` which takes a different path than `OptimizedModule`. This PR ensures that the `wrap_top_frame` config can also take effect for the `torch.compile(mod._call_impl)` use case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150209
Approved by: https://github.com/anijain2305
2025-04-01 17:41:23 +00:00
48af2cdd27 [BE] Move all lint runner to 24.04 (#150427)
As Ubuntu-20 reached EOL on Apr 1st, see https://github.com/actions/runner-images/issues/11101
This forces older python version to be 3.8
Delete all linux-20.04 runners from the lintrunner.yml
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150427
Approved by: https://github.com/seemethere
2025-04-01 17:33:15 +00:00
3b0cd9b542 [Quant][PT2E] add a lowering pass for x86 backend (#149708)
**Summary**
This PR adds a lowering pass for x86 backend
- Patterns of `dequantize -> conv/linear (-> quantize)` are fused to corresponding quantized onednn ops.
- Weights are prepacked ahead of time.
- Post ops of conv/linear are fused if supported.
- The pass returns a `GraphModule` with the modifications mentioned above.

**Test plan**
```
pytest test/quantization/pt2e/test_x86inductor_quantizer.py -k test_lowering_to_x86
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149708
Approved by: https://github.com/jerryzh168, https://github.com/leslie-fang-intel
2025-04-01 17:32:41 +00:00
783f045c4f [ez] Remove dead lite interpreter CI code (#150424)
There are no lite-interpreter build environments in CI

I assume every mac build is arm64
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150424
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-04-01 17:14:32 +00:00
a17ee8181a [CI] Fix log artifact not containing test logs attempt 2 (#150234)
Take two of https://github.com/pytorch/pytorch/pull/149577 since it didn't work
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150234
Approved by: https://github.com/malfet, https://github.com/seemethere
2025-04-01 17:13:58 +00:00
f94ac263af [MPSInductor] Fix neg for unsigned types (#150412)
By more-or-less copy-n-pasting the fix from https://github.com/pytorch/pytorch/pull/94035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150412
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #150382, #150386
2025-04-01 16:52:41 +00:00
ae74ef9d53 Set proper LD_LIBRARY_PATH on Linux in nightly venv in nightly pull tool (#143262)
Before this change:

```console
$ make setup-env-cuda PYTHON="${HOMEBREW_PREFIX}/bin/python3.12"
$ source venv/bin/activate
$ python3 -c 'import torch'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/PanXuehai/Projects/pytorch/torch/__init__.py", line 379, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: libcudnn.so.9: cannot open shared object file: No such file or directory
```

This PR adds `site-packages/nvidia/**/lib` to `LD_LIBRARY_PATH` in the `venv/bin/activate` script so that NVIDIA PyPI packages can be loaded correctly.
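
As an illustration of what the glob amounts to, here is a small Python sketch that collects `nvidia/**/lib` directories and prepends them to a search path. The actual change lives in the shell `activate` script; the function names here are hypothetical and the directory layout is created in a temporary folder just for the demo.

```python
import os
import tempfile
from pathlib import Path


def nvidia_lib_dirs(site_packages: Path) -> list:
    """List every `lib` directory found under `site_packages/nvidia`."""
    return sorted(str(p) for p in site_packages.glob("nvidia/**/lib") if p.is_dir())


def extended_ld_library_path(site_packages: Path, current: str = "") -> str:
    """Prepend the NVIDIA lib directories to an existing path string."""
    parts = nvidia_lib_dirs(site_packages)
    if current:
        parts.append(current)
    return os.pathsep.join(parts)


# Demo with a fake site-packages tree (assumed layout, for illustration only).
with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp)
    (site / "nvidia" / "cudnn" / "lib").mkdir(parents=True)
    (site / "nvidia" / "cublas" / "lib").mkdir(parents=True)
    demo_value = extended_ld_library_path(site, "/usr/lib")

print(demo_value)
```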

See also:

- #141837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143262
Approved by: https://github.com/malfet
2025-04-01 16:51:02 +00:00
a19b667bca [ROCm] Update CUDAPluggableAllocator.h (#1984) (#150010)
Altering the flag to use the correct streamType in CUDAPluggableAllocator class for ROCm gpu. The flag TORCH_HIP_VERSION does not work for ROCm as intended. This flag is replaced with USE_ROCM. This is impacting Distributed Fused Adam in Rocm/APEX when using nccl_ub feature. This has been tested with rocm/apex.

See PR https://github.com/ROCm/apex/pull/184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150010
Approved by: https://github.com/jeffdaily
2025-04-01 16:49:03 +00:00
35c45a4a31 [Reland] Launch kernel on current stream & remove record_stream entirely (#150398)
Relanding #148590 due to merge conflict.

This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up on the same line as compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Squashed contents:

* [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operation
- it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective
Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs GPU kernel time: 160us).
This diff:
- async=False [default], will use current-stream as nccl-stream and avoid the stream-sync overhead
- async=True, will retain existing logic: create new nccl-stream, let it wait on current-stream to ensure tensors are ready
- pass down async from c10d down to NCCL-PG
This helps shave off 50% of the CPU overhead **(70us -> 35us)**, which reduces total CPU/GPU time from **230us to 195us (by 15%)**

* [PGNCCL] Make avoid-record-stream default

* [c10d] Add asyncOp argument to Ops

* Change python side wait

* Pass asyncOp at ProcessGroup level

* Watchdog unstashing tensors as a safety net

* Stash tensors for reduce_scatter_v and all_gather_v
Pull Request approved: https://github.com/pytorch/pytorch/pull/149753

* [c10d] Move unstashing from watchdog to main thread
Pull Request approved: https://github.com/pytorch/pytorch/pull/150079

* [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation
Pull Request approved: https://github.com/pytorch/pytorch/pull/150130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150398
Approved by: https://github.com/atalman
2025-04-01 16:46:07 +00:00
7382654ebc Update ExecuTorch pin to latest viable/strict 3/28/2025 (#150308)
From latest viable/strict: https://hud.pytorch.org/hud/pytorch/executorch/viable%2Fstrict/1?per_page=50

Fixes https://github.com/pytorch/pytorch/issues/144480

This commit has important CI stability fixes, such as https://github.com/pytorch/executorch/pull/9561 and https://github.com/pytorch/executorch/pull/9634
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150308
Approved by: https://github.com/jathu, https://github.com/malfet
2025-04-01 16:30:09 +00:00
428234bc28 [MPSInductor] torch.complex128 is unsupported on MPS (#150386)
Same as torch.float64

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150386
Approved by: https://github.com/dcci
ghstack dependencies: #150382
2025-04-01 15:19:10 +00:00
1c6e88eb03 [MPS] Test bf16 perf of few unary and binary ops (#150382)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150382
Approved by: https://github.com/Skylion007
2025-04-01 13:58:20 +00:00
0d96c38b76 [AOTI] Skip test_buffer_mutation_and_force_mmap_weights for fbcode (#150340)
Summary: Skip due to an older ideep version

Differential Revision: D72190746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150340
Approved by: https://github.com/yushangdi
2025-04-01 13:24:21 +00:00
84c21d2147 Enable SVE ACLE implementation for tanH Aten op for FP32 dType. (#143741)
In deep learning models, the tanh (hyperbolic tangent) function is a widely used activation function, primarily in feedforward networks, recurrent neural networks (RNNs), and various other architectures.

Also, the tanh (hyperbolic tangent) function is commonly used in **Physics-Informed Neural Networks (PINNs).** PINNs are a class of machine learning models designed to solve partial differential equations (PDEs) by incorporating the governing physics directly into the loss function, along with data-driven terms.

In PINNs, activation functions like tanh are used in the neural network architecture to enable the model to learn complex mappings between inputs (such as spatial and temporal coordinates) and outputs (such as field variables).

**Operator: tanh()**
**Current Implementation in OSS in ATen Backend:**
**SVE Flow:** Uses SVE sleef when available else std implementation.

**With this PR :**
**SVE Flow:** Uses SVE ACLE implementation. (Faster Implementation)

**Here are the performance improvements.**
**Single core perf numbers:**
![image](https://github.com/user-attachments/assets/c2f4bcb6-11bc-4af1-b5eb-278a4cc4a69d)

**Metric:**  CPU time avg time per iteration (In ms)

As you can see with both gcc and clang compilers, we see a significant performance gain with SVE ACLE implementation over current OSS Implementation (Sleef) and also Neon.

**Hardware:** m7g.8xlarge (Graviton 3 Instance)

**Script used in benchmarking:**
```python
import os
#os.environ["ATEN_CPU_CAPABILITY"] = "default"
os.environ["ATEN_CPU_CAPABILITY"] = "sve256"

import torch
import torch.nn as nn

#Set the random seed for reproducibility
torch.manual_seed(1)

#Create a tensor of shape (8521, 50)
x = torch.randn(8521, 50)

for i in range(10):
    output = x.tanh()

#Perform the tanh operation 1000 times and profile the performance
print("### CPU tanh")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(1000):
        output = x.tanh()

#Print the profiling results sorted by self CPU time
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

#Optionally print the final output (if needed, uncomment the following line)
print(output)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143741
Approved by: https://github.com/malfet
2025-04-01 11:54:58 +00:00
bf4814eb6a [Intel GPU] Allow XPU backend in Quantize operators (#150288)
This modification is to support torch.quantize_per_channel() on XPU, otherwise it will cause a segmentation fault.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150288
Approved by: https://github.com/jerryzh168, https://github.com/guangyey
2025-04-01 11:27:26 +00:00
a10b765bf1 [pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)
Changes in this PR:

1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence.
2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types like `namedtuple` for Named Tuple types.
3. Change `is_namedtuple` to treat subclasses of a namedtuple class as namedtuples. Before this PR, only classes created directly by `collections.namedtuple` or `typing.NamedTuple` counted as namedtuple classes, while their subclasses did not. This PR makes `is_namedtuple` return true for subclasses of a namedtuple class.
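
The checks described above can be approximated in plain Python. This is a sketch under assumptions, not the code from this PR: the helper names carry a `_sketch` suffix to mark them as hypothetical, and the structseq test relies on the `n_fields`, `n_sequence_fields` and `n_unnamed_fields` attributes that CPython struct sequences such as `time.struct_time` expose.

```python
import collections
import time


def is_namedtuple_class_sketch(cls) -> bool:
    """Duck-typed check that also accepts subclasses of a namedtuple class."""
    return (
        isinstance(cls, type)
        and issubclass(cls, tuple)
        and isinstance(getattr(cls, "_fields", None), tuple)
        and all(isinstance(f, str) for f in cls._fields)
        and callable(getattr(cls, "_make", None))
        and callable(getattr(cls, "_asdict", None))
    )


def is_structseq_class_sketch(cls) -> bool:
    """Check for a CPython struct sequence type such as time.struct_time."""
    return (
        isinstance(cls, type)
        and issubclass(cls, tuple)
        and cls.__bases__ == (tuple,)
        and all(
            isinstance(getattr(cls, name, None), int)
            for name in ("n_fields", "n_sequence_fields", "n_unnamed_fields")
        )
    )


Point = collections.namedtuple("Point", ["x", "y"])


class LabeledPoint(Point):
    """Subclass of a namedtuple class; still reports as a namedtuple here."""


print(is_namedtuple_class_sketch(Point))           # True
print(is_namedtuple_class_sketch(LabeledPoint))    # True
print(is_namedtuple_class_sketch(tuple))           # False
print(is_structseq_class_sketch(time.struct_time)) # True
print(is_structseq_class_sketch(Point))            # False
```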

Resolves #75982. New tests are included in this PR.

- #75982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
2025-04-01 10:40:43 +00:00
48e9ffc873 Unify on dynamo_compile as the overall wait counter (#150293)
Summary:
dynamo_compile for the most part has been accounting for compile time except autotuning.

all_compilation_types had earlier been injected on fx_codegen_and_compile, which was incorrect.

Add autotuning to dynamo_compile and deprecate the all_compilation_types counter.

Differential Revision: D72145447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150293
Approved by: https://github.com/masnesral, https://github.com/jamesjwu
2025-04-01 08:55:51 +00:00
36f2d0aaba Add "xpu" to __all__ for torch/version.py (#149695)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149695
Approved by: https://github.com/desertfire, https://github.com/guangyey
2025-04-01 08:44:51 +00:00
1700599266 Add one_shot_all_reduce_copy to allow non-symm-mem allocated tensors to be reduced (#150129)
Per title, we want to be able to use it even if inputs are not registered. Separate copy would add latency, and one-shot is all about the lowest possible latency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150129
Approved by: https://github.com/xw285cornell
2025-04-01 05:36:43 +00:00
414b9ae016 enable out variant of 2-shot reduction (#150153)
Per title, this version uses symm mem input both as input source and as a work buffer, so input is modified after the end (similar to what fbgemm car reduction does). It is intended to be wrapped in an op that would first copy the real inputs to symm mem buffers that wouldn't be exposed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150153
Approved by: https://github.com/xw285cornell
2025-04-01 05:36:04 +00:00
7e7e5698cc Suppress more warnings (#149833)
Differential Revision: [D71702307](https://our.internmc.facebook.com/intern/diff/D71702307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149833
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-04-01 05:33:04 +00:00
790d459f85 [dynamo] add error message for unsupported LOAD_BUILD_CLASS (#150323)
Improved error message for https://github.com/pytorch/pytorch/issues/128942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150323
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-04-01 05:03:50 +00:00
ce52674b76 [Doc] Update CMAKE_PREFIX_PATH for XPU windows README (#148863)
We found that `pip install cmake` and `conda install cmake` behave differently.
The reason is that the pip-installed one doesn't find the corresponding libs under the conda env, so we need to set `CMAKE_PREFIX_PATH` for alignment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148863
Approved by: https://github.com/CuiYifeng, https://github.com/malfet

Co-authored-by: Cui, Yifeng <yifeng.cui@intel.com>
2025-04-01 04:43:11 +00:00
31634b8c6a [fr] Added protection against missing stack frames in fr cont. (#150133)
Summary: Previously we had D70358287, which didn't fully resolve the issue.

Test Plan:
# FR
`buck2 run @//mode/opt //caffe2/fb/flight_recorder:fr_trace -- --mast_job_id f710320638-TrainingApplication --mast_job_version 0 --mast_job_attempt 0 --bucket tlcm_log_blob --world_size 128 --dump_file_name_offset 0 --allow-incomplete-ranks`
Confirm no error
# FR analyzer
`buck2 run @//mode/opt //investigations/dr_patternson/analyzers/ai_observability:ai_observability-all-analyzers-cli -- flight_recorder_analyzer --mast_job_name f710320638-TrainingApplication --mast_job_version 0 --mast_job_attempt 0`
Confirm no error

Differential Revision: D71998980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150133
Approved by: https://github.com/fduwjj
2025-04-01 03:07:59 +00:00
827b730f4e [CI] Skip test_copy_large_tensor on M2-15 runners (#150377)
They have more than 12Gb of memory, but running this test may still cause OOM in CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150377
Approved by: https://github.com/atalman
2025-04-01 02:33:43 +00:00
6470b373c1 torch.backends.mkldnn.flags() CM should not warn (#150358)
By returning `None` rather than `False` from `THPModule_allowTF32OneDNN` when USE_XPU is not defined

Added regression test

Fixes https://github.com/pytorch/pytorch/issues/149829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150358
Approved by: https://github.com/atalman
2025-04-01 01:33:40 +00:00
5cb5675f13 [Inductor] optimize the heuristics of parallel reduction (#149614)
Fix https://github.com/pytorch/pytorch/issues/148639.

Summary:
Optimize the heuristics of parallel reduction: When the number of steps of the first inner loop beyond the maximum parallel depth is much larger than the number of steps of all outer loops within the maximum parallel depth, change the starting depth of parallelism to the first inner loop and recalculate the maximum parallel depth. I ran the Inductor benchmark with this PR on CPU. A timm model poolformer_m36 BF16 has about 25% performance improvement, and no performance regression is seen.
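
The decision rule in the summary can be sketched roughly as follows. This is not Inductor's implementation: the function name `choose_parallel_start`, the `ratio` threshold standing in for "much larger", and its default value are all assumptions made for illustration, and the recalculation of the maximum parallel depth is left out.

```python
from math import prod


def choose_parallel_start(loop_steps, max_parallel_depth, ratio=8):
    """Pick the loop depth at which parallelism should start.

    loop_steps: step count of each nested loop, outermost first.
    max_parallel_depth: number of outer loops eligible for parallelism.
    ratio: hypothetical threshold for "much larger" (an assumption).
    """
    # Total work available in the outer loops within the parallel depth.
    outer_steps = prod(loop_steps[:max_parallel_depth])
    if max_parallel_depth < len(loop_steps):
        # First inner loop just beyond the maximum parallel depth.
        inner_steps = loop_steps[max_parallel_depth]
        if inner_steps >= ratio * outer_steps:
            # Too little outer work: start parallelism at the inner loop.
            return max_parallel_depth
    return 0


print(choose_parallel_start([2, 4096], max_parallel_depth=1))  # 1
print(choose_parallel_start([64, 128], max_parallel_depth=1))  # 0
```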

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149614
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-04-01 01:31:00 +00:00
0f12951fc2 [Intel gpu] always set deterministic for xpu accuracy test (#149028)
On Intel Max 1550, models like Super_SloMo can actually pass the accuracy test after setting deterministic, because we do not use atomics in upsampling bilinear backward in some cases when running on XPU. Furthermore, I guess the only reason not to set deterministic on these models is to avoid errors. We should use warn_only = True.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149028
Approved by: https://github.com/guangyey, https://github.com/desertfire

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-04-01 01:00:11 +00:00
7ab8532cf1 [BE] Get rid of cross-compile and x86 build options for Mac (#150362)
As both cross-compilation and x86 builds have been removed a while back

Remove stale TODO about building with OpenMP support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150362
Approved by: https://github.com/atalman, https://github.com/clee2000
2025-04-01 00:45:24 +00:00
4ce0b959ff Add a warning when a tensor with requires_grad=True is converted to a scalar (#143261)
Fixes #143071

Operations performed on tensors with `requires_grad=True` such as
```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
```
and
```python
x = torch.tensor(2.0, requires_grad=True)
y = torch.pow(x,3)
```
are valid operations.

While an operation using `numpy` like
```python
import numpy as np

x = torch.tensor(2.0, requires_grad=True)
y = np.pow(x,3)
# > RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
```
leads to an error.

However, an operation that uses `math` like
```python
import math

x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
```
does not cause an error, and `y` is no longer a tensor with a gradient!

This represents a [footgun](https://en.wiktionary.org/wiki/footgun#Noun) for some users, like myself when training small, custom, non-neural network models.

To prevent future undesired behavior, I added a warning when converting tensors with `requires_grad=True` to scalars. Now, when using `math.pow` on a `tensor`, we get a single warning with:
```python
x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
# > UserWarning: Converting a tensor with requires_grad=True to a scalar may lead to unexpected behavior.
# Consider using tensor.detach() first.
```
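
To see why `math.pow` silently drops the gradient, note that it converts its arguments through `__float__`. The sketch below mimics that with a hypothetical `ScalarLike` class (not a torch type) and shows where such a warning could be raised; the warning text is copied from the example above.

```python
import math
import warnings


class ScalarLike:
    """Hypothetical stand-in for a 0-dim tensor; not a torch API."""

    def __init__(self, value, requires_grad=False):
        self.value = value
        self.requires_grad = requires_grad

    def __float__(self):
        # math.pow calls this hook, which is where gradient tracking is lost.
        if self.requires_grad:
            warnings.warn(
                "Converting a tensor with requires_grad=True to a scalar "
                "may lead to unexpected behavior.",
                UserWarning,
                stacklevel=2,
            )
        return float(self.value)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    y = math.pow(ScalarLike(2.0, requires_grad=True), 3)

print(y)            # 8.0, a plain Python float
print(len(caught))  # 1
```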

Please let me know if you have any questions 👍
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143261
Approved by: https://github.com/malfet

Co-authored-by: albanD <desmaison.alban@gmail.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-01 00:42:46 +00:00
49b7d0d84d [ROCm] Enable more inductor UTs (#149513)
Primarily enable inductor fp8 tests, also enable other inductor tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149513
Approved by: https://github.com/jeffdaily
2025-04-01 00:30:36 +00:00
c75dac5f5c Fix typo (#150363)
Fixes https://github.com/pytorch/pytorch/issues/150339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150363
Approved by: https://github.com/atalman, https://github.com/kwen2501
2025-03-31 23:58:37 +00:00
b48505a8a1 [MPS] Add support for hermite_polynomial_h. (#150279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150279
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-03-31 23:30:19 +00:00
a2070e2fd5 [AOTInductor] Free tensors in test (#150274)
Summary:
This PR frees tensors that were new-ed within the test itself to prevent
memory leaks.

Test Plan:
Fixing tests itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150274
Approved by: https://github.com/chenyang78
2025-03-31 23:28:13 +00:00
982a7f7db0 [cachinghostallocator] remove the check on cudaHostRegister path (#150070)
Summary:
In the cudaHostAlloc path, the flag we use is `cudaHostAllocDefault` [0], which doesn't really have this strict enforcement (the devicePtr retrieved from `cudaHostGetDevicePointer()` points to the same address as the hostPtr) according to the guide [1]. This diff removes the check so that the host register path works for ROCm.

[0]6aca002d82/aten/src/ATen/cuda/CachingHostAllocator.cpp (L97)
[1] https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gb65da58f444e7230d3322b6126bb4902

Test Plan: test_pinned_memory_with_cudaregister tests

Differential Revision: D71932562

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150070
Approved by: https://github.com/jeffdaily
2025-03-31 23:23:05 +00:00
981048854d Merge Triton ScaledMM as epilogue to MM template (#150045)
Previously, scaled_mm's (FP8 matmul) Triton lowering for inductor was in a separate template. This PR consolidates that lowering into the mm template, with an added epilogue to deal with multiplying the scales. This paves the way for future scaled variants of BMM, Grouped GEMM in inductor.

Currently, there is still a separate template for TMA+persistent version of scaled_mm. The current mm lowering has a separate template for TMA + Persistent version. Will hopefully consolidate the extra scaled_mm TMA+persistent template when the consolidation for the mm template is done.
TODO: Consolidate TMA+Persistent logic into 1 template and remove separate scaled_mm TMA template

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150045
Approved by: https://github.com/drisspg
2025-03-31 23:20:14 +00:00
91666eef60 Update gloo submodule (#150320)
That updates its CMake minimum version(via https://github.com/facebookincubator/gloo/pull/424 ) and removes cmake-4.0.0 workarounds for gloo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150320
Approved by: https://github.com/atalman
2025-03-31 22:40:27 +00:00
1526ff955e Revert "Add a warning when a tensor with requires_grad=True is converted to a scalar (#143261)"
This reverts commit 515b45e5693dbf9dd58d8472806cbe5f49e43074.

Reverted https://github.com/pytorch/pytorch/pull/143261 on behalf of https://github.com/clee2000 due to failing internal tests D72135661 ([comment](https://github.com/pytorch/pytorch/pull/143261#issuecomment-2767531682))
2025-03-31 22:19:08 +00:00
423e4a4568 [ROCm] cmake 4 workaround for hiprtc (#150324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150324
Approved by: https://github.com/jeffdaily, https://github.com/atalman, https://github.com/malfet
2025-03-31 21:55:53 +00:00
4e2997db73 [ROCm][CI] Increase wheel build timeout from 210 to 240 (#150221)
Fixes #150046.  Increasing the timeout from 210 to 240.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150221
Approved by: https://github.com/jeffdaily
2025-03-31 21:46:09 +00:00
925fd4aa2e [export] min/max ranges for dim hints (#149590)
Differential Revision: D71522032

Adds min/max ranges to Dim.AUTO/DYNAMIC/STATIC, so users can do `Dim.AUTO(min=2, max=2048)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149590
Approved by: https://github.com/tugsbayasgalan
2025-03-31 21:32:20 +00:00
dfcd98e684 cd: Fix naming for windows arm64 libtorch builds (#150310)
Apparently the magical incantation to name these correctly lies in the
build_variant variable otherwise it silently does nothing.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150310
Approved by: https://github.com/atalman
2025-03-31 20:12:03 +00:00
80b7f6b704 Adjust TestInductorOpInfo to depend on backend, not device (#146911)
As is the case with many inductor tests, this test adapts test criteria based on device type, where it should be adjusting for the backend registered for that device.

In this particular case, using the upstream triton CPU backend would lead to failures, as reference_in_float would be true as this is required for the C++/OpenMP backend which does not have float16 support. However most triton backends do, and as such should be tested in float16. Similarly a triton backend with a device not described as a GPU would get skipped from testing entirely.

A more generic solution would be ideal, but this would require a lot of work across many tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146911
Approved by: https://github.com/masnesral
2025-03-31 18:24:16 +00:00
ab342d3793 Make PyTorch buildable by CMake-4.x on s390x (#150294)
This is a continuation of
https://github.com/pytorch/pytorch/pull/150203
that fixes nightly build on s390x.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150294
Approved by: https://github.com/malfet
2025-03-31 18:10:02 +00:00
5e34758cef [invoke_subgraph] Support unbacked (#149298)
Differential Revision: [D71420641](https://our.internmc.facebook.com/intern/diff/D71420641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149298
Approved by: https://github.com/zou3519
2025-03-31 17:25:09 +00:00
284b766898 [dynamic shapes] C++ bindings for guard_or_false/true (#150148)
C++ version. Would like to add it in one place to prove it works, but couldn't find one that doesn't expose a chain of data-dependent changes... so just gonna put up the base implementation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150148
Approved by: https://github.com/laithsakka, https://github.com/jingsh
2025-03-31 17:04:25 +00:00
47cdad2995 [ROCm] Enable several fsdp related UTs (#149369)
Enabling 26 UTs for ROCm in the following files:

-  distributed._shard.sharded_optim.test_sharded_optim - 2 UTs
-  distributed._shard.sharded_tensor.ops.test_binary_cmp - 4 UTs
-  distributed._shard.sharded_tensor.ops.test_init - 3 UTs
-  distributed._shard.sharded_tensor.ops.test_embedding - 2 UTs
-  distributed._shard.sharded_tensor.ops.test_embedding_bag - 2 UTs
-  distributed._composable.test_replicate_with_compiler - 4 UTs
-  distributed._composable.fsdp.test_fully_shard_grad_scaler - 1 UTs
-  distributed.tensor.test_attention - 4 UTs
-  distributed.tensor.test_matrix_ops - 1 UTs
-  distributed.tensor.test_tensor_ops - 1 UTs
-  distributed.fsdp.test_fsdp_grad_acc - 2 UTs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149369
Approved by: https://github.com/jeffdaily
2025-03-31 16:15:57 +00:00
7c858066ae Revert "Enable TMA persistent GEMM Template by default (#149427)"
This reverts commit b8ef642f04874e13a9f2771902ddb7514f294015.

Reverted https://github.com/pytorch/pytorch/pull/149427 on behalf of https://github.com/clee2000 due to failing tests internally D72116141 ([comment](https://github.com/pytorch/pytorch/pull/149427#issuecomment-2766672200))
2025-03-31 15:58:34 +00:00
57fa99c5c3 Revert "enable out variant of 2-shot reduction (#150153)"
This reverts commit cdeb32d2d1c31b60c65133e83510977c5c180005.

Reverted https://github.com/pytorch/pytorch/pull/150153 on behalf of https://github.com/clee2000 due to failing internal builds D72083877 ([comment](https://github.com/pytorch/pytorch/pull/150153#issuecomment-2766633712))
2025-03-31 15:43:24 +00:00
e57fa18b40 Revert "Add one_shot_all_reduce_copy to allow non-symm-mem allocated tensors to be reduced (#150129)"
This reverts commit 8a872261dcb3797557d1965af6832677a77efec1.

Reverted https://github.com/pytorch/pytorch/pull/150129 on behalf of https://github.com/clee2000 due to breaking internal builds D72080428 ([comment](https://github.com/pytorch/pytorch/pull/150129#issuecomment-2766619006))
2025-03-31 15:37:54 +00:00
f74d5d576a Update torch-xpu-ops commit pin to 3ee2bd2 (#150300)
Update the torch-xpu-ops commit to [3ee2bd2f13e1ed17a685986ff667a58bed5f2aa5](3ee2bd2f13)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150300
Approved by: https://github.com/EikanWang
2025-03-31 13:36:11 +00:00
bbb9b2476b Unify use of enableCollectiveHashDebug_ and trivial updates (#142865)
Use `enableCollectiveHashDebug_` instead of checking env ad-hoc when `TORCH_DISTRIBUTED_DEBUG = DETAIL`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142865
Approved by: https://github.com/fegin, https://github.com/kwen2501
2025-03-31 12:23:30 +00:00
c158eac0de [ROCm] use correct workspace for hipblaslt, silence warning (#150227)
Follow up to #145130. That PR caused a warning on ROCm the first time hipblaslt was called for any workload, always.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150227
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-31 09:49:43 +00:00
51f0403f46 Update the baseline for max_autotune ci workflow (#149107)
Since the issue https://github.com/pytorch/pytorch/issues/148535 is fixed in PR https://github.com/pytorch/pytorch/pull/148923, update the baseline for max_autotune ci workflow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149107
Approved by: https://github.com/chuanqi129, https://github.com/leslie-fang-intel, https://github.com/desertfire
2025-03-31 09:45:44 +00:00
4aded85e79 Fix space typo in warning message (#143473)
Warning shows up like this (no space between willbe):
```
/home/xxx/.local/lib/python3.11/site-packages/torch/distributed/fsdp/_state_dict_utils.py:827:
UserWarning: When using ``NO_SHARD`` for ``ShardingStrategy``, full_state_dict willbe returned.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143473
Approved by: https://github.com/mikaylagawarecki, https://github.com/kwen2501
2025-03-31 07:38:02 +00:00
c976321541 Use variadic length tuple for torch.masked.DimOrDims (#149870)
`tuple[int]` means only a tuple of length 1, which is not what was intended.

```python
loss = torch.masked.mean(loss, mask=mask, dim=(-1, -2))  # Argument of type "tuple[Literal[-1], Literal[-2]]" cannot be assigned to parameter "dim" of type "DimOrDims"
```
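The difference between the two annotations can be checked with the standard `typing` helpers. This is a minimal sketch: `DimOrDimsSketch` and `normalize_dims` are hypothetical names used for illustration, not the actual `torch.masked` definitions.

```python
from typing import Optional, Union, get_args

# tuple[int] describes a tuple holding exactly one int;
# tuple[int, ...] describes a tuple of ints of any length.
FixedLength = tuple[int]
Variadic = tuple[int, ...]

# Hypothetical stand-in for an alias like DimOrDims (not the real definition).
DimOrDimsSketch = Optional[Union[int, tuple[int, ...], list[int]]]


def normalize_dims(dim: DimOrDimsSketch, ndim: int) -> tuple[int, ...]:
    """Return the selected dimensions as a tuple of non-negative indices."""
    if dim is None:
        return tuple(range(ndim))
    dims = (dim,) if isinstance(dim, int) else tuple(dim)
    return tuple(d % ndim for d in dims)


print(get_args(FixedLength))  # a single type argument
print(get_args(Variadic))     # a type argument followed by Ellipsis
print(normalize_dims((-1, -2), ndim=4))
```

With the variadic form, a type checker accepts `dim=(-1, -2)` as well as tuples of any other length.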
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149870
Approved by: https://github.com/Skylion007
2025-03-31 07:06:58 +00:00
f1b74037b1 Fix bug when Inductor include path contains spaces (#148271)
This PR fixes a bug with how include directories with spaces are handled on Windows. I ran into an edge case with torch.compile() - it will error out with an exception on Windows. In particular, it will try to execute the following: `cl /I C:/Program Files/Python311/Include ...`, where `C:/Program` will be treated as separate from `Files/Python311/Include`.

I looked into using something like `shlex.quote` or `pathlib.Path`, but I didn't find those options to be suitable (shlex is POSIX shell only, pathlib.Path does not escape spaces).

There is another place in the function that also deals with escaping spaces. My fix follows the same style. 0ff2e6a85a/torch/_inductor/cpp_builder.py (L1464)
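
The general idea of quoting include directories that contain spaces can be sketched as follows. This is an illustration only: `quote_if_spaces` and `include_flags` are hypothetical helpers, not the code in `cpp_builder.py`.

```python
def quote_if_spaces(path: str) -> str:
    """Wrap a path in double quotes when it contains a space."""
    already_quoted = path.startswith('"') and path.endswith('"')
    if " " in path and not already_quoted:
        return f'"{path}"'
    return path


def include_flags(include_dirs: list[str], prefix: str = "/I") -> list[str]:
    """Build one include flag per directory, quoting where needed."""
    return [f"{prefix} {quote_if_spaces(d)}" for d in include_dirs]


command = " ".join(["cl", *include_flags(["C:/Program Files/Python311/Include"])])
print(command)
```

Quoting keeps `C:/Program Files/Python311/Include` as a single argument instead of letting the command line split it at the space.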

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148271
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-03-31 06:46:05 +00:00
b99e0c5412 Fix mtia_extension.cpp setDevice() to correctly set current_device (#149398)
We referred to this code and found a minor bug. This fixes it for others who reference the code in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149398
Approved by: https://github.com/janeyx99
2025-03-31 06:07:22 +00:00
4f14224dc8 [Inductor] Fix torch.polygamma() when n == 1 (#147453)
Fixes #147450

Be consistent with cpu kernel:

77dbd28535/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp (L433-L444)

Got this in the case:

```
Eager: tensor([1.2914e+15]), dtype: torch.float32
Compile: tensor([1.2914e+15]), dtype: torch.float32
Expected: tensor([6.5808e+32], dtype=torch.float64), dtype: torch.float64
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147453
Approved by: https://github.com/eellison
2025-03-31 05:27:46 +00:00
9456738edf [c10d][fr] Allow multiple writer registration with warnings (#150232)
The writer's lifespan is the whole program, which is sub-optimal, but it is a practical compromise that lets writer registration happen outside PG creation.

We therefore allow multiple writer registrations, with warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150232
Approved by: https://github.com/d4l3k, https://github.com/kwen2501
2025-03-31 04:43:43 +00:00
ad54b3aae2 test 0-dim squeeze in basic.TestSqueeze (#147928)
Replace TODO with 0-dim squeeze, checks scalar is unchanged in `basic.TestSqueeze`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147928
Approved by: https://github.com/janeyx99
2025-03-31 04:35:16 +00:00
c3bb174bb2 SubsetRandomSampler - changed iteration over tensor to iteration over list (#149126)
Digging further into the problem at https://github.com/UKPLab/sentence-transformers/pull/3261, it boils down to an expensive loop over a torch tensor. Looping over a list, as in RandomSampler, solves the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149126
Approved by: https://github.com/divyanshk, https://github.com/cyyever
2025-03-31 04:33:35 +00:00
59abb8c7a2 Fix documentation build errors caused by unsupported section titles (#150205)
Fixes #150134

Build with `make html` looks OK now:
```shell
reading sources... [100%] torch.compiler_get_started .. xpu
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
preparing documents... done
writing output... [ 80%] generated/torch.nn.Softsign .. generated/torch.nn.modules.module.register_module_full_backward_
writing output... [ 86%] generated/torch.nn.modules.module.register_module_module_registration_hook .. generated/torch.r
writing output... [100%] generated/torch.xpu.get_rng_state .. xpu
generating indices... genindex done
highlighting module code... [100%] typing
writing additional pages... search done
copying images... [100%] _static/img/torch_cuda_memory/allocator_state_history.png
copying static files... done
copying extra files... done
dumping search index in English (code: en)... done
dumping object inventory... done
build succeeded.

The HTML pages are in build/html.
```

New rendering looks like this:

![image](https://github.com/user-attachments/assets/af7e23a5-9dfd-4cb6-9333-a9e8cfe47ea0)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150205
Approved by: https://github.com/albanD
2025-03-31 04:27:44 +00:00
32afecff8b [PrivateUse1] Impl isBuilt() and isAvailable() (#149594)
Follow-up: #146098

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149594
Approved by: https://github.com/albanD
2025-03-31 04:18:38 +00:00
46c8f2e965 Update docstring to match code. (#148455)
Very tiny fix to the docstring: passing `grid_size=None` results in an exception.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148455
Approved by: https://github.com/mikaylagawarecki
2025-03-31 04:16:11 +00:00
ca2ffc23ab [ROCm][TunableOp] Stricter unit tests for online and offline tuning (#150142)
Improvements to unit tests and warnings for unsupported cases in offline tuning. Here are more details:
- Previously we only compared the OpSig for the untuned vs. tuned entries. This was not strict enough so we now compare OpSig+ParamSig.
- The main offline and online UTs are now stricter to make sure we exercise the code paths for the four combinations of transA and transB.
- Offline tuning does not support some tensor shapes. Emit warning and skip tuning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150142
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-31 04:12:08 +00:00
157bff22f7 [Async TP] Fuse matmul-reduce-scatters when reduce scatters have multiple users, and save fused node for backward instead of reduce_scatter node (#149946)
Fixes #149876

## Stack
- [previous PR in stack] https://github.com/pytorch/pytorch/pull/149247

## TL;DR
This PR implements support in async TP for saving the reduce-scatter result for backward, which previously would break the torchtitan AC policies: no AC, per op SAC, and per layer SAC.

## Context
In torchtitan's Llama3 per op SAC policy, we want to save the output of `reduce_scatter` ops for backward, which is useful for TP. The reduce_scatter op is also saved for No AC (since all activations are saved) and per layer SAC (since we save the activations for N full layers, which do contain reduce-scatters for TP).

However, doing this causes incompatibility with Async TP for the AC policies above, for 2 reasons:

1) The graph pattern matching specifically only matches on reduce scatter nodes with 1 user, but reduce_scatter nodes saved for backwards will have 2 users (the 2nd one being the return/output node, which saves it for backward).

2) The subgraph replacement logic which replaces the users of the `wait_tensor` after the reduce-scatter with the new fused node has no mechanism to save the fused_node for backward instead of the reduce-scatter node. This means we cannot directly replace the subgraph, since we can't delete nodes which still have users (in this case, the output node is still using the reduce-scatter node).

To fix this, we do 2 things:

1) Add additional pattern matching logic to also match reduce-scatter nodes with 2 users, so we also perform fusion when reduce-scatter is saved for backward.

2) When replacing the subgraph with the fused node, detect if the reduce-scatter was saved for backward, and if so, save the result of the fused node for backward instead. This enables us to properly erase the subgraph and prevent the memory leak which occurred in #149876
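
The two steps above can be sketched on a toy graph. This is a simplified illustration under stated assumptions: `Node`, `is_fusion_candidate`, and `replace_with_fused` are hypothetical stand-ins, not `torch.fx` or the actual async TP pass.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """Toy stand-in for a graph node; not torch.fx.Node."""
    name: str
    op: str
    users: list["Node"] = field(default_factory=list)


def add_user(producer: Node, consumer: Node) -> None:
    producer.users.append(consumer)


def saved_for_backward(reduce_scatter: Node) -> bool:
    """True when the graph output node is among the users."""
    return any(user.op == "output" for user in reduce_scatter.users)


def is_fusion_candidate(reduce_scatter: Node) -> bool:
    """Accept one user, or two users when one of them is the output node."""
    count = len(reduce_scatter.users)
    if count == 1:
        return True
    return count == 2 and saved_for_backward(reduce_scatter)


def replace_with_fused(reduce_scatter: Node, fused: Node) -> None:
    """Move every user, including the output node, onto the fused node."""
    for user in list(reduce_scatter.users):
        add_user(fused, user)
    # With no users left, the original node can be erased from the graph.
    reduce_scatter.users.clear()


rs = Node("rs", "reduce_scatter")
wait = Node("wait", "wait_tensor")
out = Node("out", "output")
add_user(rs, wait)
add_user(rs, out)
fused = Node("fused", "fused_matmul_reduce_scatter")
if is_fusion_candidate(rs):
    replace_with_fused(rs, fused)
print([user.name for user in fused.users], rs.users)
```

Because the output node now points at the fused node, nothing references the original reduce-scatter, which is what allows the subgraph to be erased.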

## Other changes
- Continue to throw an error if we don't find any candidate all-gathers or reduce-scatters for fusion (since TP should have both) but DON'T throw an error if we don't fuse any matmul-reduce-scatters. This is because I've found there are actually valid graphs where we do fuse reduce scatters in the forward graph but not the backward graph (in the backward pass there are reduce-scatters but the producer op is an "add" not a mm/scaled_mm).

## Test plan

1. All unit tests are passing
2. Visualized the graphs and verified the fusion is occurring properly.
3. Verified via manual torchtitan runs there is no memory leak / OOM occurring anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149946
Approved by: https://github.com/fegin
2025-03-30 19:05:47 +00:00
cbc0964636 Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054)
This PR adds CachingAutotuners that are statically launchable to FXGraphCache's cache entry.

Regular CachingAutotuners, with triton kernels attached to them, are not very good to cache: they are very large, and take huge amounts of space since they track all of the various binary files, along with various metadata. We could probably figure out what information we could delete from the kernel and have it still work, but with StaticCudaLauncher, we no longer have to. Instead, we can cache every compiled triton kernel that is statically launchable.

Because StaticTritonCompileResult is serializable, and designed to have a very small memory footprint, we can save it into FXGraphCache without increasing the cache size significantly. We store it as a part of CompiledFxGraph.triton_bundle.

Then, on load, we repopulate the CachingAutotuner into our CompiledTritonKernel cache.

The upsides of this are many:
- We no longer need to call into a separate process on cache hit
- We can *guarantee* that the triton kernel we got from our cache entry is the one we use to launch again, so no worries about triton's own caching logic
- Once we achieve feature parity and all torch.compiled triton kernels are statically launchable, we can clean up a bunch of TritonBundler code and simplify the cache hit logic.

Fixes #149449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149054
Approved by: https://github.com/oulgen
2025-03-30 17:51:11 +00:00
e91f84c87d [BE]: Update cudnn frontend submodule to 1.11.0 (#149759)
Update the cuDNN frontend submodule to 1.11.0. Adds new features such as score_mod from flex_attention, along with many bugfixes and new feature knobs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149759
Approved by: https://github.com/jansel
2025-03-30 17:14:26 +00:00
515b45e569 Add a warning when a tensor with requires_grad=True is converted to a scalar (#143261)
Fixes #143071

Operations performed on tensors with `requires_grad=True` such as
```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
```
and
```python
x = torch.tensor(2.0, requires_grad=True)
y = torch.pow(x,3)
```
are valid operations.

While an operation using `numpy` like
```python
import numpy as np

x = torch.tensor(2.0, requires_grad=True)
y = np.pow(x,3)
# > RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
```
leads to an error.

However, an operation that uses `math` like
```python
import math

x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
```
does not cause an error, and `y` is no longer a tensor with a gradient!

This represents a [footgun](https://en.wiktionary.org/wiki/footgun#Noun) for some users, like myself when training small, custom, non-neural network models.

To prevent future undesired behavior, I added a warning when converting tensors with `requires_grad=True` to scalars. Now, when using `math.pow` on a `tensor`, we get a single warning with:
```python
x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
# > UserWarning: Converting a tensor with requires_grad=True to a scalar may lead to unexpected behavior.
# Consider using tensor.detach() first.
```
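
The mechanism behind the footgun can be reproduced without PyTorch: `math.pow` converts its arguments with `float()`, so any object that defines `__float__` is silently turned into a plain Python float. A minimal sketch, where `GradScalar` is a hypothetical toy class and not `torch.Tensor`:

```python
import math
import warnings


class GradScalar:
    """Toy stand-in for a 0-dim tensor with requires_grad=True."""

    def __init__(self, value: float, requires_grad: bool = False) -> None:
        self.value = value
        self.requires_grad = requires_grad

    def __float__(self) -> float:
        # math.pow converts its arguments with float(), which lands here.
        if self.requires_grad:
            warnings.warn(
                "Converting a value with requires_grad=True to a scalar "
                "drops gradient tracking.",
                UserWarning,
                stacklevel=2,
            )
        return float(self.value)


x = GradScalar(2.0, requires_grad=True)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    y = math.pow(x, 3)

print(type(y).__name__, y, len(caught))
```

The result `y` is an ordinary float, so the only signal that gradient information was lost is the warning emitted during the conversion.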

Please let me know if you have any questions 👍
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143261
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-03-30 11:19:07 +00:00
e8a11f175e [BE] Use auto in MPS codebase more (#150000)
Non-trivial (but still no-op) changes:
- Replace `[mpsGraph broadcastTensor:[mpsGraph constantWithScalar:1 dataType:MPSDataTypeInt32] toShape:inputTensor.shape name:nil]` with `[mpsGraph constantWithScalar:1 dataType:MPSDataTypeInt32 shape:inputTensor.shape]`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150000
Approved by: https://github.com/dcci, https://github.com/cyyever
2025-03-30 05:35:58 +00:00
005c9b2f4f Fix _Waitcounter decorator and dd backward pass wait counter (#150235)
Summary:
This logs a wait counter for backward compile and fixes weirdness with nested context managers.

Because the nesting issue meant the old wait counters added through dynamo_timed were never created, I am also changing the key nomenclature from `pytorch.dynamo_timed` to `pytorch.wait_counter`. We want to use the same nomenclature everywhere to make keys easy to find.

Reviewed By: jamesjwu

Differential Revision: D72032055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150235
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
2025-03-30 05:20:12 +00:00
cc58ecceea Move dump location to avoid dumping twice (#150219)
Summary:
If we put the dumping code in codegen, we might get a separate node_mapping dump for the constant folded graph (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/compile_fx.py#L1119).

We move it into compile_fx.py so there's only one node_mapping dump.

Test Plan: CI

Reviewed By: YUNQIUGUO

Differential Revision: D72068715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150219
Approved by: https://github.com/YUNQIUGUO
2025-03-30 03:35:38 +00:00
3140565db6 Update type of create_block_mask to more accurately reflect things (#150244)
Fixes some mypy issues
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150244
Approved by: https://github.com/drisspg
2025-03-29 21:55:57 +00:00
879a293db8 fix et trace collection of all_to_all (#149485)
![image](https://github.com/user-attachments/assets/1e602dec-24a4-4f47-88c0-9311737e217b)
![image](https://github.com/user-attachments/assets/c48a3273-43fb-4a7f-9341-b90cb6b10785)

fix ET trace collection to all_to_all.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149485
Approved by: https://github.com/shengfukevin, https://github.com/kwen2501
2025-03-29 20:17:24 +00:00
965784eb9b [MPSInductor] Specify max_total_threads_per_threadgroup (#150247)
Specify it when generating a reduction kernel; otherwise the compiler can unroll loops so much that the kernel cannot be launched with the intended threadgroup size.

Extend `c10::metal::max` to accept different dtypes

Together this fixes `test_large_broadcast_reduction`

TODO:
  - Explore different threadgroup_sizes for best perf

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150247
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #150246
2025-03-29 19:37:15 +00:00
52135db69a [BE] Fix signed/unsigned comparison warning (#150246)
One will see them only if compilation fails, but still
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150246
Approved by: https://github.com/cyyever, https://github.com/jansel
2025-03-29 15:12:42 +00:00
3b00ff8850 Revert "[Profiler] Give non-zero default values to start events (#149757)"
This reverts commit bc72420bcb37390af3fced885e019903e6e425bd.

Reverted https://github.com/pytorch/pytorch/pull/149757 on behalf of https://github.com/malfet due to Broke windows builds, which were also the signal on the HUD ([comment](https://github.com/pytorch/pytorch/pull/149757#issuecomment-2763461365))
2025-03-29 15:08:55 +00:00
f3c77b2458 Set requires grad in TensorMaker::make_tensor() (#148255)
Fixes #146419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148255
Approved by: https://github.com/soulitzer
2025-03-29 08:06:42 +00:00
b8ef642f04 Enable TMA persistent GEMM Template by default (#149427)
Previously, this could not be landed because H100 capacity for CI testing was limited. Benchmarking on H100 CI looks good now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149427
Approved by: https://github.com/drisspg
2025-03-29 07:32:42 +00:00
bc72420bcb [Profiler] Give non-zero default values to start events (#149757)
The intent of the existing code is to

> // Assign system TIDs to start events based on the system TID of the next
    // observed event with the same Python TID.

However, if there are start events that don't share the same Python TID as later observed events, then they are left with the default initialization of DeviceAndResource and assigned values of `0`. This is problematic because Kineto uses `device=0, resource=0` for the first GPU (or other backend) device.

This PR maintains the previous logic of using TIDs from later events if any are present, but defaults to the current process and system thread IDs if there aren't later events to reference.
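
The assignment rule described above can be sketched in a few lines. This is an illustration under assumptions: the `Event` fields and `assign_start_event_tids` are hypothetical and do not mirror the profiler's C++ data structures.

```python
import os
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """Toy profiler event; field names are illustrative only."""
    python_tid: int
    is_start: bool
    pid: Optional[int] = None
    system_tid: Optional[int] = None


def assign_start_event_tids(events: list[Event]) -> None:
    """Fill in pid/system_tid for start events, walking the list backwards."""
    default_ids = (os.getpid(), threading.get_native_id())
    # python_tid -> (pid, system_tid) of the nearest later observed event
    last_seen: dict[int, tuple[int, int]] = {}
    for event in reversed(events):
        if not event.is_start:
            if event.pid is not None and event.system_tid is not None:
                last_seen[event.python_tid] = (event.pid, event.system_tid)
            continue
        # Use the next observed event with the same Python TID if there is
        # one; otherwise fall back to the current process and thread IDs.
        event.pid, event.system_tid = last_seen.get(event.python_tid, default_ids)


events = [
    Event(python_tid=1, is_start=True),
    Event(python_tid=2, is_start=True),
    Event(python_tid=1, is_start=False, pid=10, system_tid=111),
]
assign_start_event_tids(events)
print(events[0], events[1])
```

The fallback means a start event never keeps `0, 0`, so it cannot be confused with the first device in the trace.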

This issue was discovered while working to implement a custom backend and some CPU start events were appearing on the same process and thread as the device in the trace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149757
Approved by: https://github.com/sraikund16
2025-03-29 06:29:25 +00:00
ec6fa547a1 Remove unnecessary "special linking" for BLAS_LIBRARIES (#145487)
Remove the "special linking" that involves listing `BLAS_LIBRARIES` thrice if `TH_BINARY_BUILD` is set, as it should not be any different from listing it just once.

The code seems to date back to commit cfcf2af95f91a88ec61cbcac8b30a718e7332aa5. The original code already listed `BLAS_LIBRARIES` thrice, but it provided no explanation for doing that — and without `TH_BINARY_BUILD`, BLAS was not linked at all.  The current version seems to originate in d6a8d28d6529a4f0b80a8c046ca9c36ca6c8b347 — and it already provided an `ELSE` clause listing `BLAS_LIBRARIES` only once.  From this, I suspect that it is probably an unnecessary leftover.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145487
Approved by: https://github.com/malfet
2025-03-29 05:13:22 +00:00
2c9e07ecd2 [BE] Remove outdated RPC benchmark (#146716)
We have lots of outdated unused + uncalled code in our codebase, namely in our benchmarks and examples folders among others. The last change to this directory was 4 years ago and this code looks dead. cc @albanD @H-Huang for feedback

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146716
Approved by: https://github.com/Skylion007, https://github.com/H-Huang
2025-03-29 04:44:36 +00:00
beea76020b Removed ROCM ifdef that governs thread count + smem parallel reduction. (#149779)
#149548 fixed the arbitrarily missing parallelism for NLL, but it also added an arbitrary #ifdef ROCM guard around this fix, preventing its use on CUDA GPUs. There is also a problem with the way the kernel does the reduction from the intermediate shared memory, using only thread 0 walking linearly. This has been changed to a simple parallel reduction algorithm.
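
The difference between the two reduction strategies can be simulated on the host. This is a plain-Python sketch of the general technique, not the kernel's actual code; `serial_reduce` and `tree_reduce` are hypothetical names.

```python
def serial_reduce(shared: list[float]) -> float:
    """Thread 0 walks the whole buffer: len(shared) - 1 sequential additions."""
    total = shared[0]
    for value in shared[1:]:
        total += value
    return total


def tree_reduce(shared: list[float]) -> tuple[float, int]:
    """Pairwise reduction; returns (sum, number of synchronized steps).

    Assumes len(shared) is a power of two, as with typical block sizes.
    """
    buf = list(shared)
    stride = len(buf) // 2
    steps = 0
    while stride > 0:
        # On a GPU, each index below is handled by its own thread,
        # followed by a barrier before the next step.
        for tid in range(stride):
            buf[tid] += buf[tid + stride]
        stride //= 2
        steps += 1
    return buf[0], steps


data = [float(i) for i in range(8)]
print(serial_reduce(data), tree_reduce(data))
```

For a buffer of N partial sums, the pairwise version needs about log2(N) synchronized steps instead of N - 1 sequential additions by a single thread.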

Tested changes with `python3 test/test_nn.py`

```
Ran 3551 tests in 200.554s

OK (skipped=998, expected failures=4)
```

Performance before and after with the script below with an RTX 3090, batch size x axis, time (sec) y axis. This GPU is also used for display graphics and such, so the measurements are pretty noisy, even with 100 samples.

## Before
![before_nll](https://github.com/user-attachments/assets/c19044aa-7bc2-4223-b560-9be7acedef35)

## After ifdef removal
![after_nll](https://github.com/user-attachments/assets/4672f5ca-93b0-4c34-a257-81b2ab364995)

## After Parallel SMEM reduction

![after_reduction](https://github.com/user-attachments/assets/9607b68c-7d9d-4ee0-9f99-8989d134e4fd)

```python
import torch
from matplotlib import pyplot as plt
from torch.nn import functional as F

timing = []
batches = list(range(32, 4096, 32))

for batch in [32] + batches:
    samples = []
    for _ in range(100):
        probs = torch.rand(batch, 10).cuda()
        labels = torch.randint(0, 10, (batch,)).cuda()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        F.nll_loss(probs, labels)
        end.record()
        torch.cuda.synchronize()
        elapsed = start.elapsed_time(end)
        samples.append(elapsed)
    timing.append(sum(samples) / len(samples))
timing = timing[1:]

plt.plot(batches, timing)
plt.show()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149779
Approved by: https://github.com/jeffdaily
2025-03-29 04:27:54 +00:00
a8dd9b6c27 [cuDNN][SDPA] abide by enable_gqa convention in cuDNN (#149976)
long overdue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149976
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2025-03-29 04:24:51 +00:00
340beb7f7c Add .editorconfig (#149193)
This adds an .editorconfig file to automatically configure devs local Editors / IDEs with the basic formatting rules of the project.

List of supported editors: https://editorconfig.org/#pre-installed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149193
Approved by: https://github.com/malfet
2025-03-29 04:07:21 +00:00
66a7a49d64 Super tiny fix typo (#149190)
... when checking the doc to build from source
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149190
Approved by: https://github.com/jingsh
2025-03-29 04:06:05 +00:00
5e787bf3e5 [reland] Support torchbind in OSS proxy executor (#150196)
Summary:
The original Diff D69500038 is reverted due to a false alarm on trunk health.

Implement torchbind support in OSSProxyExecutor.

Exactly the same as the implementation in FbProxyExecutor.

D69693697 - fbProxyExecutor
D69887230 - fbProxyExecutor but for torchbind method
D70746626 - Support None output type

Other changes:

- When generating the schema of the CallTorchBind HOP, the arg name of the torchbind object arg should be the same as the torchbind method's torchbind object arg (instead of `obj`).

- In `AOTIModelPackageLoader`, we extract everything in `data/constants` to the `tmp_dir/data/aot_inductor/<model>/` folder, so the torchbind objs exist in the same folder as the rest of the files (e.g. cpp, so). This is to be consistent with how files are packaged internally (more details in the internal Diff summary).

Note on using `filesystem`:

Seems like there'll be [issues](https://github.com/pytorch/pytorch/pull/137209) with using the `filesystem` header on Linux, so here I use string manipulation instead of `filesystem::path`.

Test Plan:
```
test/inductor:torchbind -- -r torchbind_aoti
test/inductor:torchbind -- -r aot_compile
```

Differential Revision: D72063691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150196
Approved by: https://github.com/hl475, https://github.com/desertfire
2025-03-29 03:36:55 +00:00
0861af2596 [pytorch][triton] Warp specialization support in TritonTemplate for torchinductor (#148503) (#150122)
Summary:
Currently, `num_warps` and `num_stages` are the only kernel options supported for inductor auto-tuning using `TritonTemplate`.

To allow warp specialization, kernel options should also allow specifying `num_consumer_groups` and `num_buffers_warp_spec`.

NOTE: Currently gating changes to FBCODE using HAS_WARP_SPEC which is only available on triton/release-3.3.x

Test Plan:
## Unit test
Added the test `test_triton_template_warp_specialization` to verify that the generated kernel contains configs for `num_consumer_groups` and `num_buffers_warp_spec`.

## Functional Testing
Specific to flexattention.
```
import torch
from torch.nn.attention.flex_attention import flex_attention

from triton.testing import do_bench

make_tensor = lambda: torch.rand(8, 16, 8192, 128, device="cuda", dtype=torch.bfloat16)
q, k, v = make_tensor(), make_tensor(), make_tensor()

flex_compiled = torch.compile(flex_attention, fullgraph=True)

print(do_bench(lambda: flex_compiled(q, k, v, kernel_options={"num_warps": 4})))
```

triton do_bench results:
- default compile: 15.176783561706543
- with warp-spec: 9.452800750732422

## Extra notes
- generated triton kernel using `TORCH_LOGS=output_code`: P1740612877
- TTGIR for fused kernel: P1740614685

Differential Revision: D71982587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150122
Approved by: https://github.com/eellison, https://github.com/zou3519, https://github.com/jansel
2025-03-29 03:36:50 +00:00
03313c6619 [AOTInductor] Add function for users to extract constants in container (#150163)
Summary: Add extract_constant_map that allows users to inspect the constants being used by AOTInductor

Test Plan:
`python test/inductor/test_aot_inductor.py -k extract_constants_map`

`LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib /data/users/$USER/pytorch/build/bin/test_aoti_inference`

Differential Revision: D72020400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150163
Approved by: https://github.com/chenyang78
2025-03-29 03:36:12 +00:00
7a470c9320 [ROCm] change preferred blas lib defaults (#150212)
Fixes #148883
Fixes #150155

Also adds `at::BlasBackend::Default`. Instinct cards prefer hipBLASLt, everything else prefers rocBLAS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150212
Approved by: https://github.com/jeffdaily
2025-03-29 03:33:07 +00:00
29b3fdab01 TCPStoreLibUvBackend: support masterListenFd (#150215)
This supports `masterListenFd` which is required for full compatibility with the non-libuv TCPStore. The code was just missing a `uv_listen` call and now it works just fine.

This is required to migrate the last remaining uses of TCPStore off of the non-libuv backend.

Test plan:
```
pytest -v test/distributed/test_store.py -k test_take_over_listen_socket
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150215
Approved by: https://github.com/fduwjj
2025-03-29 01:58:07 +00:00
493c7fa66f [Cmake] Make PyTorch buildable by CMake-4.x (#150203)
By turning on compatibility mode for protobuf, nnpack, PSimd and FP16, ittapi, TensorPipe and Gloo
Update CMake requirements

Revert 0ece461ccafe5649d2d0f058ff5477765fd56499 and b0901d62ae2c2e909f91401eacebf3731df20cbe to test that it actually works

TODO:
  - Update/get rid of those libraries

Fixes https://github.com/pytorch/pytorch/issues/150149

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150203
Approved by: https://github.com/clee2000
2025-03-29 01:39:13 +00:00
edb6f1b7a8 Move MacOS inductor tests to M2-15 runner (#150228)
To get more representative results (and be able to run more tests eventually)
Also get pull_request for workflow dispatch if yml file is modified
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150228
Approved by: https://github.com/clee2000
2025-03-29 01:36:07 +00:00
65139eb050 if blaslt fails, fall back to blas (#150147)
Fixes #150016.

This is implemented for both cublaslt and hipblaslt. gemm_and_bias on failure will fall back to unfused path. lt gemm on failure falls back to gemm even if gemm preference is set to lt.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150147
Approved by: https://github.com/malfet
2025-03-28 23:39:53 +00:00
ccfde4dadf Revert "Move MacOS inductor tests to M2-15 runner (#150228)"
This reverts commit b1b58708b26a840f6bf0ccdd14a9916ff7291fb4.

Reverted https://github.com/pytorch/pytorch/pull/150228 on behalf of https://github.com/malfet due to  Should not have ignored lint signal ([comment](https://github.com/pytorch/pytorch/pull/150228#issuecomment-2762794366))
2025-03-28 23:05:27 +00:00
b1b58708b2 Move MacOS inductor tests to M2-15 runner (#150228)
To get more representative results (and be able to run more tests eventually)
Also get pull_request for workflow dispatch if yml file is modified
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150228
Approved by: https://github.com/clee2000
2025-03-28 22:15:40 +00:00
7ac0658757 Revert "[CI] Fix docker builds failing due to cmake update by setting CMAKE_POLICY_VERSION_MINIMUM (#150220)"
This reverts commit 87549a65c96cd7e48f024c02e7daa3f227b2bf18.

Reverted https://github.com/pytorch/pytorch/pull/150220 on behalf of https://github.com/clee2000 due to doesn't solve the problem since the installed cmake 4 stays on the system, resulting in failed pytorch builds later ([comment](https://github.com/pytorch/pytorch/pull/150220#issuecomment-2762623078))
2025-03-28 21:44:03 +00:00
4271ebdbdc Explicitly state that a test-infra branch cut is required (#150214)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150214
Approved by: https://github.com/atalman
ghstack dependencies: #150210, #150211, #150213
2025-03-28 21:13:29 +00:00
2b2286c4ec Update reference for binary_build workflows (#150213)
There hasn't been a circleci for a looooong time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150213
Approved by: https://github.com/atalman
ghstack dependencies: #150210, #150211
2025-03-28 21:13:29 +00:00
4118d7307f Update referenced PRs for ecosystem library branch cut (#150211)
The old PRs had a lot of extra changes in them which are no longer needed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150211
Approved by: https://github.com/atalman
ghstack dependencies: #150210
2025-03-28 21:13:22 +00:00
f231500c50 Mention the cherry-picker bot in the release docs (#150210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150210
Approved by: https://github.com/atalman
2025-03-28 21:13:15 +00:00
87549a65c9 [CI] Fix docker builds failing due to cmake update by setting CMAKE_POLICY_VERSION_MINIMUM (#150220)
Set the CMAKE_POLICY_VERSION_MINIMUM env var to make executorch and halide docker builds pass (they install from those repos which don't have cmake pinned)

This can be removed if executorch and halide update their builds and we update the hash?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150220
Approved by: https://github.com/atalman, https://github.com/malfet
2025-03-28 20:55:04 +00:00
cb83850a24 Fix docs format error in torch.nn (#150156)
Fixes #150152

Fix format error in [torch.nn.CosineSimilarity](https://pytorch.org/docs/stable/generated/torch.nn.CosineSimilarity.html#torch.nn.CosineSimilarity), [torch.nn.KLDivLoss](https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html#torch.nn.KLDivLoss) and other pages.

## Test Result

### Before

#### torch.nn.CosineSimilarity

![Image](https://github.com/user-attachments/assets/1ad633d9-dfaf-43f0-a536-9035a24bf858)

#### torch.nn.KLDivLoss

![Image](https://github.com/user-attachments/assets/20a001b0-1f66-414e-b554-11934d65a4bf)

### After
#### torch.nn.CosineSimilarity
![image](https://github.com/user-attachments/assets/a2d9ea8d-5637-4604-a0e4-9231a4deee44)

#### torch.nn.KLDivLoss
![image](https://github.com/user-attachments/assets/d0e319f9-a3b3-47a7-b2f8-060d46d53bc7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150156
Approved by: https://github.com/cyyever, https://github.com/malfet
2025-03-28 20:54:09 +00:00
7c65911b11 [MPS] Fix dot/mm for conj_tensors (#150157)
- Distinguish between conjugated/non_conjugated inputs by appending conjugation to the operator key
- For matmul or dot, add `conjugateWithTensor:name:` calls before running the op
- Enable testing for conjugated ops by passing `include_conjugated_inputs` to opinfo
- Filter  `include_conjugated_inputs` argument from `sample_inputs_window` (probably should have landed as separate PR)
- Preserve conj property when gathering the views, that fixes `cov` operator

Fixes https://github.com/pytorch/pytorch/issues/148156
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150157
Approved by: https://github.com/dcci
2025-03-28 20:36:44 +00:00
9092dd2e82 [CI] Disable some tests that are failing in periodic (#150059)
Disabling some tests to restore periodic

nogpu avx512 timeout:
59f14d19ae (38492953496-box)

profiler failure: 7ae0ce6360 (38461255009-box)

test_accelerator failure:
87bfd66c3c (39476723746-box)
origin: 146098

test_overrides failure:
bf752c36da (39484562957-box)
origin: 146098

inductor cpu repro:
bb9c426024 (38447525659-box)

functorch eager transforms:
8f858e226b (39488068620-box)
f2cea01f71 (39555064878)
b5281a4a18 (39599355600)
either 148288 or 148261?

2ec9aceaeb/1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150059
Approved by: https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/malfet
2025-03-28 20:31:32 +00:00
2bd5bfa3ce [ROCm] use magma-rocm tarball for CI/CD (#149986)
Follow-up to #149902.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149986
Approved by: https://github.com/malfet
2025-03-28 19:28:50 +00:00
cdeb32d2d1 enable out variant of 2-shot reduction (#150153)
Per title, this version uses symm mem input both as input source and as a work buffer, so input is modified after the end (similar to what fbgemm car reduction does). It is intended to be wrapped in an op that would first copy the real inputs to symm mem buffers that wouldn't be exposed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150153
Approved by: https://github.com/xw285cornell
2025-03-28 19:06:03 +00:00
35ff5084e6 [CI] Remove the xpu env source for linux binary validate (#150138)
Because we have enabled the xpu runtime pypi packages as direct dependencies

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150138
Approved by: https://github.com/atalman
2025-03-28 17:25:37 +00:00
85079e4380 [TD] Enable TD on distributed cpu (#150028)
Enable TD on distributed cpu, I think the only reason it's not is because I forgot to enable it

Get rid of some of the statements that are no ops:
* asan uses default shard
* nogpu got moved to periodic
* no windows cuda testing anymore

Only thing on pull and trunk that doesn't use TD is dynamo_wrapped but I think it's fast enough to be ok for now, we can take another look after this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150028
Approved by: https://github.com/ZainRizvi
2025-03-28 17:19:11 +00:00
cf7447ae99 Revert "cpp_wrapper: Fix even more tests (#147225)"
This reverts commit d25acac357ff8663a7787e57e6bc5e69987a8f9a.

Reverted https://github.com/pytorch/pytorch/pull/147225 on behalf of https://github.com/yangw-dev due to broke test internally test/inductor/test_benchmark_fusion ([comment](https://github.com/pytorch/pytorch/pull/147225#issuecomment-2761944564))
2025-03-28 17:07:52 +00:00
e691fcae0e Revert "cpp_wrapper: precompile a few more commonly used headers, and improve RAIIPyObject interface (#149350)"
This reverts commit 2b20d1433f4e5c7556fe4679d89b8f795990d494.

Reverted https://github.com/pytorch/pytorch/pull/149350 on behalf of https://github.com/yangw-dev due to broke test internally test/inductor/test_benchmark_fusion ([comment](https://github.com/pytorch/pytorch/pull/147225#issuecomment-2761944564))
2025-03-28 17:07:52 +00:00
b0901d62ae Pin cmake to 3.31.2 for windows conda install (#150185)
Trying to fix nightly failures
Cmake 4.0 update https://pypi.org/project/cmake/4.0.0/ broke nightly builds
You can see it here: https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=cuda11_8-build
and here: https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=
This fix is for Windows builds. Linux and MacOS were already fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150185
Approved by: https://github.com/jeanschmidt, https://github.com/ZainRizvi
2025-03-28 17:03:02 +00:00
a469ddc663 [inductor] No type promotion for slice_scatter (#150090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150090
Approved by: https://github.com/eellison, https://github.com/zou3519
ghstack dependencies: #149087, #149667, #150036, #148953
2025-03-28 17:02:01 +00:00
1bdf996e7a [CI] Fix log artifact not containing test logs? (#149577)
Sometimes I would find a log artifact that only has usage_logs.txt in it, even though there are other logs created by tests.  I think this is somehow caused by output buffering with find.  I don't understand how, but at the very least, I can see that all the jobs on this PR have the logs from the test runs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149577
Approved by: https://github.com/ZainRizvi
2025-03-28 17:00:00 +00:00
d5a8bd0688 [CI][docker] Use multistage build for triton (#149413)
Seems to reduce docker pull times by ~3 min if triton is requested; some compressed docker image sizes seem to have decreased by roughly 1/3

Also add check that triton is installed/not installed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149413
Approved by: https://github.com/malfet
2025-03-28 16:07:19 +00:00
0ece461cca Pin cmake==3.31.6 (#150158)
I'm not sure if this is the right thing to do, but cmake 4.0.0 got released on pypi and our builds are failing with it

Example:
aa70d62041 (39555975425-box)

I guess we have to go change all the cmake_minimum_required to >=3.5?

backwards compat is still failing because it's building with the base commit, which this pr can't really change until it gets merged, but at least manywheel binary builds got past where they were originally failing

Also pin the conda installation, but the most recent version on conda is 3.31.2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150158
Approved by: https://github.com/cyyever, https://github.com/malfet
2025-03-28 15:49:17 +00:00
350a479146 Fix test failures on non-x86 Linux (#148445)
The cpp contexts are only supported on x86 Linux.
The tests requiring them are skipped on non-Linux platforms but not on non-x86 architectures.
Most places check for ARM64, which is not enough; a check for x86 is required instead.

Fix the test decorators and factor out a common one in test_cuda.
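
A shared decorator of this kind could look roughly like the sketch below. The names `IS_X86_LINUX` and `requires_x86_linux` are hypothetical, and the set of machine strings is an assumption about what `platform.machine()` reports on x86 hosts; the point is the positive check for x86 Linux rather than a negative check for ARM64.

```python
import platform
import sys
import unittest

# Assumption: these strings cover the x86 family as reported by platform.machine().
_X86_MACHINES = {"x86_64", "amd64", "i386", "i686"}

IS_X86_LINUX = sys.platform.startswith("linux") and platform.machine().lower() in _X86_MACHINES

# Hypothetical shared decorator: skip unless we are on x86 Linux.
requires_x86_linux = unittest.skipUnless(
    IS_X86_LINUX, "cpp contexts are only supported on x86 Linux"
)


class ExampleTest(unittest.TestCase):
    @requires_x86_linux
    def test_needs_cpp_context(self):
        # Only runs where the cpp contexts are expected to be available.
        self.assertTrue(IS_X86_LINUX)
```

With a single decorator, a test on ppc64le or s390x Linux is skipped just like one on ARM64, instead of slipping through an ARM64-only check.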

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148445
Approved by: https://github.com/eellison
2025-03-28 15:27:44 +00:00
d2c0c65ea1 [Dynamo] Add debug linting option for graph dedupe (#150053)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150053
Approved by: https://github.com/StrongerXi, https://github.com/anijain2305
2025-03-28 14:27:09 +00:00
25309a17f0 [aotd] Config to guess_tangents_stride (#150035)
Differential Revision: [D71907684](https://our.internmc.facebook.com/intern/diff/D71907684)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150035
Approved by: https://github.com/ilyas409, https://github.com/seemethere
2025-03-28 13:54:19 +00:00
7c4e49750e Revert "Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054)"
This reverts commit c16af5d7984872b6ae81476d6cae64bddb7ce664.

Reverted https://github.com/pytorch/pytorch/pull/149054 on behalf of https://github.com/jamesjwu due to Sorry I forgot to fix one last test ([comment](https://github.com/pytorch/pytorch/pull/149054#issuecomment-2761381443))
2025-03-28 13:35:07 +00:00
c16af5d798 Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054)
This PR adds CachingAutotuners that are statically launchable to FXGraphCache's cache entry.

Regular CachingAutotuners, with triton kernels attached to them, are not very good to cache: they are very large, and take huge amounts of space since they track all of the various binary files, along with various metadata. We could probably figure out what information we could delete from the kernel and have it still work, but with StaticCudaLauncher, we no longer have to. Instead, we can cache every compiled triton kernel that is statically launchable.

Because StaticTritonCompileResult is serializable, and designed to have a very small memory footprint, we can save it into FXGraphCache without increasing the cache size significantly. We store it as a part of CompiledFxGraph.triton_bundle.

Then, on load, we repopulate the CachingAutotuner into our CompiledTritonKernel cache.

The upsides of this are many:
- We no longer need to call into a separate process on cache hit
- We can *guarantee* that the triton kernel we got from our cache entry is the one we use to launch again, so no worries about triton's own caching logic
- Once we achieve feature parity and all torch.compiled triton kernels are statically launchable, we can clean up a bunch of TritonBundler code and simplify the cache hit logic.

Fixes #149449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149054
Approved by: https://github.com/oulgen
2025-03-28 13:28:05 +00:00
d4da0e955e [Dynamo] Fix is_compile_supported() when device_type contains device index (#147837)
Fixes #147826

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147837
Approved by: https://github.com/anijain2305
2025-03-28 07:16:29 +00:00
103bf64a3c [export] refactor _Dim into Dim (#149891)
Summary: forward fix T218515233

Test Plan: test_export

Differential Revision: D71769231

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149891
Approved by: https://github.com/jingsh, https://github.com/angelayi
2025-03-28 06:19:03 +00:00
f649ee73ce Use source hashing to generate consistent symbolic ids (#149665)
This PR was inspired by internal models that were cache missing due to PGO. At a high level the problem looks as follows

Run 1, Invocation 1: We do static compile, save some example values in PGO/automatic dynamic

Run 1, Invocation 2: We detect varying inputs, do dynamic compile, get a dynamic graph and save to PGO. Crucially what we save to PGO is actually a superset of what is actually dynamic. If we notice an input was varying, we mark it as dynamic in PGO even if later on that value gets specialized. When a value gets specialized, we actually remove the symbol from the graph. This results in an interesting conundrum where although we are producing the same isomorphic graph, PGO makes the second run cache miss. Let's see how....

Run 2, Invocation 1: We fetch the PGO, over-mark things as dynamic, get a fx graph, look it up in the cache and... whoops! cache miss! This is because of the aforementioned behavior where the PGO profile will cause us to over-allocate symbols. In practice this means we end up saving a graph in cache with symbols x:s1, y:s3 and on second attempt we cache miss with x:s1, y:s6 where symbols s3,s4,s5 were all optimistically marked dynamic by PGO and subsequently specialized.

We solve this problem by hashing the source names. This ensures somewhat stable assignment. To prevent catastrophic symbol collisions, we use linear probing to ensure no collisions.
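
As a rough illustration of the idea (not the actual ShapeEnv code), the sketch below derives a symbol id from a hash of the source name and resolves collisions with linear probing. `assign_symbol_ids`, the use of SHA-256, and the `table_size` parameter are all assumptions made for the example.

```python
import hashlib


def assign_symbol_ids(source_names, table_size=1 << 16):
    """Map each source name to an id derived from a hash of the name.

    Collisions are resolved with linear probing, so every name gets a unique id.
    """
    if len(set(source_names)) > table_size:
        raise ValueError("more sources than available ids")
    ids = {}
    used = set()
    for name in source_names:
        if name in ids:
            continue
        digest = hashlib.sha256(name.encode("utf-8")).digest()
        slot = int.from_bytes(digest[:8], "big") % table_size
        while slot in used:  # linear probing: step to the next free slot
            slot = (slot + 1) % table_size
        used.add(slot)
        ids[name] = slot
    return ids
```

Because the id depends on the source name rather than on allocation order, a symbol keeps the same id (barring collisions) whether or not PGO optimistically allocated extra symbols before it.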

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149665
Approved by: https://github.com/Mingming-Ding, https://github.com/laithsakka
2025-03-28 05:36:32 +00:00
c49315e645 Improve attr mismatch msg (#149576)
Differential Revision: [D71513041](https://our.internmc.facebook.com/intern/diff/D71513041)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149576
Approved by: https://github.com/avikchaudhuri
2025-03-28 05:10:56 +00:00
fdc4394b16 Do not fetch NCCL when system NCCL is used (#149607)
We are compiling PyTorch in a sandbox without networking. Unconditionally fetching breaks the build and is not needed when a system NCCL is used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149607
Approved by: https://github.com/malfet
2025-03-28 05:06:49 +00:00
c9ebf517c2 [dynamo][invoke_subgraph] Input aliasing and mutation check in Dynamo (#148953)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148953
Approved by: https://github.com/zou3519
ghstack dependencies: #149087, #149667, #150036
2025-03-28 03:50:07 +00:00
c18e2ce53b Ignore meta ops in inductor (#150137)
Fix for https://github.com/pytorch/pytorch/issues/144607

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150137
Approved by: https://github.com/BoyuanFeng
2025-03-28 03:01:57 +00:00
ddb1e97839 Revert "Support torchbind in OSS proxy executor (#149747)"
This reverts commit aa70d62041c28fe35c416aa932b32ef0e4d5bc33.

Reverted https://github.com/pytorch/pytorch/pull/149747 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/149747#issuecomment-2760040741))
2025-03-28 02:48:02 +00:00
2f785ab208 dynamo_compile: Log all compilation time under all_compilation_types (#149664)
This counter is designed to include all compilation pytorch does (triton +
dynamo_compile). However this wasn't including all of dynamo compilation, since
it was put in at the fx_codegen_and_compile spot.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149664
Approved by: https://github.com/masnesral
2025-03-28 02:27:48 +00:00
8a872261dc Add one_shot_all_reduce_copy to allow non-symm-mem allocated tensors to be reduced (#150129)
Per title, we want to be able to use it even if inputs are not registered. Separate copy would add latency, and one-shot is all about the lowest possible latency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150129
Approved by: https://github.com/xw285cornell
2025-03-28 02:14:27 +00:00
1e55b9c0b5 Fix autotune pool shutdown (#149890)
Summary: A couple follow-ups noted in review from https://github.com/pytorch/pytorch/pull/149700:
1. Make sure we correctly signal _all_ subprocesses to shutdown, even in the case where some processes are currently benchmarking.
2. Change how the pool singleton is created. That also allows us to fully initialize the object in the ctor and remove a bunch of asserts.

Test Plan: existing unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149890
Approved by: https://github.com/aorenste
ghstack dependencies: #149700
2025-03-28 02:09:51 +00:00
266bd22b44 Improve subproc autotuning implementation (#149700)
Summary: The primary change is to update the autotune-in-a-subproc implementation to avoid using multiprocessing spawn. Spawn (re)executes the toplevel script in the subproc, which can be problematic. The approach here is similar to Triton parallel compile: we Popen a subproc on a controlled entry point and communicate over pipes. That change drove a lot of refactoring in the TuningProcess class, so I took the opportunity to simplify some things, rename some methods, etc.

One other notable change is around the timeout / kill approach. After a timeout, we were previously attempting to stop the subproc in three steps (graceful shutdown, sigkill if graceful fails, sigterm if sigkill fails). I'm going to argue that's not useful: 1) The graceful shutdown is never going to work unless the subproc happens to have just completed its task and is ready to receive the next command. 2) If we're going to kill the subproc, let's just take the most aggressive approach and move on as quickly as possible to restarting it rather than waiting to see if previous shutdown attempts succeeded. The only downside that I can find is maybe a little log spew, e.g., `ResourceWarning: subprocess 2987680 is still running`
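
The Popen-plus-pipes approach and the "kill immediately on timeout" policy can be sketched as follows. This is a simplified, one-shot illustration, not the TuningProcess implementation (which keeps a long-lived subprocess): `run_in_subproc`, `ECHO_UPPER`, and `HANG` are hypothetical names.

```python
import subprocess
import sys


def run_in_subproc(child_source, payload, timeout):
    """Run child_source in a fresh interpreter, send payload over stdin, and
    return its stdout, or None if it did not finish within timeout seconds."""
    proc = subprocess.Popen(
        [sys.executable, "-c", child_source],  # controlled entry point, not the parent's script
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        out, _ = proc.communicate(payload, timeout=timeout)
        return out
    except subprocess.TimeoutExpired:
        proc.kill()         # no graceful attempt: kill right away
        proc.communicate()  # reap the child and close the pipes
        return None


# Stand-ins for a well-behaved and a hung benchmark worker.
ECHO_UPPER = "import sys; sys.stdout.write(sys.stdin.read().upper())"
HANG = "import time; time.sleep(60)"
```

Unlike multiprocessing spawn, nothing from the parent's top-level script is re-executed in the child; the child only runs the entry point it was given.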

List of changes:
* Use Popen instead of spawn for the autotuning subprocess.
* Introduced a new entry point `__autotune_main__.py`
* Renamed some TuningProcess methods. For example `shutdown` makes more sense than `terminate` because the latter implies a forced kill.
* Simplified the implementation around benchmarking timeout and how we kill the subproc after a timeout.
* Deprecated the unused timeout configs in `_inductor/config.py`
* Moved `get_ld_library_path` helper to a common utils file.
* Added more unit tests for subproc crashes / timeouts / exceptions, etc.

Test plan:
* New unit tests
* Also ran internally with all combinations of: build mode `opt` and `dev-nosan`, and `buck run` vs. executing the `.par` file directly.
* Made sure the functionality to parallelize autotuning across different GPUs is working (it wasn't clear to me this was behaving the way we wanted it to).

Differential Revision: [D71976971](https://our.internmc.facebook.com/intern/diff/D71976971)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149700
Approved by: https://github.com/aorenste, https://github.com/jansel, https://github.com/eellison
2025-03-28 01:06:39 +00:00
8b04364914 [Easy/Profiler] Set Duration to -1 for unfinished CPU events (#150131)
Summary: Some OSS Kineto users were requesting that we allow for 0 duration events in Kineto even though they won't be seen on the trace. To allow this we changed the handling of said events in D71510383. However this causes unfinished events in collection to never be post processed; this diff fixes said issue.

Test Plan: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/0/1743102222/localhost/libkineto_activities_631490.json.gz&bucket=gpu_traces

Differential Revision: D71993609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150131
Approved by: https://github.com/chuanhaozhuge, https://github.com/xw285cornell
2025-03-28 00:29:22 +00:00
aa70d62041 Support torchbind in OSS proxy executor (#149747)
Summary:
Implement torchbind support in OSSProxyExecutor.

Exactly the same as the implementation in FbProxyExecutor.

D69693697 - fbProxyExecutor
D69887230 - fbProxyExecutor but for torchbind method

Other changes:

- When generating the schema of the CallTorchBind HOP, the arg name of the torchbind object arg should be the same as the torchbind method's torchbind object arg (instead of `obj`).

- In `AOTIModelPackageLoader`, we extract everything in `data/constants` to `tmp_dir/data/aot_inductor/<model>/` folder, so the torchbind objs exist in the same folder as the rest of the files (e.g. cpp, so). This is to be consistent with how files are packaged internally

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r torchbind_aoti

buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile
```

Differential Revision: D69500038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149747
Approved by: https://github.com/desertfire
2025-03-28 00:04:19 +00:00
d670df356c Improve error handling when checking CUDA version in case nvcc is not found (#148671)
Fixes:
- https://github.com/pytorch/pytorch/issues/101138

**Description**
The PR enhances error handling in `_check_cuda_version` by verifying the existence of the `nvcc` executable before invoking `subprocess.check_output`. If `nvcc` is missing, a `FileNotFoundError` is raised with a clear message, guiding users to check their CUDA installation and path configuration.
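
A minimal sketch of this kind of check, assuming `nvcc` lives under `<cuda_home>/bin`; the function name `nvcc_version_output` and the exact error text are illustrative and not the code from the PR.

```python
import os
import subprocess


def nvcc_version_output(cuda_home):
    """Return the output of `nvcc --version`, failing early with a clear
    message if the nvcc executable is missing."""
    nvcc = os.path.join(cuda_home, "bin", "nvcc")
    if not os.path.exists(nvcc):
        raise FileNotFoundError(
            f"nvcc not found at '{nvcc}'. Check that CUDA is installed "
            "and that the CUDA path is configured correctly."
        )
    return subprocess.check_output([nvcc, "--version"]).decode()
```

Checking for the file first turns an opaque failure inside `subprocess.check_output` into an error that names the path that was tried.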

**Testing**
Manually tested with and without `nvcc` present in the expected path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148671
Approved by: https://github.com/malfet
2025-03-27 23:04:59 +00:00
2b20d1433f cpp_wrapper: precompile a few more commonly used headers, and improve RAIIPyObject interface (#149350)
Add includes for torch.device, torch.dtype, torch.layout, and torch.memory_format to the cpp_wrapper common header, so that they get precompiled. Additionally, add move constructors and operator bool to RAIIPyObject.

Closes #142005.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149350
Approved by: https://github.com/desertfire
ghstack dependencies: #147225
2025-03-27 23:00:01 +00:00
ef1cb6b646 [BE] Suppress user_warnings while running opinfo tests (#150115)
Some of the samples are constructed in a way that is expected to trigger those, but what's the point of displaying them?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150115
Approved by: https://github.com/dcci
ghstack dependencies: #150060
2025-03-27 22:36:27 +00:00
1a3bd894ff Revert "[fbcode]Removing @NoIntBaseDeprecated annotation in caffe2.thrift file (#149742) (#149744)"
This reverts commit 6eac3a0068f028d03897ce38e0cfec11812591fe.

Reverted https://github.com/pytorch/pytorch/pull/149744 on behalf of https://github.com/malfet due to Broke tests, see 80aa88f907/1 ([comment](https://github.com/pytorch/pytorch/pull/149744#issuecomment-2759676260))
2025-03-27 22:31:54 +00:00
4c57aec5b9 Dont exclude constant_pad_nd in prologue fusion (#149947)
Originally, I excluded constant_pad_nd from fusing to be conservative on compilation time. But, on benchmarking, you do occasionally get speedups by fusing it. Also includes a fix for making single, contiguous dep for prologues.

For instance, the following benchmark gets a 7% speedup by fusing in the constant_pad_nd.

```
import torch
import torch.nn.functional as F
torch._inductor.config.force_disable_caches = True

padded_N = 2048
n_pad_rows = 100

K, N = 2048, 4096

tensor1 = torch.randn(padded_N - n_pad_rows, 4096, device="cuda").to(torch.bfloat16)
tensor2 = torch.randn(4096, 4096, device="cuda").to(torch.bfloat16)

@torch.compile(mode='max-autotune-no-cudagraphs')
def masked_linear(input, weight, n_pad_input_rows):
    """
    Linear layer with input padded by `n_pad_input_rows` rows
    """
    # Use constant_pad_nd to pad with zeros for the invalid rows
    padded_input = F.pad(input, (0, 0, 0, n_pad_input_rows), "constant", 0)
    return F.linear(padded_input, weight)

# Invoke the function
masked_linear(tensor1, tensor2, n_pad_rows)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149947
Approved by: https://github.com/drisspg
2025-03-27 22:26:30 +00:00
80aa88f907 Revert "Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054)"
This reverts commit ac91f8765ba7817a0853f0520e7f9c94768babc2.

Reverted https://github.com/pytorch/pytorch/pull/149054 on behalf of https://github.com/yangw-dev due to This is breaking ROCM tests on trunk. hud.pytorch.org/ ([comment](https://github.com/pytorch/pytorch/pull/149054#issuecomment-2759604301))
2025-03-27 22:15:40 +00:00
21bcbbfb5e fix range constraints for expr (#150103)
During tracing it is possible for a `s1: VR[2, inf]` to be replaced by a `s0: VR[3, inf]` (note smaller range) by the shape env. But after export, unfortunately we'd previously record `range_constraints[s0] = VR[2, inf]` (note larger range), which is incorrect.

This is because we'd map `s1.node.expr` (`s0`) to the `var_to_range` of `s1.node._expr` (`s1`) when creating `range_constraints`. The comment surrounding this code suggests this predated `bound_sympy`, but now we can do better.
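
The mismatch can be reproduced with a toy model in plain Python. The dictionaries and function names below are hypothetical stand-ins for the shape env state, and the real fix relies on `bound_sympy` rather than a dictionary lookup; the sketch only shows why pairing the replaced expression with the original symbol's range is wrong.

```python
import math

# Assumed toy state after tracing: s1 was replaced by s0, and s0 has the tighter range.
var_to_range = {"s0": (3, math.inf), "s1": (2, math.inf)}
replacements = {"s1": "s0"}


def range_constraint_old(symbol):
    """Buggy pairing: the key is the replaced expression, the value is the
    range of the original symbol."""
    return replacements.get(symbol, symbol), var_to_range[symbol]


def range_constraint_fixed(symbol):
    """Look up the range of the expression that is actually recorded."""
    expr = replacements.get(symbol, symbol)
    return expr, var_to_range[expr]
```

Here `range_constraint_old("s1")` records `s0` with the looser lower bound 2, while `range_constraint_fixed("s1")` records `s0` with its own lower bound 3.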

For users, this means that when using `Dim.DYNAMIC` previously they wouldn't get input constraints checked sufficiently, now they do (shifting errors early).

Differential Revision: D71962694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150103
Approved by: https://github.com/zhxchen17
2025-03-27 22:11:39 +00:00
68414512e6 Implement aten.select.int sharding strategy (#149842)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149842
Approved by: https://github.com/XilunWu
2025-03-27 20:49:00 +00:00
d25acac357 cpp_wrapper: Fix even more tests (#147225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147225
Approved by: https://github.com/desertfire
2025-03-27 19:21:03 +00:00
0ed0b7fa96 [aoti] Better error message when torchbind object is used as a graph input in AOTI (#149965)
Summary: Give an explicit error when a torchbind object is used as input to AOTI

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r test_torchbind_input
```

Differential Revision: D69490915

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149965
Approved by: https://github.com/desertfire
2025-03-27 18:48:55 +00:00
a9d08ed0ce Revert "Parallelize sort (#149505)"
This reverts commit 842d51500be144d53f4d046d31169e8f46c063f6.

Reverted https://github.com/pytorch/pytorch/pull/149505 on behalf of https://github.com/ZainRizvi due to Reverting since this is breaking inductor builds on trunk. More details [GH job link](https://github.com/pytorch/pytorch/actions/runs/14000726218/job/39207447863) [HUD commit link](842d51500b) ([comment](https://github.com/pytorch/pytorch/pull/149505#issuecomment-2759082390))
2025-03-27 18:43:11 +00:00
01cb3519b3 wire torch._scaled_mm with fp4 operands to the cublas nvfp4 kernel (#148792)
Summary:

When `a` and `b` have dtype `torch.float4_e2m1fn_x2` and `a_scale` and `b_scale` have dtype `torch.float8_e4m3fn`, makes

```python
c = torch._scaled_mm(a, b, a_scale, b_scale, out_dtype=torch.bfloat16)
```

call the cuBLAS fp4 gemm kernel, as specified in https://docs.nvidia.com/cuda/cublas/index.html?highlight=fp4#d-block-scaling-for-fp8-and-fp4-data-types

note: output scale (`scale_in_D` from the cuBLAS docs) is not tested in this PR - we can enable in a follow-up.

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k mxfp8_nvfp4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148792
Approved by: https://github.com/eqy
ghstack dependencies: #148791
2025-03-27 17:32:20 +00:00
e33bc41958 add torch.float4_e2m1fn_x2 to PyTorch (#148791)
Summary:

Redo of https://github.com/pytorch/pytorch/pull/146578 to get around
rebase conflicts.

Test Plan:

```
pytest test/quantization/core/experimental/test_floatx.py -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148791
Approved by: https://github.com/drisspg, https://github.com/eqy, https://github.com/jeffdaily
2025-03-27 17:32:20 +00:00
ac91f8765b Store statically launchable CachingAutotuners inside CompiledFXGraph.triton_bundle (#149054)
This PR adds CachingAutotuners that are statically launchable to FXGraphCache's cache entry.

Regular CachingAutotuners, with triton kernels attached to them, are not very good to cache: they are very large, and take huge amounts of space since they track all of the various binary files, along with various metadata. We could probably figure out what information we could delete from the kernel and have it still work, but with StaticCudaLauncher, we no longer have to. Instead, we can cache every compiled triton kernel that is statically launchable.

Because StaticTritonCompileResult is serializable, and designed to have a very small memory footprint, we can save it into FXGraphCache without increasing the cache size significantly. We store it as a part of CompiledFxGraph.triton_bundle.

Then, on load, we repopulate the CachingAutotuner into our CompiledTritonKernel cache.

The upsides of this are many:
- We no longer need to call into a separate process on cache hit
- We can *guarantee* that the triton kernel we got from our cache entry is the one we use to launch again, so no worries about triton's own caching logic
- Once we achieve feature parity and all torch.compiled triton kernels are statically launchable, we can clean up a bunch of TritonBundler code and simplify the cache hit logic.

Fixes #149449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149054
Approved by: https://github.com/oulgen
ghstack dependencies: #149657
2025-03-27 17:14:44 +00:00
6eac3a0068 [fbcode]Removing @NoIntBaseDeprecated annotation in caffe2.thrift file (#149742) (#149744)
Summary:

To align with thrift-python, we are adding the int base class for `non-Flag` enums. In order to not break production code, the annotation `python.NoIntBaseClassDeprecated` was added to opt some enums out.

After the related customer code logic changes, we can now safely remove the annotations that were added earlier.

Our ultimate goal is to unconditionally add the `int` base to `thrift-py3` enums.

Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/training_toolkit/applications/bulk_eval/tests:evaluator_test -- --exact 'caffe2/torch/fb/training_toolkit/applications/bulk_eval/tests:evaluator_test - test_setup_evaluation_utils (caffe2.torch.fb.training_toolkit.applications.bulk_eval.tests.evaluator_test.EvaluatorTest)'
```

Reviewed By: ahilger

Differential Revision: D71446522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149744
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
2025-03-27 17:11:26 +00:00
14f0cd7630 [StaticCudaLauncher] Support sharedMemBytes > 48KB (#149657)
Triton does some special handling when a kernel requests more than 48 KB of shared memory: it queries the device for its maximum shared memory, then sets the kernel's maximum dynamic shared memory to the difference between that maximum and the kernel's static shared memory.

See corresponding implementation in triton land here:
https://github.com/triton-lang/triton/blob/main/third_party/nvidia/backend/driver.c#L128-L143
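
The budget arithmetic behind this handling can be sketched in plain Python. This is an illustration only: the function name, its parameters and the example numbers are hypothetical, the bounds check is added for this sketch, and the real logic lives in the C driver code linked above.

```python
from typing import Optional

# 48 KB: the amount of shared memory a kernel can use without special handling.
DEFAULT_SHARED_LIMIT = 48 * 1024


def max_dynamic_shared_bytes(
    requested: int, device_max: int, static_bytes: int
) -> Optional[int]:
    """Return the dynamic shared memory cap to set on the kernel, or None.

    requested    -- dynamic shared memory the launch asks for (sharedMemBytes)
    device_max   -- maximum shared memory reported by the device (assumed input)
    static_bytes -- shared memory the kernel already uses statically
    """
    if requested <= DEFAULT_SHARED_LIMIT:
        return None  # at or below 48 KB: no special handling needed
    budget = device_max - static_bytes
    if requested > budget:
        # Sanity check added for this sketch only.
        raise ValueError(f"requested {requested} bytes, only {budget} available")
    return budget


print(max_dynamic_shared_bytes(64 * 1024, 232448, 2048))
```

The cap depends on the kernel's static usage, which is why it has to be computed per kernel rather than set once per device.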

Test plan:
- New unit test requesting more than 48 KB of memory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149657
Approved by: https://github.com/jansel
2025-03-27 17:00:18 +00:00
85e4e51a7d Fix bug in _load_state_dict_from_keys method (#150058)
Summary:
The docstring of the `_load_state_dict_from_keys` method says: `Loads any key specified in this set. If no keys are specified, the entire checkpoint is loaded.`
But this isn't happening right now, because an empty keys arg is passed in as a set() to `_load_state_dict`, and keys is expected to be None for everything to actually be included in the state_dict (https://fburl.com/code/l8yzojyx). So with the set() argument, the state_dict is always going to be empty.
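
A minimal sketch of this bug pattern, using toy stand-ins rather than the actual checkpoint code (all function names below are hypothetical):

```python
from typing import Optional


def load_selected(checkpoint: dict, keys: Optional[set] = None) -> dict:
    """Toy inner loader: `None` means "load everything"."""
    if keys is None:
        return dict(checkpoint)
    return {k: v for k, v in checkpoint.items() if k in keys}


def load_from_keys_buggy(checkpoint: dict, keys=None) -> dict:
    # Bug pattern: "no keys given" is normalized to an empty set, which the
    # inner loader reads as "load nothing".
    return load_selected(checkpoint, set(keys or ()))


def load_from_keys_fixed(checkpoint: dict, keys=None) -> dict:
    # Fix pattern: forward None when no keys were specified.
    return load_selected(checkpoint, set(keys) if keys else None)


ckpt = {"model.weight": 1, "optim.lr": 2}
print(load_from_keys_buggy(ckpt))  # empty dict
print(load_from_keys_fixed(ckpt))  # the whole checkpoint
```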

Test Plan: ensure existing tests pass

Differential Revision: D71930712

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150058
Approved by: https://github.com/saumishr
2025-03-27 16:36:00 +00:00
d75921d3a6 Fix sparse CUTLASS-based kernels (#150023)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150023
Approved by: https://github.com/jcaip
ghstack dependencies: #149978
2025-03-27 16:23:55 +00:00
c830d750e6 [graph partition] support splitting on custom ops (#149782)
This PR adds support for graph partition on custom ops. Land after #149458.

### API
This PR provides a new API to register/unregister custom ops for graph partition.

```python
def register_custom_op_support_cudagraph(
    operator: torch._library.custom_ops.CustomOpDef,
    is_cudagraphable: bool,
) -> None
```

Example usage:

```python
import torch
from torch._inductor.utils import register_custom_op_support_cudagraph

@torch.library.custom_op("mylib::movement", mutates_args=())
def movement(pic: torch.Tensor) -> torch.Tensor:
    img = pic.cpu()
    cropped_img = (img + 1) * 2
    return cropped_img.cuda() / 255.0

@movement.register_fake
def _(pic):
    return torch.empty_like(pic)

register_custom_op_support_cudagraph(movement, is_cudagraphable=False)
```

### Example
In this example, 1 torch-compiled region has 3 cudagraphs after splitting on 2 custom ops.

![image](https://github.com/user-attachments/assets/6d07355b-6690-4cde-89ef-e4aff6b0079c)

Code to repro:
```python
import torch
from torch._inductor.utils import register_custom_op_support_cudagraph

torch._inductor.config.graph_partition = True

@torch.library.custom_op("mylib::movement", mutates_args=())
def movement(pic: torch.Tensor) -> torch.Tensor:
    img = pic.cpu()
    cropped_img = (img + 1)*2
    return cropped_img.cuda() / 255.

@movement.register_fake
def _(pic):
    return torch.empty_like(pic)

@torch.library.custom_op("mylib::modify", mutates_args=())
def modify(pic: torch.Tensor) -> torch.Tensor:
    pic1 = pic + 1
    pic1_cpu = (pic1.cpu() + 1) * 2
    return pic1_cpu.cuda() + pic

@modify.register_fake
def _(pic):
    return torch.empty_like(pic)

@torch.library.custom_op("mylib::transform", mutates_args=())
def transform(pic: torch.Tensor) -> torch.Tensor:
    return (pic + 1) * 2

@transform.register_fake
def _(pic):
    return torch.empty_like(pic)

register_custom_op_support_cudagraph(movement, is_cudagraphable=False)
register_custom_op_support_cudagraph(modify, is_cudagraphable=False)

img = torch.randn(3, 64, 64, device="cuda")

def f(img):
    x = (img + 10) * 2
    y = movement(x)
    z = y + 1
    u = transform(z)
    v = 2*u + 1
    out = modify(v)
    return out + 1

compiled_f = torch.compile(f, mode="reduce-overhead", fullgraph=True)

eager_out = f(img)

for _ in range(3):
    compiled_out = compiled_f(img)
    assert torch.allclose(eager_out, compiled_out)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149782
Approved by: https://github.com/zou3519
2025-03-27 16:23:07 +00:00
efc975feb2 Revert "[triton] Warp specialization support in torchinductor (#148503)"
This reverts commit 36183215e8845b54cdb69097e2b688fa9e4d3daf.

Reverted https://github.com/pytorch/pytorch/pull/148503 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/148503#issuecomment-2758590645))
2025-03-27 16:06:42 +00:00
af7719a2fa Revert "Use source hashing to generate consistent symbolic ids (#149665)"
This reverts commit 1f92348dc6c60e3020a723b37ecb8226cf2480c0.

Reverted https://github.com/pytorch/pytorch/pull/149665 on behalf of https://github.com/malfet due to Broke trunk, see 6eb3c2e282/1 ([comment](https://github.com/pytorch/pytorch/pull/149665#issuecomment-2758578187))
2025-03-27 16:02:27 +00:00
6eb3c2e282 Update xla pin (#149381)
Update xla pin to fix the github test failure issue. [failure link](https://hud.pytorch.org/failure?name=pull+%2F+linux-focal-py3_9-clang9-xla+%2F+test+%28xla%2C+1%2C+1%2C+lf.linux.12xlarge%29&jobName=linux-focal-py3_9-clang9-xla+%2F+test+%28xla%2C+1%2C+1%2C+lf.linux.12xlarge%29&failureCaptures=%5B%22test_call_jax_pytree%22%2C%22TestJaxInterop%22%5D).

The test now runs the torch_xla jax test with the jax/jaxlib dependencies installed, as we did in https://github.com/pytorch/xla/pull/8781/files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149381
Approved by: https://github.com/atalman
2025-03-27 13:53:25 +00:00
36183215e8 [triton] Warp specialization support in torchinductor (#148503)
Summary:
Currently only `num_warps` and `num_stages` are supported as kernel options for inductor auto-tuning using `TritonTemplate`. To allow warp specialization, kernel options should also allow specifying `num_consumer_groups` and `num_buffers_warp_spec`.

Test Plan:
## Unit test
Added tests for `test_triton_template_warp_specialization` to verify that the generated kernel contains configs for `num_consumer_groups` and `num_buffers_warp_spec`.

## Functional Testing
Specific to flexattention.
```
import torch
from torch.nn.attention.flex_attention import flex_attention

from triton.testing import do_bench

make_tensor = lambda: torch.rand(8, 16, 8192, 128, device="cuda", dtype=torch.bfloat16)
q, k, v = make_tensor(), make_tensor(), make_tensor()

flex_compiled = torch.compile(flex_attention, fullgraph=True)

print(do_bench(lambda: flex_compiled(q, k, v, kernel_options={"num_warps": 4})))
```

triton do_bench results:
- default compile: 15.176783561706543
- with warp-spec: 9.452800750732422

## Extra notes
- generated triton kernel using `TORCH_LOGS=output_code`: P1740612877
- TTGIR for fused kernel: P1740614685

Differential Revision: D70212243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148503
Approved by: https://github.com/eellison
2025-03-27 13:07:50 +00:00
f0e1a0838c Enabling xpu in OffsetBasedRNGTracker . (#148360)
Else torch.distributed breaks on xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148360
Approved by: https://github.com/zhangxiaoli73, https://github.com/guangyey, https://github.com/gujinghui, https://github.com/XilunWu, https://github.com/kwen2501

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-03-27 10:55:05 +00:00
e175929b8c Make codegen dynamic shapes more device agnostic (#146830)
Currently, in many places in inductor, devices are assumed to be one of:

- CPU with Cpp codegen, or
- GPU with triton codegen

This is not always the case: a CPU backend may be using the triton CPU backend, or some other codegen entirely. This PR goes some way to fixing it in the case where a CPU backend can use triton scheduling.

A more general solution could be implemented, but this would need to be quite robust, and is probably best done more centrally and by someone who can do more testing with CUDA devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146830
Approved by: https://github.com/eellison, https://github.com/albanD, https://github.com/guangyey

Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
2025-03-27 10:40:49 +00:00
6cbcdee944 Introduce guard_or_true, guard_or_false (#148430)
Some context is in this document:
https://docs.google.com/document/d/18nJsj-F2C_QXO7ClwzPcAUENQ-B440B43W7DdDnlDt4/edit?tab=t.0#heading=h.pgebnyi7pocj

TL;DR:
`guard_or_true` and `guard_or_false` are better than `guard_size_oblivious` because they:
- Make it easier to reason about what assumptions we are making while reading the code.
- Avoid size_oblivious complexity that is not needed.
- Avoid unsoundness that could make `guard_size_oblivious(a==1)` be true when it is not true for some value `a` at runtime.
- Produce fewer data dependent errors in some cases. For example, when doing `guard_size_oblivious(a==1)` where we know `a` is a tensor size, if it is traced with `a=u1-u2`, `guard_size_oblivious(a==1)` will throw a data dependent error but `guard_or_false` will just return `False`.

### How is it different from statically_known_true??
**`if(cond)`:** (normal guarding) tries to evaluate the condition statically and guards on it, and is willing to restrict the input space to evaluate cond. If it fails to evaluate due to a data dependent error, it throws an exception (which could be converted to a graph break in some situations).

**`statically_known_true(cond)`:** used when you never want to add a guard (restrict your input space), but just want a best-effort check to see if you can infer that something is true/false ONLY based on existing constraints.

**`guard_or_true(cond)`/`guard_or_false(cond)`:** used in situations where you prefer to guard and know the result of the expression over not guarding, but if you hit a data dependent error you are OK with just returning true or false.
Some reasons you might be OK with returning true/false instead could be:
1. It's an optimization, and I do not want to fail just because the optimization could not be performed.
2. I am willing to deviate from the normal semantics when I have unbacked symbols, for the benefit of not failing (see the doc above for more details).

**`definitely_true(cond)`**: same as `guard_or_false(cond)` except that it does not try to do static eval for unbacked symbols (planning to deprecate it and replace uses with `guard_or_false`, or make it an alias of `guard_or_false`).
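
The fallback behaviour described above can be illustrated with a toy model. This is not the torch implementation: the evaluator, the exception class and the use of `None` to mean "undecidable" are assumptions made for this sketch, and the model does not record guards at all.

```python
class DataDependentError(Exception):
    """Toy error: the condition cannot be decided at trace time."""


def evaluate(cond):
    """Toy evaluator: `cond` is True, False, or None (None = undecidable)."""
    if cond is None:
        raise DataDependentError("condition depends on unbacked data")
    return cond


def guard_or_true(cond):
    """Evaluate normally; on a data dependent error, answer True."""
    try:
        return evaluate(cond)
    except DataDependentError:
        return True


def guard_or_false(cond):
    """Evaluate normally; on a data dependent error, answer False."""
    try:
        return evaluate(cond)
    except DataDependentError:
        return False


# Decidable conditions pass through unchanged; undecidable ones take the default.
print(guard_or_true(False), guard_or_true(None))
print(guard_or_false(True), guard_or_false(None))
```

Which variant to pick depends on which answer is the safe one at the call site: the default should be the branch that stays correct even when the condition cannot be decided.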

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148430
Approved by: https://github.com/bobrenjc93
2025-03-27 09:34:05 +00:00
a9ee797e41 added fake tensor support for foreach_copy (#149127)
Fixes #149111

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149127
Approved by: https://github.com/jansel, https://github.com/jeromean
2025-03-27 09:26:23 +00:00
7aacbab0b3 Update Doc for Intel XPU Profiling (#134515)
Updated the two pages below for Intel XPU:
https://pytorch.org/docs/stable/torch.compiler_profiling_torch_compile.html
https://pytorch.org/docs/stable/profiler.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134515
Approved by: https://github.com/dvrogozh, https://github.com/malfet
2025-03-27 09:15:35 +00:00
e6afb51805 [AOTInductor] Free folded constants that's managed by AOTInductor (#149825)
internally.

Summary:
This diff allows freeing folded constants that are created by
AOTInductor through CUDACachingAllocator, rather than through the
constant blob allocated directly with cudaMalloc.

Test Plan:
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/home/$USER/local/pytorch/build/bin/test_aoti_inference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149825
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jingsh
2025-03-27 06:05:50 +00:00
e080bac533 Revert "Introduce guard_or_true, guard_or_false (#148430)"
This reverts commit d5593ea31ceb2590336cc9815ee2c13a18db6cd7.

Reverted https://github.com/pytorch/pytorch/pull/148430 on behalf of https://github.com/laithsakka due to need to fix stuff ([comment](https://github.com/pytorch/pytorch/pull/148430#issuecomment-2756701436))
2025-03-27 05:10:20 +00:00
748252378d [ca] introduce RuntimeState to support c++ hooks via graph breaks (#149987)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149987
Approved by: https://github.com/jansel
ghstack dependencies: #149647, #149709, #149651, #149897
2025-03-27 05:05:34 +00:00
dcb378cff2 [ca] support anomaly mode nan checks with different semantics than eager (#149897)
see note in code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149897
Approved by: https://github.com/jansel
ghstack dependencies: #149647, #149709, #149651
2025-03-27 05:05:34 +00:00
488b87cb68 [BE] do not retain/release tensor (#150075)
`Tensor::as_strided__symint` is an in-place op that returns self, so there is no need to retain it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150075
Approved by: https://github.com/angelayi, https://github.com/atalman, https://github.com/cyyever
2025-03-27 03:43:14 +00:00
1f92348dc6 Use source hashing to generate consistent symbolic ids (#149665)
This PR was inspired by internal models that were cache missing due to PGO. At a high level the problem looks as follows

Run 1, Invocation 1: We do static compile, save some example values in PGO/automatic dynamic

Run 1, Invocation 2: We detect varying inputs, do dynamic compile, get a dynamic graph and save to PGO. Crucially what we save to PGO is actually a superset of what is actually dynamic. If we notice an input was varying, we mark it as dynamic in PGO even if later on that value gets specialized. When a value gets specialized, we actually remove the symbol from the graph. This results in an interesting conundrum where although we are producing the same isomorphic graph, PGO makes the second run cache miss. Let's see how....

Run 2, Invocation 1: We fetch the PGO, over-mark things as dynamic, get a fx graph, look it up in the cache and... whoops! cache miss! This is because of the aforementioned behavior where the PGO profile will cause us to over-allocate symbols. In practice this means we end up saving a graph in cache with symbols x:s1, y:s3 and on second attempt we cache miss with x:s1, y:s6 where symbols s3,s4,s5 were all optimistically marked dynamic by PGO and subsequently specialized.

We solve this problem by hashing the source names. This ensures somewhat stable assignment. To prevent catastrophic symbol collisions, we use linear probing to ensure no collisions.
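
The idea of hashing source names with linear probing can be sketched as follows. This is an illustration under assumptions, not the code from the PR: the function name, the table size and the choice of SHA-256 are all hypothetical.

```python
import hashlib


def assign_symbol_id(source_name: str, used: set, table_size: int = 1 << 20) -> int:
    """Pick a symbol id derived from the source name, probing past taken slots."""
    if len(used) >= table_size:
        raise RuntimeError("symbol table is full")
    # A content hash is stable across processes, unlike Python's built-in
    # hash() for strings, which is randomized per process by default.
    digest = hashlib.sha256(source_name.encode("utf-8")).digest()
    slot = int.from_bytes(digest[:8], "big") % table_size
    # Linear probing: step to the next slot until a free one is found.
    while slot in used:
        slot = (slot + 1) % table_size
    used.add(slot)
    return slot


run1, run2 = set(), set()
first = assign_symbol_id("L['x'].size()[0]", run1)
second = assign_symbol_id("L['x'].size()[0]", run2)
print(first == second)
```

Because the id comes from the source name rather than from an allocation counter, symbols that are allocated and later specialized away do not shift the ids of the symbols that remain, unless a probing collision occurs. That matches the "somewhat stable" wording above.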

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149665
Approved by: https://github.com/Mingming-Ding, https://github.com/laithsakka
2025-03-27 03:39:27 +00:00
ae29f054f5 [Async TP] More robust support for rowwise scales when fusing matmul reduce-scatter (#149247)
Part of https://github.com/pytorch/torchtitan/issues/866

## Context
- Async TP needs to support the "reshape -> scaled_mm -> reshape" pattern because scaled mm only supports 2D input tensors and 2D scales.
    - (a,b,c) => (a*b,c)
    - (a\*b,c) @ (c,d) = (a\*b,d)
    - (a\*b,d) => (a,b,d)

- Currently the implementation does not support scaled mm with rowwise scales **for all cases** of the reshape -> scaled_mm -> reshape pattern. The minimal example of this pattern is confirmed to work via this [unit test](00a2c68f67/test/distributed/tensor/parallel/test_micro_pipeline_tp.py (L406)), but more involved e2e examples in torchtitan fail silently (more context in final bullet point).
- Previously, the "A tensor" **node** referenced in the async TP graph manipulation code was the 3D+ node before the reshape, but the "A_scale" node was the 2D node from after the reshape, so they were incompatible.
- I previously implemented a simpler solution to this problem in https://github.com/pytorch/pytorch/pull/148001, with a [unit test](https://github.com/pytorch/pytorch/pull/148001/files#diff-115f1d0852382c9b58f22640d80999d879b33618e5f6c633fc9e4d0ca9781cecR406) confirming the fused node is indeed in the graph for the minimal example of the reshape->mm->reshape pattern. I also confirmed via manual e2e testing w/ torchtitan that the crash I was fixing no longer occurred. However, it turns out that this [bug in torchtitan](https://github.com/pytorch/torchtitan/issues/866) was causing async TP to fail silently and fall back to vanilla TP, hiding the fact that the original solution fixed the crash but the fusion would not occur for rowwise scales. Thus, a more robust solution is needed to support all cases.
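
The shape bookkeeping of the "reshape -> scaled_mm -> reshape" pattern from the first bullet can be sketched with shape tuples only. The helper names are hypothetical and no tensors or scales are involved:

```python
from math import prod


def flatten_leading_dims(shape: tuple) -> tuple:
    """(a, b, ..., c) -> (a*b*..., c): collapse all but the last dim."""
    return (prod(shape[:-1]), shape[-1])


def matmul_2d_shape(a_shape: tuple, b_shape: tuple) -> tuple:
    """Result shape of a 2D matmul, checking the contraction dim."""
    if a_shape[1] != b_shape[0]:
        raise ValueError(f"cannot multiply {a_shape} by {b_shape}")
    return (a_shape[0], b_shape[1])


def restore_leading_dims(orig_shape: tuple, mm_shape: tuple) -> tuple:
    """(a*b*..., d) -> (a, b, ..., d) using the original leading dims."""
    return (*orig_shape[:-1], mm_shape[-1])


a_shape, b_shape = (4, 8, 16), (16, 32)       # (a, b, c) and (c, d)
flat = flatten_leading_dims(a_shape)          # (a*b, c)
mm = matmul_2d_shape(flat, b_shape)           # (a*b, d)
print(flat, mm, restore_leading_dims(a_shape, mm))
```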

## Solution TL;DR
- Use the 2D 'A' tensor and corresponding 2D scales as input to the fused_matmul_reduce_scatter implementation, instead of the 3D+ tensor/scales.
- Track the "pre mm reshape" and "post mm reshape" separately, to be referenced in the `fused_scaled_matmul_reduce_scatter` implementation, to update the scatter dim through the pre-mm reshape, and apply the post-mm reshape before applying the reduce scatter and returning the output tensor.
- Separate the `fused_matmul_reduce_scatter` and the `fused_scaled_matmul_reduce_scatter` code paths, to simplify them both.
- By fixing the bug in torchtitan (PR https://github.com/pytorch/torchtitan/pull/965) and implementing support for rowwise scales in pytorch in this PR, together these changes will solve the problem of how to support rowwise scales with all types of AC.

## Additional details for reviewers
To use the 2D A tensor while also supporting the "reshape -> mm -> reshape" pattern, the following other changes were needed:
- Track the pre-mm reshape, as it will affect the scatter dim used in the fused_matmul_reduce_scatter implementation.
- Track the post-mm reshape, as it will affect the output shape used in the fused_matmul_reduce_scatter implementation.
- Based on the pre-mm reshape and the original scatter dim, calculate the new scatter dim for the 2D tensor. This is needed because during the pipelined producer mm implementation, the scatter dim is moved to dim 0 (so it can be sharded along the first dim and then get chunks to do mm ops on by indexing into the first dim), then moved back to its original place before the reduce-scatter.
- Use the tracked post-mm reshape to reshape the stacked partial 2D outputs of the mm ops into 3D outputs needed for 1) the reduce-scatter w/ the original scatter dim, and 2) the expected output shape to prevent shape errors with subsequent ops.

## Test plan
- All existing unit tests passing.
- Expand unit tests for rowwise scales to test more scatter dims
- Added unit tests enforcing that async TP fails fast / throws an error if it fails to perform any fusions. Previously it just "failed silently" (fell back to vanilla TP without the user knowing) which has led to confusion, so this will improve the UX.
- Compared loss curves of bf16 vs float8 w/ rowwise scales to confirm integrity of numerics
- Confirmed via manual testing with torchtitan and inspecting the compile graph that the fusion is working as intended for:
    - bfloat16
    - float8 with tensorwise scales
    - float8 with rowwise scales

## Loss curves

Loss curves are virtually identical for bf16 + vanilla TP versus float8 with rowwise scales + async TP:

<img width="1017" alt="loss_async_tp" src="https://github.com/user-attachments/assets/4995db78-7012-490f-a370-f4fecc289a22" />

## Performance

#### Per op SAC
Performance benchmarks for torchtitan Llama3 8b training runs on 4 H100s with per op SAC, using FSDP degree=2, TP degree=2:
- bf16 (vanilla TP): TPS 5161.5, peak memory 50.53 GB
- bf16 (async TP): TPS 5229.5, peak memory 50.68 GB
- float8 tensorwise (vanilla TP): TPS 5959.5, peak memory 50.47 GB
- float8 tensorwise (async TP): TPS 5964.5, peak memory 50.47 GB
- float8 rowwise (vanilla TP): TPS 4962.0, peak memory 50.55 GB
- float8 rowwise (async TP): TPS 4966.5, peak memory 50.65 GB

#### Full AC
Llama3 70b training runs on 128 H100s with full AC, using FSDP=16, TP=8
- bf16 (vanilla TP): TPS 598, peak memory 71.51 GB
- bf16 (async TP): TPS 673, peak memory 71.08 GB (+12.54% TPS vs vanilla TP)
- float8 tensorwise (vanilla TP): TPS 820, peak memory 55.26 GB
- float8 tensorwise (async TP): TPS 950, peak memory 55.91 GB (+15.85% TPS vs vanilla TP)
- float8 rowwise (vanilla TP): TPS 540, peak memory 71.46 GB
- float8 rowwise (async TP): TPS 560, peak memory 70.65 GB (+3.7% TPS vs vanilla TP but still unexpectedly lower than bf16)
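
The percentage gains quoted for the full AC runs follow directly from the TPS figures. A small check (the helper name is hypothetical):

```python
def tps_gain_percent(vanilla_tps: float, async_tps: float) -> float:
    """Relative TPS improvement of async TP over vanilla TP, in percent."""
    return (async_tps / vanilla_tps - 1.0) * 100.0


# (vanilla TP, async TP) TPS pairs from the Llama3 70b full AC runs above.
runs = {
    "bf16": (598, 673),
    "float8 tensorwise": (820, 950),
    "float8 rowwise": (540, 560),
}
for name, (vanilla, async_tp) in runs.items():
    print(f"{name}: {tps_gain_percent(vanilla, async_tp):+.2f}% TPS")
```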

As you can see, float8 rowwise is working but performance needs to be improved further.

## Other changes
- Added logging so the user will know why fusion failed if it does.
- Remove logic which inserted a reshape node targeting "A scale" to get it to be in 3D like the "A tensor" since it's no longer needed.

## Long term plan
- Add a `scaled_matmul` op in pytorch, which will natively support a 3D+ "A tensor" and allow us to simplify the async TP implementation by avoiding the reshape -> scaled_mm -> reshape pattern and the special handling for it.

## Visualizing fused nodes in graphs for torchtitan training runs

Below are examples of the visualized graph generated by torch compile for torchtitan llama3 8b training runs with per op SAC. These graphs provide additional evidence (beyond the new unit tests added) that the implementation is working correctly.

### bf16

<img width="900" alt="bf16-fusion" src="https://github.com/user-attachments/assets/a3bed917-28eb-4a56-8d6e-2d2bf498385c" />

### float8 with tensorwise scales

<img width="900" alt="tensorwise-node" src="https://github.com/user-attachments/assets/b212ec4a-1899-44de-a4de-18c74e1de68a" />

### float8 with rowwise scales

<img width="900" alt="rowwise" src="https://github.com/user-attachments/assets/ed3354a3-894b-4ec9-86d0-f80364bf3d83" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149247
Approved by: https://github.com/kwen2501
2025-03-27 03:15:30 +00:00
114d404b07 [cuda] Add new faster gammabeta backward kernel (#148605)
This PR adds a new kernel for producing gamma and beta values for the backward pass in a performant way.

To test the performance against the baseline, I measured the backward pass of layernorm while sweeping over the following variables:

1. dtype in {half, float}
2. M in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
3. N in `2**k, 2**k - 1, 2**k + 1 for k in range(...)`
4. Whether we flush the L2 cache before running the backward pass
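
A sweep over sizes of the form `2**k`, `2**k - 1` and `2**k + 1` can be generated as sketched below. The exponent range is elided in the description above, so the range used here is a placeholder, and the helper name is hypothetical.

```python
def sweep_sizes(exponents) -> list:
    """Return sorted, de-duplicated sizes 2**k - 1, 2**k, 2**k + 1 for each k."""
    sizes = set()
    for k in exponents:
        for delta in (-1, 0, 1):
            value = 2**k + delta
            if value > 0:  # skip non-positive sizes such as 2**0 - 1
                sizes.add(value)
    return sorted(sizes)


# Placeholder exponent range, chosen only to keep the output short.
print(sweep_sizes(range(1, 4)))
```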

Summary: The new code performs better than the old code, especially for powers of 2. For the M >> N case, it performs very well (the kernel itself can be 30x faster and the overall backward pass can be 5-10x faster).

In order to visualize results of the kernel when choosing different values of M, N and dtype, I wrote some code to generate a heatmap. The heatmap has N on the x-axis, M on the y-axis and color-coded points where green shows performance improvement and red shows regressions. For example, `m=32 n=2048 1.42x` in the heatmap would indicate that the normalized shape had 32 elements, the leading dimensions' product was 2048 elements, and the new kernel resulted in the *backward pass* being 1.42x faster than the old *backward pass*.

Important note: This heatmap shows the total backward pass time as seen by the user. The kernel time difference can be sometimes very large while the total backward pass time is not that high. For example, for dtype=torch.half, M=32 N=2048, flush_l2_cache=True case, the heatmap shows a speedup of 1.42x, while ncu tells me the new kernel is 2.5x faster than the old:

M=32 N=2048 dtype=half flush_l2=True Old Kernel NCU summary:
```
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.59
    SM Frequency                    Ghz         1.35
    Elapsed Cycles                cycle       27,526
    Memory Throughput                 %         2.21
    DRAM Throughput                   %         0.54
    Duration                         us        20.42
    L1/TEX Cache Throughput           %         4.31
    L2 Cache Throughput               %         2.62
    SM Active Cycles              cycle     1,475.02
    Compute (SM) Throughput           %         0.29
    ----------------------- ----------- ------------
```

M=32 N=2048 dtype=half flush_l2=True New Kernel NCU summary:
```
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.59
    SM Frequency                    Ghz         1.34
    Elapsed Cycles                cycle       10,920
    Memory Throughput                 %         5.64
    DRAM Throughput                   %         1.35
    Duration                         us         8.13
    L1/TEX Cache Throughput           %         1.92
    L2 Cache Throughput               %         6.89
    SM Active Cycles              cycle     3,554.41
    Compute (SM) Throughput           %         0.67
    ----------------------- ----------- ------------
```

Let's look at some rows from the heatmap. For dtype=float16 flush_l2_cache=True and when input shapes are powers of 2, we get the following:

<img width="1508" alt="image" src="https://github.com/user-attachments/assets/06179599-b2f0-4a45-8664-247a1067950b" />

There are 3 columns -- the first shows all data points, the second shows speedups only and the 3rd column shows regressions only. We can see that there are dramatic speedups for M >> N cases and the regressions are not that high (less than 1%, which could just be measurement noise). Here is a small guide I made:

![image](https://github.com/user-attachments/assets/90c26f7c-e3ad-46d2-a6ce-fe4b5fb3d738)

For dtype=float32, we get a similar chart:

<img width="1499" alt="image" src="https://github.com/user-attachments/assets/c4d31a76-03b0-426c-9114-e1bfad29b530" />

The new code performs especially well for m >> n cases, and also where m and n are small. The m >> n case is special because we run 2 reduction kernels back to back and parallelize in the "M" dimension (the older kernel only parallelized in the "N" dimension).

The new code can sometimes have regressions for non-powers of 2. That is because the old code was using block sizes of {16, 32} while we have `threads.x = 32`. For example when N=33, the old code would have 3 blocks and we will have 2 blocks. I wrote some code to specialize for this case, but I think it will add complexity and @ngimel mentioned that non-powers of 2 are rare enough.
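
The block counts in the N=33 example are consistent with a ceiling division of N by the block width. A small sketch, assuming that one-dimensional model (the helper name is hypothetical and the real launch configuration may differ):

```python
def num_blocks(n: int, block_width: int) -> int:
    """Number of blocks needed to cover n elements (ceiling division)."""
    return -(-n // block_width)


# N=33: a block width of 16 needs 3 blocks, a width of 32 needs 2.
print(num_blocks(33, 16), num_blocks(33, 32))
# N=32 (a power of 2) divides evenly in both cases.
print(num_blocks(32, 16), num_blocks(32, 32))
```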

I am including the regressions here for completeness' sake:

<img width="1500" alt="image" src="https://github.com/user-attachments/assets/31c17cfb-ed9b-4106-b9c8-5c359751f530" />

To see this better:

1. Click the image
2. Right click the expanded image and open in a new tab
3. Go to that tab and left click once to zoom in

If you want to see the full data, here it is:

![image](https://github.com/user-attachments/assets/54fb60c9-8c0c-4530-a1dd-79ecda1a69a1)

I also measured binary size and compile time since those are important for developers:

Binary size comparison

![image](https://github.com/user-attachments/assets/ceef5073-1036-47f6-b9dc-cea088beda51)

```
# Original
-rwxr-xr-x 1 ahmads users 307193112 Mar  6 08:46 ./torch/lib/libtorch_cuda.so

# This PR
-rwxr-xr-x 1 ahmads users 307193112 Mar  6 08:46 ./torch/lib/libtorch_cuda.so
```

The diff in bytes is 302kB which is about a 0.1% increase.

Compile time difference:

```
# Original

real    0m10.931s
user    0m9.676s
sys     0m1.004s

# this PR

real    0m16.720s
user    0m15.514s
sys     0m1.066s

# Command I ran
time /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASH_NAMESPACE=pytorch_flash -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUNFUSE_FMA -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/third_party/flash-attention/csrc/flash_attn/src -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. 
-I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda 
-DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/layer_norm_kernel.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o

```

So the new PR increases compile time by 6 seconds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148605
Approved by: https://github.com/ngimel
2025-03-27 03:01:53 +00:00
b2b9aaf0ad Fix non-strict export doesn't turn on dynamo for hop (#149903)
Somehow `torch._dynamo.is_compiling` was changed to `torch.compiler.is_compiling()`, which also checks whether we're exporting. This was not caught by CI because we don't have an export test for scan.

Changed to `torch.compiler.is_dynamo_compiling` and added a test.

edit: piggyback the re-tracing support in this PR. Related code in combine_fn_is_normalized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149903
Approved by: https://github.com/zou3519
2025-03-27 02:38:05 +00:00
dad0854d48 meta registration for torch._scaled_mm with mxfp8 (#148461)
Summary:

Adds the meta registration logic for torch.compile to work with
`torch._scaled_mm` with mxfp8.  Thanks to @eellison  for the pointer to make inductor work with this.

Test Plan:

```
pytest test/test_matmul_cuda.py -k test_blockwise_mxfp8_compile -s
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148461
Approved by: https://github.com/drisspg, https://github.com/eellison
2025-03-27 02:32:40 +00:00
d5593ea31c Introduce guard_or_true, guard_or_false (#148430)
some context in this document:
https://docs.google.com/document/d/18nJsj-F2C_QXO7ClwzPcAUENQ-B440B43W7DdDnlDt4/edit?tab=t.0#heading=h.pgebnyi7pocj

But TLDR;
`guard_or_true`, `guard_or_false` are better than `guard_size_oblivious` because:
- It is easier to reason about what assumptions we are making while reading the code.
- They avoid size_oblivious complexity that is not needed.
- They avoid unsoundness that could make `guard_size_oblivious(a==1)` be true when it's not true for some value `a` during runtime.
- Fewer data dependent errors in some cases: e.g., when doing `guard_size_oblivious(a==1)` and we know `a` is a tensor size, if it's traced with `a=u1-u2`, `guard_size_oblivious(a==1)` will throw a data dependent error but `guard_or_false` will just return `False`.

### How is it different from statically_known_true??
**`if(cond)`:** (normal guarding) will try to evaluate statically and guard on the condition, willing to restrict the input space to evaluate cond. If it fails to evaluate due to a data dependent error, it will throw an exception (that could be converted to a graph break in some situations).

**`statically_known_true(cond)`:** would be used when you never want to add a guard (restrict your input space), but just want to do a best effort check to see if you can infer that something is true/false ONLY based on existing constraints.

**`guard_or_true(cond)`/`guard_or_false(cond)`:** These would be used in situations where you prefer guarding and knowing the result of the expression over not guarding, but in case you hit a data dependent error you are ok with just returning true or false.
Some reasons you might be ok with returning true/false instead could be:
1. It's an optimization, and I do not want to fail just because the optimization could not be performed.
2. I am willing to deviate from the normal semantics when I have unbacked for the benefit of not failing (See the doc above for more details).

**`definitely_true(cond)`**: same as `guard_or_false(cond)` except does not try to do static eval for unbacked (planning to deprecate it and replace uses with `guard_or_false` or make it alias to `guard_or_false`)
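
The fallback behavior described above can be sketched with a toy model. This is not the PyTorch implementation: `DataDependentError`, `evaluate`, and the `toy_*` functions are hypothetical stand-ins, and an undecidable (unbacked) condition is modeled simply as `None`.

```python
class DataDependentError(Exception):
    """Toy stand-in for a data dependent error during symbolic evaluation."""


def evaluate(cond):
    """Toy evaluator: a bool is decidable, None models an undecidable condition."""
    if cond is None:
        raise DataDependentError("condition depends on runtime data")
    return cond


def toy_guard_or_false(cond):
    """Return the decided value, or False when the condition cannot be decided."""
    try:
        return evaluate(cond)
    except DataDependentError:
        return False


def toy_guard_or_true(cond):
    """Return the decided value, or True when the condition cannot be decided."""
    try:
        return evaluate(cond)
    except DataDependentError:
        return True


def toy_pick_path(size_is_one):
    """Hypothetical call site: take the optimization only when it is known to apply."""
    if toy_guard_or_false(size_is_one):
        return "fast path"
    return "general path"
```

In this sketch an undecidable condition silently selects the general path instead of raising, which mirrors reason 1 above: skipping an optimization is preferable to failing.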

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148430
Approved by: https://github.com/bobrenjc93
2025-03-27 02:22:20 +00:00
c2b8fead43 Allow TritonTemplate subclasses to override kernel type (#150018)
Allows subclasses of `TritonTemplate` to override the kernel type, e.g.
```
class MyTritonTemplate(TritonTemplate):
    kernel_type = MyTritonTemplateKernel
```

This means that all of the logic in `TritonTemplate` class doesn't need to be duplicated in subclasses if the only required change is the kernel type.

Note that there is precedent for doing this - see `SIMDScheduling` in `torch/_inductor/codegen/simd.py`:

```
class SIMDScheduling(BaseScheduling):
    kernel_type: type[Any] = SIMDKernel  # override in subclass
...
```
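
As a self-contained illustration of the class-attribute pattern (a sketch only; `ToyKernel`, `ToyTemplate`, and `make_kernel` are hypothetical names, not Inductor APIs), the base class instantiates whatever `kernel_type` resolves to, so a subclass only has to rebind the attribute:

```python
class ToyKernel:
    """Hypothetical base kernel."""

    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"{type(self).__name__}({self.name})"


class ToyTemplate:
    kernel_type = ToyKernel  # override in subclass

    def make_kernel(self, name):
        # All shared logic lives here; only the kernel class varies.
        return self.kernel_type(name)


class MyToyKernel(ToyKernel):
    """Hypothetical specialized kernel."""


class MyToyTemplate(ToyTemplate):
    kernel_type = MyToyKernel
```

Calling `MyToyTemplate().make_kernel("mm")` builds a `MyToyKernel` without `make_kernel` being redefined in the subclass.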

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150018
Approved by: https://github.com/jansel
2025-03-27 02:16:40 +00:00
8d1cfb63b5 [export] Save unflattened gm (#150030)
Summary: Reland of D71082652

Test Plan:
https://www.internalfb.com/intern/testinfra/testrun/8444249558423545
https://www.internalfb.com/intern/testinfra/testrun/7318349652864293
https://www.internalfb.com/intern/testinfra/testrun/13229323980143778
https://www.internalfb.com/intern/testinfra/testrun/11540474119884081

Differential Revision: D71902033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150030
Approved by: https://github.com/pianpwk
2025-03-27 02:01:51 +00:00
128b32f363 cache loaded python modules (#149910)
I am splitting the caching of module loading from the caching of codegen since it's trivial and much easier.
Module loading and codegen are each 50% of the cost of maybe_append choice on the full graph model, which in turn is 40% of total compile time.

<img width="434" alt="Screenshot 2025-03-24 at 4 35 12 PM" src="https://github.com/user-attachments/assets/aa851c6a-bde9-43f8-b12d-e439504ef62c" />

running mm_loop benchmark,
before this change:
67947323682

after this change:
25845073249

2.6X faster.

It seems that the cache was there and then got dropped. I added a benchmark so it won't be dropped again by mistake.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149910
Approved by: https://github.com/eellison, https://github.com/aorenste
ghstack dependencies: #149932
2025-03-27 00:45:09 +00:00
48cff64a54 [pt2_provenance_tracing] add combo kernel nodes post_grad nodes origin info (#149598)
Summary: found it helpful when running prod model with combo_kernel feature enabled

Test Plan: CI

Differential Revision: D71513304

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149598
Approved by: https://github.com/yushangdi
2025-03-27 00:26:24 +00:00
731b559f54 [easy] Use config patch to toggle capture_scalar_output (#150036)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150036
Approved by: https://github.com/angelayi
ghstack dependencies: #149087, #149667
2025-03-27 00:01:39 +00:00
999fa15ba8 [invoke_subgraph][fake tensor cache] Add a finalizer for id hashed objects (#149667)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149667
Approved by: https://github.com/zou3519
ghstack dependencies: #149087
2025-03-27 00:01:39 +00:00
a7596b4b34 [invoke_subgraph] Fake tensor prop caching (#149087)
Redoing https://github.com/pytorch/pytorch/pull/137808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149087
Approved by: https://github.com/zou3519
2025-03-27 00:01:39 +00:00
3efa211e48 [ONNX] Annotate None inputs in symbolic ops (#150038)
Add `None` to type annotations of `torch.onnx.ops.symbolic*` ops and improve tests to test support for optional inputs. Previously it was omitted mistakenly even though the implementation supports it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150038
Approved by: https://github.com/titaiwangms
2025-03-27 00:01:09 +00:00
6db95ccf4c Delete linux-focal-cuda12_6-py3_10-gcc11-bazel-test (#150066)
It's been broken for a while, even when these jobs were still called `linux-focal-cuda12.4-py3.10-gcc9-bazel-test`.
It last ran successfully on Feb 21st.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150066
Approved by: https://github.com/yangw-dev, https://github.com/seemethere, https://github.com/atalman
2025-03-26 23:55:58 +00:00
43cc954f88 Refactor row-wise scaled MM (#149978)
1. Add config selection for SM89.
2. Only build kernels if compiling for given arch.
3. Factor out CMake code to enforce compiling for needed archs for individual files into a function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149978
Approved by: https://github.com/drisspg
2025-03-26 23:49:41 +00:00
6aca002d82 [MPS] Add chebyshev_polynomial_[uvw] (#150060)
For both eager and inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150060
Approved by: https://github.com/dcci, https://github.com/jansel
2025-03-26 23:35:05 +00:00
185aaaaf8e Revert "Improve subproc autotuning implementation (#149700)"
This reverts commit 8cd6a133f21821f0713116f0f9a55e5368de8c1c.

Reverted https://github.com/pytorch/pytorch/pull/149700 on behalf of https://github.com/yangw-dev due to This is breaking servicelab_benchmark_pyper_local_runner internally ([comment](https://github.com/pytorch/pytorch/pull/149700#issuecomment-2755975959))
2025-03-26 23:17:01 +00:00
db8f4c1b1b [MPSInductor] Run chebyshev_polynomial_t tests (#150042)
Test name should start with `test_`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150042
Approved by: https://github.com/dcci
2025-03-26 22:50:08 +00:00
9aa0612dd3 [targets2buck] Remove tombstone messages proactively (#147897)
Summary:
X-link: https://github.com/pytorch/executorch/pull/8703

Originally we created a bunch of empty `TARGETS` files to allow us to enable `BUCK` files in fbcode by hiding the existing BUCK file. These files were subsequently merged together using `non_fbcode_target` so these tombstones are no longer necessary.

This diff fixes all files that WOULD have had the useless tombstone merged into them. To create this diff, I just ran the merger script that Codemod Service is using and then deleted the "merged from" and tombstone lines with `sed`, `arc f` and reverted any lines that didn't make sense

Test Plan: CI

Differential Revision: D69994481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147897
Approved by: https://github.com/izaitsevfb
2025-03-26 22:15:17 +00:00
c0af782f30 [ROCm] Change LoadHIP to use find_file for rocm_version.h (#149983)
Fixes #149805

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149983
Approved by: https://github.com/jeffdaily
2025-03-26 21:26:41 +00:00
625913eefc [MTIA] [Triton] Set codename of MTIA device in triton heuristics (#149860)
Summary: Triton-MTIA expects the codename of the device as the arch when querying the module map, not the compute capability. This diff gets rid of the following error: `No libdevice is provided for arch (0, 0)`

Test Plan: CI

Reviewed By: Myrthan

Differential Revision: D70072095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149860
Approved by: https://github.com/jansel
2025-03-26 20:58:12 +00:00
87bfd66c3c gloo: update to latest version (#149985)
This updates submodule Gloo to the latest version and brings a number of benefits:

* connection retries d2609ab5e8
* better error messages 5ca057d6cc
* multi_get support for larger scale jobs 4ff6edf45f
* metadata exchange optimizations  20dc202dd8
* miscellaneous other fixes

Old commit: 5354032ea0

Test plan:

This is already being used in production environments at scale.

PyTorch CI

```
pytest -v test/distributed/test_c10d_gloo.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149985
Approved by: https://github.com/fduwjj, https://github.com/malfet
2025-03-26 19:19:31 +00:00
039ebdc192 [Graph Partition] Support symbol inputs (#149458)
This PR supports symbol inputs to graph partition functions. Before this PR, we rely on `node.read_writes` to get partition inputs. However, this does not cover symbol inputs.

In this PR, for each graph partition, we collect all symbol inputs which are required to be in scope to successfully perform codegen, including:
- free symbols used in partition nodes.
- free symbols in partition input/node shapes, strides, and offsets. This is needed for recording cudagraphs for tensors with dynamic shapes.

### Note1: MutationLayout
In this example, node.layout is MutationLayoutSHOULDREMOVE. The symint from index `n` does not appear in the size, offset, or strides of node.layout. This symint appears in node.layout.target, so we need extra handling for it.

```python
x = torch.zeros(7, device="cuda")

def fn(n, a):
    a[n] = -1
    return a

opt_fn = torch.compile(fn, fullgraph=True)

for n in range(2, x.shape[0]):
    opt_fn(n, x)
```

### Note2: Composability with Padded Tensor Subclass

W/o graph partition, Padded Tensor subclass lifts outer shapes to input arguments (i.e., arg0_1 for s0, arg1_1 for s1) but does not lift inner shapes (i.e., s2 and s3). Since cudagraph cache relies on integer inputs, it will cache on outer shapes and ignore inner shapes, which is bad.

```
def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1 = args
    args.clear()
    s0 = arg0_1
    s1 = arg1_1
    arg2_1_size = arg2_1.size()
    s2 = arg2_1_size[0]
    s3 = arg2_1_size[1]
    assert_size_stride(arg2_1, (s2, s3), (s3, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = empty_strided_cuda((s2, s3), (s3, 1), torch.float32)
        # Topologically Sorted Source Nodes: [x1, mul], Original ATen: [aten.add, aten.mul]
        triton_poi_fused_add_mul_0_xnumel = s2*s3
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_mul_0.run(arg2_1, buf0, triton_poi_fused_add_mul_0_xnumel, stream=stream0)
        del arg2_1
    return (buf0, s0, s1, s1, )
```

w/ graph partition, the partition function only includes tensor and inner shapes as inputs, to make sure the cudagraph caching is correct. Full Comparison: [code](https://www.internalfb.com/intern/diffing/?paste_number=1761674743)
```python
    def call(self, args):
        arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1 = args
        args.clear()
        s0 = arg0_1
        s1 = arg1_1
        arg2_1_size = arg2_1.size()
        s2 = arg2_1_size[0]
        s3 = arg2_1_size[1]
        assert_size_stride(arg2_1, (s2, s3), (s3, 1))
        partition0_args = [arg2_1, s2, s3]
        del arg2_1
        (buf0,) = self.partitions[0](partition0_args)
        del partition0_args
        return (buf0, s0, s1, s1, )
```

The number of cudagraphs is validated below: (also added to test)
```python
import torch

from padded_tensor import PaddedTensor

# Turning off graph_partition leads to
# torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id().id=6
# at the end, which is wrong.
# torch._inductor.config.graph_partition = False

# Turning on graph_partition leads to
# torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id().id=4
# at the end, which is correct.
torch._inductor.config.graph_partition = True

def f(x):
    x1 = x + 1
    return x1 * 2

compiled_f = torch.compile(f, mode="reduce-overhead")

def run(shape):
    x = torch.randn(*shape, device="cuda")
    pad_x = PaddedTensor.from_tensor(x, multipliers={0:4, 1:4})
    assert hasattr(pad_x, "multipliers"), breakpoint()
    eager_out = f(pad_x)

    for _ in range(3):
        compiled_out = compiled_f(pad_x)
    compiled_out = compiled_f(pad_x)

    assert eager_out.shape == compiled_out.shape
    assert eager_out.tensor.shape == compiled_out.tensor.shape
    assert torch.allclose(eager_out.tensor, compiled_out.tensor)

# static shape. record a NEW cudagraph. 1 cudagraph in total now.
run((2,3))
# outer shape is dynamic, leading to a new dynamo graph
# this new dynamo graph forces a NEW cudagraph. 2 cudagraphs in total now
run((3,4))
# outer shape changed but inner shape does not change
# so NO new cudagraph is recorded
run((2,2))
# inner shape is dynamic now, leading to a new dynamo graph
# this new dynamo graph forces a NEW cudagraph. 3 cudagraphs in total now
run((5,6))
# does NOT record a new cudagraph
run((7,8))
# record a NEW cudagraph. 4 cudagraphs in total now
run((10,11))

assert torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id().id == 4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149458
Approved by: https://github.com/eellison
2025-03-26 17:21:30 +00:00
4a9466c96a Newer conda versions require --update-deps to update dependencies such as libgcc-ng (#149599)
* When we try to install [libstdcxx-ng 12.3.0 from conda-forge](595293316d/.ci/docker/common/install_conda.sh (L65)), conda 24.7.1 updates the dependencies of that package, including libgcc-ng package to the following:  `libgcc-ng-14.2.0 | h69a702a_2 52 KB conda-forge`

* However, conda updated their installer script on Feb 6 2025 to version 25.1.1, which behaves differently from previous versions when installing conda packages.

* conda 25.1.1 does *not* update any dependencies in the above step, and hence the same installation of libgcc-ng from "defaults" channel is present: `libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1`

* Adding the "--update-deps" flags to the conda install command installs a newer libgcc-ng package from the "conda-forge" conda channel:  `libgcc-ng-12.3.0 | h77fa898_13 762 KB conda-forge`, which is compatible with the libstdcxx-ng 12.3.0 package

* Compare this [Feb 4 docker build](https://github.com/pytorch/pytorch/actions/runs/13148456164/job/36691412387#step:6:5179) to this [Feb 10 docker build](https://github.com/pytorch/pytorch/actions/runs/13247023578/job/36975931849#step:6:5451), which shows that the latter does *not* update libgcc-ng.

* This creates linking issues when trying to use a library that was built with a newer libgcc_s.so.1 (from the libgcc-ng package) in the PyTorch conda environment, e.g. ONNX-RT:
```
2025-02-13 10:18:38.492434704 [W:onnxruntime:Default, migraphx_execution_provider.cc:167 get_flags_from_env]
[MIGraphX EP] MIGraphX ENV Override Variables Set:
2025-02-13 10:18:38.628064251 [E:onnxruntime:Default, provider_bridge_ort.cc:2028 TryGetProviderInfo_ROCM] /onnxruntime/onnxruntime/core/session/provider_bridge_ort.cc:1636 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_rocm.so with error: /opt/conda/envs/py_3.10/bin/../lib/libgcc_s.so.1: version `GCC_12.0.0' not found (required by /opt/conda/envs/py_3.10/lib/python3.10/site-packages/onnxruntime/capi/libonnxruntime_providers_rocm.so)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149599
Approved by: https://github.com/malfet
2025-03-26 17:04:21 +00:00
b2088f1afe Add inductor test for torchbind symint (#149980)
Summary: add test

Test Plan:
```
buck run //caffe2/test:test_export -- -r test_compile_custom_obj_unbacked_symint
```

Differential Revision: D71843179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149980
Approved by: https://github.com/BoyuanFeng
2025-03-26 17:02:55 +00:00
a0253d2840 [Inductor] Use real input to autotune user defined triton kernels (#149553)
Summary:
User-defined Triton kernels sometimes rely on real inputs to determine
the path of execution. We need real inputs to invoke the correct
behavior of the user-defined Triton kernels (see the example in the test case,
where we have an early return for random inputs).

Test Plan:
Included in the commit.
python test/inductor/test_aot_inductor.py -k triton_autotuning
python test/inductor/test_aot_inductor.py -k triton_mutated_autotuning

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149553
Approved by: https://github.com/davidberard98, https://github.com/eellison
2025-03-26 16:42:48 +00:00
3a8171efad [MPS] Preserve in/out dtypes in binary_op name (#150024)
To be consistent with unary ops and avoid silent correctness problems if someone tries to invoke the op with an unexpected out dtype.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150024
Approved by: https://github.com/dcci
2025-03-26 16:00:43 +00:00
32299e5f9a Reland "Introduce new template heuristic for triton autotune configs" (#147452)
This change was reverted in https://github.com/pytorch/pytorch/pull/147388 for regressing an internal workload.

I have removed the additional ir.device_type calls in mm_scaled and unpack_mixed_mm.py which could be contributing to the additional compile time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147452
Approved by: https://github.com/jansel
2025-03-26 15:47:06 +00:00
7336b76bcc Refactor cudnn version check in smoke test for Windows (#150015)
After https://github.com/pytorch/pytorch/pull/149885

I see failures on the Windows smoke test:
https://github.com/pytorch/test-infra/actions/runs/14069923716/job/39401550854

This is because PyPI packages such as cudnn and nccl are installed only on Linux, so this change should resolve the issue on Windows.
On Windows, cudnn is shipped with PyTorch as opposed to being installed dynamically.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150015
Approved by: https://github.com/ZainRizvi
2025-03-26 15:15:46 +00:00
8a40fca9a1 Support huggingface reading and writing for multi rank case (#148189)
Summary: This diff adds the ability for HF reader/writer to read/write in a distributed way. We do this by sending all the tensors meant for the same file to the same rank.
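
The routing rule in the summary (every tensor destined for the same file goes to the same rank) can be sketched as below. This is an illustrative assumption, not the diff's code: `assign_tensors_to_ranks` is a hypothetical helper, and the round-robin file-to-rank policy is chosen only for the example.

```python
from collections import defaultdict


def assign_tensors_to_ranks(tensor_to_file, world_size):
    """Map each file to one rank, then group tensor names by their file's rank."""
    files = sorted(set(tensor_to_file.values()))
    # Round-robin over sorted file names so every rank computes the same mapping.
    file_to_rank = {name: index % world_size for index, name in enumerate(files)}
    per_rank = defaultdict(list)
    for tensor_name, file_name in sorted(tensor_to_file.items()):
        per_rank[file_to_rank[file_name]].append(tensor_name)
    return file_to_rank, dict(per_rank)
```

Because the mapping depends only on the sorted file names and the world size, each rank can compute it independently, and no file is written by more than one rank.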

Test Plan:
ensure existing tests pass
I also ran a full end to end test on my devserver to read/write from my HF repo

Differential Revision: D70096439

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148189
Approved by: https://github.com/joecummings, https://github.com/saumishr
2025-03-26 14:47:31 +00:00
0c139fa58e Switch s390x tests to blocklist (#149507)
Switch s390x tests to blocklist
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149507
Approved by: https://github.com/seemethere
2025-03-26 12:11:41 +00:00
7379c66344 add loop mm benchmark (#149932)
results:
compile time instruction count for iteration 4 is 67947323682

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149932
Approved by: https://github.com/bobrenjc93, https://github.com/eellison
2025-03-26 11:21:30 +00:00
cyy
79e8a69257 Enable move warnings for torch targets (#149923)
This PR enables more move warnings for torch targets and fixes some code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149923
Approved by: https://github.com/malfet
2025-03-26 08:38:13 +00:00
de68ddc68e [MPS] Fix metal ops with different dtypes (#149974)
By implementing `_cast_` flavors of both dense and strided ops. Add regression tests that tests `fmax`/`fmin` for mixed dtypes.

I have been dreading writing this PR for a while, as it ended up being pretty bulky:
 - Adds `C10_METAL_ALL_TYPES_FUNCTOR` and `c10::metal::ScalarType` to `c10/metal/common.h` and tests that its values always match `c10::ScalarType`
 - Add `c10::metal::cast_to` to `c10/metal/utils.h` which could be used to cast any scalar metal dtype to any other one, including complex values
 - Implement `val_at_offs<T>(constant void *, long offs, ScalarType dtype)` that is used to dynamically cast types
 - Add `binary_strided_cast` and `binary_dense_cast` that are invoked for output dtype and cast both inputs to that output before performing the op
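
A rough Python analogue of the dynamic-cast idea behind `val_at_offs` and `binary_dense_cast`, offered as a sketch under stated assumptions: dtypes are identified by string tags instead of `ScalarType`, the buffers are dense little-endian byte strings, and the common "output" type is a Python float. The Metal kernels themselves are not reproduced here.

```python
import struct

# Hypothetical dtype tags mapped to struct format codes (little-endian).
_FORMATS = {"float16": "<e", "float32": "<f", "int32": "<i"}


def val_at_offs(buf, offs, dtype):
    """Read one element of the given dtype at byte offset `offs` and cast it to float."""
    return float(struct.unpack_from(_FORMATS[dtype], buf, offs)[0])


def binary_dense_cast(op, a, a_dtype, b, b_dtype, numel):
    """Apply `op` elementwise after casting both inputs to the common output type."""
    a_size = struct.calcsize(_FORMATS[a_dtype])
    b_size = struct.calcsize(_FORMATS[b_dtype])
    return [
        op(val_at_offs(a, i * a_size, a_dtype), val_at_offs(b, i * b_size, b_dtype))
        for i in range(numel)
    ]
```

For example, `binary_dense_cast(max, a, "float32", b, "float16", 3)` computes an elementwise maximum over a float32 buffer and a float16 buffer without a separate code path per dtype pair.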

Benchmark collected on an M2 Pro, running fmax on 1-million-element tensors (times are in microseconds):

|                                       | dense-dense | transp-transp | dense-transp | transp-dense | dense-scalar | dense-bcast |
|---------------------------------------|-------------|---------------|--------------|--------------|--------------|-------------|
| fmax (torch.float16, torch.float16)   | 160.9       | 159.9         | 270.5        | 270.9        | 236.6        | 293.0       |
| fmax (torch.float32, torch.float32)   | 176.9       | 171.0         | 273.7        | 293.5        | 242.6        | 294.2       |
| fmax (torch.float32, torch.float16)   | 171.4       | 170.9         | 283.6        | 303.0        | 253.7        | 302.3       |
| add (torch.float16, torch.float16)    | 218.0       | 223.6         | 221.0        | 222.0        | 214.9        | 218.3       |
| add (torch.float32, torch.float32)    | 227.4       | 233.9         | 228.8        | 231.9        | 218.9        | 221.4       |
| add (torch.float32, torch.float16)    | 226.1       | 227.5         | 227.5        | 226.9        | 177.0        | 190.8       |

TODOS:
 - Include input and output dtype in non-cast kernel name
 - Make TensorFactory.h use `C10_METAL_ALL_TYPES_FUNCTOR`
 - Extend mixed_dtypes testing via OpInfo

Fixes https://github.com/pytorch/pytorch/issues/149951
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149974
Approved by: https://github.com/manuelcandales
2025-03-26 07:03:21 +00:00
aa575cab71 Skip cxxabi check for s390x (#149954)
On s390x, gcc 14 is used because it contains a fix for the interaction between precompiled headers and vectorization builtins. This fix is not available in earlier gcc versions. gcc-14 uses ABI19, but the check still fails, so skip it for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149954
Approved by: https://github.com/cyyever, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-26 06:50:27 +00:00
6ae8eb881c [ONNX] Clean up the diagnostics module (#149864)
Remove the diagnostics/SARIF module from the ONNX exporter because it is obsolete and unused.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149864
Approved by: https://github.com/titaiwangms
2025-03-26 05:58:32 +00:00
d256b2dcb2 Revert "[custom_ops][perf] Move expensive pytree traversals of tensors to C++ (#148555)"
This reverts commit d686d04c2f3bac110044ebad5cc46e3035d7b425.

Reverted https://github.com/pytorch/pytorch/pull/148555 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/148555#issuecomment-2753283221))
2025-03-26 05:27:52 +00:00
819b23e0b4 Support None return type in torchbind and Add more AOTI torchbind e2e tests (#149749)
Summary:
- Add more tests for torchbind in aoti

**FallBackKernel**
- In FallbackKernel.find_device, do not check the device of torchbind obj because they don't have a fixed "device"
- If no device found for CallTorchBindObject, use cpu
- handle None output in `export_extern_kernel_node`

Test Plan:
```
buck run //sigmoid/inference/test:e2e_test_cpu -- -r CustomClassHolderConstantDynamic
```

Differential Revision: D70746626

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149749
Approved by: https://github.com/desertfire
2025-03-26 04:20:14 +00:00
71acb1bb42 [inductor] Fix division by zero error in fractional max (#148729)
Fixes https://github.com/pytorch/pytorch/issues/148152
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148729
Approved by: https://github.com/eellison
2025-03-26 04:18:50 +00:00
eqy
9108d153ce [CUDA][SymmetricMemory] Interpret empty string as std::nullopt in rendezvous (#149793)
This is a "temporary" fix, as the current internal API requires strings at some interfaces instead of `std::optional`, and empty strings are presumably used in lieu of `nullopt`.
e.g.,
9d02b3993f/torch/csrc/distributed/c10d/intra_node_comm.cu (L49)

this currently breaks `test_intra_node_comm_all_reduce`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149793
Approved by: https://github.com/kwen2501, https://github.com/cyyever
2025-03-26 03:59:43 +00:00
ab9ca6b31f Revert "[inductor] Fix mm logging for torch._scaled_.mm (#149967)"
This reverts commit 661d74bf4483e19e158c41b55d47f02eb9fdcc21.

Reverted https://github.com/pytorch/pytorch/pull/149967 on behalf of https://github.com/malfet due to This broke ROCM testing, see 45b11730f1/1 ([comment](https://github.com/pytorch/pytorch/pull/149967#issuecomment-2753149024))
2025-03-26 03:29:59 +00:00
45b11730f1 [ROCm][TunableOp] TunableOp Context Manager for unit tests (#149930)
This PR is cleanup only. There are no feature changes or bug fixes.

We create a TunableOp context manager for setup and cleanup, and rewrite the TunableOp unit tests in terms of this context manager. This ultimately reduces the amount of copy-pasted code.
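
The setup-and-cleanup pattern can be sketched with `contextlib`. This is not the PR's context manager: `toy_tunableop_env` is a hypothetical helper, and the `PYTORCH_TUNABLEOP_ENABLED` environment variable is used here only as an illustrative piece of state to set and restore; the real tests may manage TunableOp state differently.

```python
import contextlib
import os


@contextlib.contextmanager
def toy_tunableop_env(enabled=True, **extra_env):
    """Set TunableOp-related environment variables, restoring the old values on exit."""
    updates = {"PYTORCH_TUNABLEOP_ENABLED": "1" if enabled else "0", **extra_env}
    saved = {key: os.environ.get(key) for key in updates}
    os.environ.update(updates)
    try:
        yield
    finally:
        # Cleanup runs even if the test body raises.
        for key, old_value in saved.items():
            if old_value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old_value
```

Each test body then shrinks to a `with toy_tunableop_env():` block, and the restore logic is written once instead of being copied into every test.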

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149930
Approved by: https://github.com/jeffdaily
2025-03-26 02:59:58 +00:00
a8d0c5c928 [inductor][triton 3.3] Fix cpp_wrapper w/ TMA in triton 3.3 (#149973)
Fixes #148938

Context:

In triton 3.3, triton kernels expect a global scratch space arg to be passed in. This is fixed in #148051, which fixed most of the AOTI/cpp_wrapper failures; the fix is to inject a (null) global scratch space arg passed as an argument to all kernels.

But in the case of TMA, we need to call a non-triton-generated function - init1DTMADescriptor. The same `generate_args_decl` function used for calling triton kernels (and modified in #148051 to insert a global scratch space) is used to prepare the arguments to init1DTMADescriptor, and so it had an extra global scratch space arg. Then we'd get a null pointer passed into init1DTMADescriptor, resulting in an IMA (illegal memory access) later on when the kernel uses TMA.

This PR: adds an option to `generate_args_decl` to specify whether this is a triton kernel (in which case we should add the global scratch space arg) or not (when we shouldn't add the extra arg).
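
The shape of that option can be sketched as follows. This is a toy under stated assumptions: `toy_generate_args` and its `is_triton_kernel` parameter are hypothetical names, and the real `generate_args_decl` signature and argument encoding are not shown in this message.

```python
def toy_generate_args(call_args, is_triton_kernel=True):
    """Build an argument list, appending a null scratch arg only for triton kernels."""
    args = list(call_args)
    if is_triton_kernel:
        # Triton 3.3 kernels expect a trailing global scratch space argument.
        args.append("nullptr /* global scratch */")
    return args
```

A caller preparing arguments for a non-triton function such as the TMA descriptor initializer would pass `is_triton_kernel=False`, so the argument list keeps exactly the entries the callee expects.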

Note: this doesn't appear in CI because we don't run these tests with Hopper machines in CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149973
Approved by: https://github.com/drisspg
2025-03-26 00:12:02 +00:00
1b373f6cd4 Revert "cpp_wrapper: Fix even more tests (#147225)"
This reverts commit 62d351a35b1bd961afbd09057beec14ff201c41d.

Reverted https://github.com/pytorch/pytorch/pull/147225 on behalf of https://github.com/yangw-dev due to broke [ROCM mi300 test](https://github.com/pytorch/pytorch/actions/runs/14066803692/job/39393110086) in [HUD](https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rocm-mi300%20%2F%20linux-focal-rocm6.3-py3.10%20%2F%20test%20(default%2C%201%2C%206%2C%20linux.rocm.gpu.mi300.2)&mergeLF=true) ([comment](https://github.com/pytorch/pytorch/pull/147225#issuecomment-2752799778))
2025-03-26 00:03:13 +00:00
91bf92597c Revert "cpp_wrapper: precompile a few more commonly used headers, and improve RAIIPyObject interface (#149350)"
This reverts commit 0de70fbbe73d2109497cd57ed5402e0cf9450f18.

Reverted https://github.com/pytorch/pytorch/pull/149350 on behalf of https://github.com/yangw-dev due to broke [ROCM mi300 test](https://github.com/pytorch/pytorch/actions/runs/14066803692/job/39393110086) in [HUD](https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rocm-mi300%20%2F%20linux-focal-rocm6.3-py3.10%20%2F%20test%20(default%2C%201%2C%206%2C%20linux.rocm.gpu.mi300.2)&mergeLF=true) ([comment](https://github.com/pytorch/pytorch/pull/147225#issuecomment-2752799778))
2025-03-26 00:03:13 +00:00
3c85784980 Fix broken LazyLinear init (#149693)
Fixes #149691

I believe it does not negatively impact the fix in https://github.com/pytorch/pytorch/pull/147599, as the tests still pass, but @FFFrog should confirm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149693
Approved by: https://github.com/mikaylagawarecki, https://github.com/FFFrog, https://github.com/malfet
2025-03-25 23:49:49 +00:00
661d74bf44 [inductor] Fix mm logging for torch._scaled_mm (#149967)
Summary:
This pr is just for recreation of the original pr: https://github.com/pytorch/pytorch/pull/149769

Fix for `torch._scaled_mm` op mm logging, which breaks the original brittle underscore parsing
assumptions.
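
To illustrate why underscore-based parsing is brittle for an op such as `torch._scaled_mm`, here is a minimal sketch. The log-key layout (`<op>_<m>_<n>_<k>`) and both helper names are hypothetical, chosen only for illustration; this is not the actual inductor logging code.

```python
def parse_naive(key: str) -> tuple[str, list[int]]:
    # Hypothetical parser: assumes the op name never contains an underscore.
    op, *dims = key.split("_")
    return op, [int(d) for d in dims]


def parse_from_right(key: str, num_dims: int = 3) -> tuple[str, list[int]]:
    # Split from the right so underscores inside the op name survive.
    parts = key.rsplit("_", num_dims)
    return parts[0], [int(d) for d in parts[1:]]
```

With these helpers, `parse_naive("_scaled_mm_128_64_32")` raises `ValueError` because the op name itself is split apart, while `parse_from_right` keeps `_scaled_mm` intact.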

Test Plan: CI

Differential Revision: D71828732

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149967
Approved by: https://github.com/vkuzo
2025-03-25 23:38:35 +00:00
c05328e01a [ROCm] fix uninitialized warning in BFloat16.h (#149868)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149868
Approved by: https://github.com/jeffdaily, https://github.com/cyyever
2025-03-25 23:36:10 +00:00
36eb64d60e [ROCm] missing AT_CUDA_CHECK for cub and SoftMax (#149883)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149883
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007
2025-03-25 23:22:32 +00:00
eqy
de73790fe6 [cuDNN][SDPA] cuDNN SDPA supports head_dim <= 256 on sm90 and sm100 as of 9.5.1+ (#149904)
gqa check PR will go next...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149904
Approved by: https://github.com/drisspg
2025-03-25 23:10:16 +00:00
68b327341c Fix #149806 : Fix path lookup in _preload_cuda_deps (#149808)
@pytorchbot label "bug"

Fixes #149806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149808
Approved by: https://github.com/jansel
2025-03-25 23:03:47 +00:00
ce54c430c0 [Submodule] [cpuinfo] cpuinfo update (#149305)
Updating `cpuinfo` module.

Relevant:
https://github.com/pytorch/cpuinfo/issues/270
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149305
Approved by: https://github.com/malfet
2025-03-25 22:44:50 +00:00
feb503c1df [AOTInductor] Refine error message for dlopen in AOTInductor (#149812)
Summary:
Refine the error message if dlopen failed in AOTInductor.
The original error message was ominous; it is modified to recommend that the user
rebuild AOTInductor if needed, and otherwise it's fine.

Test Plan:
None. Error message change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149812
Approved by: https://github.com/chenyang78, https://github.com/jingsh
2025-03-25 21:45:10 +00:00
0159f8ed54 [ROCm] build magma rocm and upload tarball (#149902)
This will improve docker image build times by not having to rebuild magma rocm for unrelated changes.  This PR is step 1 of 2.  The next step is a second PR to modify the docker image builds to use the magma tarball that this PR will produce.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149902
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-25 21:37:13 +00:00
d3b7cf7b7d Revert "[ROCm] build magma rocm and upload tarball (#149902)"
This reverts commit bf8f4efd3158204592643e6cf26889fff5afcee2.

Reverted https://github.com/pytorch/pytorch/pull/149902 on behalf of https://github.com/seemethere due to This is currently breaking lint see [GH job link](https://github.com/pytorch/pytorch/actions/runs/14069330750/job/39399569526) [HUD commit link](bf8f4efd31) ([comment](https://github.com/pytorch/pytorch/pull/149902#issuecomment-2752594578))
2025-03-25 21:33:00 +00:00
e85ce64bde [MPS/Inductor] Add support for chebyshev_polynomial_t. (#149928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149928
Approved by: https://github.com/malfet
2025-03-25 21:02:13 +00:00
6c9d48b32b refresh results of benchmarks (#149936)
While the test was disabled, I put up a fix, but another win change landed before the test was restored,
so it stayed disabled.
<img width="698" alt="Screenshot 2025-03-24 at 6 26 36 PM" src="https://github.com/user-attachments/assets/2713c685-aee2-4dea-9a6c-cad01ef575cd" />
caused by
https://github.com/pytorch/pytorch/pull/149295

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149936
Approved by: https://github.com/bobrenjc93
2025-03-25 21:01:08 +00:00
90110b069f Use statically known true in should_decompose_mm (#149950)
This meta function is causing recompiles for large ads runs due to overguarding: https://www.internalfb.com/ai_infra/job_inspector/guided/pt2_compile?jobName=aps-ig_fm_v4_pt2_on-6e0a734dcc&jobVersion=0&jobAttempt=0

If we look at the reasons, it's because of this function adding guards: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-ig_fm_v4_pt2_on-6e0a734dcc/attempt_0/version_0/rank_0/-_18_8_0/recompile_reasons_1971.json?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

This PR moves to statically_known_true so we don't overly guard for dynamic shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149950
Approved by: https://github.com/mengluy0125
2025-03-25 20:40:00 +00:00
ce3dc9e346 add some extra test oom skips for jetson due to lacking nvml support (#149587)
Add a couple of Jetson skips for oom tests in test/test_cuda.py due to failures in nvidia CI. Jetson not having full nvml support is a known issue so this is mostly a test side fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149587
Approved by: https://github.com/eqy
2025-03-25 20:39:10 +00:00
b562d22772 test/test_cuda.py: rework TEST_PYNVML logic to make more sense, add not IS_JETSON condition (#149578)
PYNVML related tests in test/test_cuda.py are failing in nvidia internal CI for Jetson devices because Jetson devices don't fully support nvml (it exists as a stub library). In addition to skipping PYNVML tests for Jetson, this PR also reworks the TEST_PYNVML logic a bit to be more consistent with the rest of TEST_{something} conditions in test/test_cuda.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149578
Approved by: https://github.com/janeyx99, https://github.com/eqy
2025-03-25 20:38:15 +00:00
12628ba24d [AOTInductor] Bug fix for freeing buffers when freeing multiple times (#149810)
Summary:
We might free the active buffer if we free the buffer twice.

Test Plan:
```
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/home/$USER/local/pytorch/build/bin/test_aoti_inference
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149810
Approved by: https://github.com/chenyang78
2025-03-25 20:26:36 +00:00
bf8f4efd31 [ROCm] build magma rocm and upload tarball (#149902)
This will improve docker image build times by not having to rebuild magma rocm for unrelated changes.  This PR is step 1 of 2.  The next step is a second PR to modify the docker image builds to use the magma tarball that this PR will produce.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149902
Approved by: https://github.com/malfet
2025-03-25 20:20:36 +00:00
d1ff3ff675 [Bugfix] Add handling for buffer overrides (#149882)
Fixes #139167

This PR:
* uses `named_buffers` to mark static
* Checks that `named_buffers` is of expected type (callable, iterator) before trying to iterate over; if not, we skip this pass

These changes fix the previous errors that caused dynamo to crash (as shown in the issue above)
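
The defensive check described above can be sketched roughly as follows. This is an assumption-laden illustration in plain Python, not the dynamo code: `safe_named_buffers` and the two toy classes are hypothetical names, and real modules would be `torch.nn.Module` instances.

```python
from collections.abc import Iterator


def safe_named_buffers(module):
    """Return (name, buffer) pairs, or [] when `named_buffers` is unusable."""
    named_buffers = getattr(module, "named_buffers", None)
    if not callable(named_buffers):
        return []  # attribute was overridden with a non-callable: skip the pass
    result = named_buffers()
    if not isinstance(result, Iterator):
        return []  # unexpected return type: skip the pass
    return list(result)


class Normal:
    def named_buffers(self):
        yield "running_mean", [0.0]


class Overridden:
    # Shadows the usual method with a plain dict, as user code might do.
    named_buffers = {"running_mean": [0.0]}
```

Here `safe_named_buffers(Normal())` yields the buffer pair, while `safe_named_buffers(Overridden())` returns an empty list instead of raising.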

### Unit Test
```
python test/dynamo/test_buffers_override.py
```

Results in:
```
.
----------------------------------------------------------------------
Ran 2 tests in 5.344s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149882
Approved by: https://github.com/anijain2305
2025-03-25 20:12:43 +00:00
8cd6a133f2 Improve subproc autotuning implementation (#149700)
Summary: The primary change is to update the autotune-in-a-subproc implementation to avoid using multiprocessing spawn. Spawn (re)executes the toplevel script in the subproc, which can be problematic. The approach here is similar to Triton parallel compile: we Popen a subproc on a controlled entry point and communicate over pipes. That change drove a lot of refactoring in the TuningProcess class, so I took the opportunity to simplify some things, rename some methods, etc.

One other notable change is around the timeout / kill approach. After a timeout, we were previously attempting to stop the subproc in three steps (graceful shutdown, sigkill if graceful fails, sigterm if sigkill fails). I'm going to argue that's not useful: 1) The graceful shutdown is never going to work unless the subproc happens to have just completed its task and is ready to receive the next command. 2) If we're going to kill the subproc, let's just take the most aggressive approach and move on as quickly as possible to restarting it rather than waiting to see if previous shutdown attempts succeeded. The only downside that I can find is maybe a little log spew, e.g., `ResourceWarning: subprocess 2987680 is still running`

List of changes:
* Use Popen instead of spawn for the autotuning subprocess.
* Introduced a new entry point `__autotune_main__.py`
* Renamed some TuningProcess methods. For example `shutdown` makes more sense than `terminate` because the latter implies a forced kill.
* Simplified the implementation around benchmarking timeout and how we kill the subproc after a timeout.
* Deprecated the unused timeout configs in `_inductor/config.py`
* Moved `get_ld_library_path` helper to a common utils file.
* Added more unit tests for subproc crashes / timeouts / exceptions, etc.
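
The Popen-plus-pipes approach and the "kill immediately on timeout" policy can be sketched with the standard library alone. This is a simplified illustration, not the `TuningProcess` implementation: `CHILD_SOURCE` and `run_job` are hypothetical, the child is passed inline via `-c` rather than through `__autotune_main__.py`, and one process is started per job whereas the real subprocess is presumably long-lived.

```python
import subprocess
import sys

# Stand-in for a controlled entry point: read one command, do the work, reply.
CHILD_SOURCE = """
import sys, time
delay = float(sys.stdin.readline())
time.sleep(delay)
print(f"done after {delay}")
"""


def run_job(delay: float, timeout: float):
    """Run one job in a fresh subprocess; return its reply, or None on timeout."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        out, _ = proc.communicate(f"{delay}\n", timeout=timeout)
        return out.strip()
    except subprocess.TimeoutExpired:
        proc.kill()         # no graceful shutdown attempt: kill right away
        proc.communicate()  # reap the child and close the pipes
        return None
```

Because the child runs a fixed entry point instead of re-executing the top-level script, nothing in the parent's `__main__` module is re-imported. A caller would treat a `None` result as a failed benchmark and start a new subprocess for the next job.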

Test plan:
* New unit tests
* Also ran internally with all combinations of: build mode `opt` and `dev-nosan`, and `buck run` vs. executing the `.par` file directly.
* Made sure the functionality to parallelize autotuning across different GPUs is working (it wasn't clear to me this was behaving the way we wanted it to).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149700
Approved by: https://github.com/aorenste, https://github.com/jansel, https://github.com/eellison
2025-03-25 20:07:28 +00:00
30e8be599f Revert "[ONNX] Clean up the diagnostics module (#149864)"
This reverts commit cc6e300fe225ac7f34f37494639b061ef45ceeec.

Reverted https://github.com/pytorch/pytorch/pull/149864 on behalf of https://github.com/malfet due to This indeed broke Mac testing see 1c98dc3664/1 ([comment](https://github.com/pytorch/pytorch/pull/149864#issuecomment-2752317873))
2025-03-25 19:31:50 +00:00
1c98dc3664 [dynamo] Fix handling of setattr with some tensor attributes (#149791)
We weren't handling `setattr(tensor_obj, "real", 42)` correctly, because
the attribute is a `GetSetDescriptorType` that has special setter logic.
See added test and comments for more explanations.

This patch makes it so that we graph break in those cases, rather than
resulting in silent incorrectness.
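
As a rough illustration of the kind of check involved, the sketch below detects whether an attribute resolves to a C-level getset descriptor on the object's type. It is an assumption, not the dynamo logic: `has_getset_descriptor` is a hypothetical helper, and `int.real` is used as a stand-in for `torch.Tensor.real` so the example runs without PyTorch.

```python
import types


def has_getset_descriptor(obj, name: str) -> bool:
    """True if `name` resolves on type(obj) to a C-level getset descriptor."""
    for klass in type(obj).__mro__:
        if name in vars(klass):
            return isinstance(vars(klass)[name], types.GetSetDescriptorType)
    return False


class Plain:
    def __init__(self):
        self.real = 1.0  # ordinary instance attribute, no special setter
```

A tracer could use a check like this to decide that `setattr` on such an attribute cannot be modeled as a simple dictionary update and needs a fallback such as a graph break.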

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149791
Approved by: https://github.com/mlazos
ghstack dependencies: #149481
2025-03-25 18:57:56 +00:00
0de70fbbe7 cpp_wrapper: precompile a few more commonly used headers, and improve RAIIPyObject interface (#149350)
Add includes for torch.device, torch.dtype, torch.layout, and torch.memory_format to the cpp_wrapper common header, so that they get precompiled. Additionally, add move constructors and operator bool to RAIIPyObject.

Closes #142005.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149350
Approved by: https://github.com/desertfire
ghstack dependencies: #146706, #147225
2025-03-25 17:58:40 +00:00
62d351a35b cpp_wrapper: Fix even more tests (#147225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147225
Approved by: https://github.com/desertfire
ghstack dependencies: #146706
2025-03-25 17:58:40 +00:00
0f1aaeb62e cpp_wrapper: persist autotune example tensors until last use (#146706)
Patches over an issue where randomly generated example tensors can cause kernel autotuning to fail, when those tensors would not be possible outputs from previous kernels in the sequence. This fixes a failure in `test_torchinductor_opinfo.py` when run with compile-time autotuning, `test_comprehensive_nanquantile_cuda_float64`.

For clarity, the situation triggering this PR looks like kernels `A -> BCDE -> F` (`BCDE` is fused), where one of the outputs from `A` is a boolean tensor describing some of the input data. Previously, we randomly regenerated that boolean tensor and the input data before passing them to `BCDE`, so that they no longer matched. This caused a `tl.device_assert` call in `BCDE` to fail. With this PR, we reuse the random data input to `A` and the output Boolean tensor, such that they match and pass the device assertion in `BCDE`.
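
The failure mode can be mimicked in plain Python. The sketch below is only an analogy under stated assumptions: `kernel_a` and `kernel_bcde` are hypothetical stand-ins for the real Triton kernels, and the `assert` plays the role of `tl.device_assert`.

```python
def kernel_a(data):
    # Produces a boolean mask that describes the input data.
    return [x >= 0 for x in data]


def kernel_bcde(data, mask):
    # Stand-in for the device assertion: the mask must describe the data.
    assert all((x >= 0) == m for x, m in zip(data, mask)), "mask/data mismatch"
    return sum(x for x, m in zip(data, mask) if m)


# Persisting the example tensors: the same data flows from A into BCDE.
data = [1.0, -2.0, 3.0]
mask = kernel_a(data)
result = kernel_bcde(data, mask)
```

Passing freshly generated data together with the old mask (for example `kernel_bcde([-1.0, 2.0, 3.0], mask)`) trips the assertion, which mirrors the autotuning failure described above.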

Fixes #147799.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146706
Approved by: https://github.com/desertfire
2025-03-25 17:58:40 +00:00
8d1db7f39d [MPS][BE] Add c10/metal/common.h (#149955)
That could be shared between host and metal code
So far it contains only one constant, which is the maximum number of tensor dimensions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149955
Approved by: https://github.com/Skylion007, https://github.com/manuelcandales
2025-03-25 17:37:24 +00:00
cc6e300fe2 [ONNX] Clean up the diagnostics module (#149864)
Remove the diagnostics/SARIF module from ONNX exporter because it is obsolete and unused.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149864
Approved by: https://github.com/titaiwangms
2025-03-25 16:58:46 +00:00
84ae056d82 [invoke_subgraph] Support pending unbacked symint (#149297)
The "PendingUnbackedSymbolNotFound" error is when an unbacked symbol is created within a piece of code, but this symbol never appears in any of the outputs. I believe the original intention is to help catch incorrectly written meta kernels, where users might've unintentionally created an unbacked symbol but never used it anywhere, but in our case this is intentional. An example is the following test case:

```python
    def test_pending_unbacked(self):
        class M(torch.nn.Module):
            @mark_compile_region
            def gn(self, x):
                u = x[0].item()
                return x * u

            def forward(self, x):
                for _ in range(4):
                    x = self.gn(x)
                return x

        torch._dynamo.config.capture_scalar_outputs = True
        torch.compile(M())(torch.randn(8))
```

This fails with the error:
```
torch._dynamo.exc.InternalTorchDynamoError: PendingUnbackedSymbolNotFound: Pending unbacked symbols {zuf1} not in returned outputs (FakeTensor(..., size=(8,)),) .
```

In this case, creating the unbacked symbol is intentional, so we can bypass this using `fake_mode.shape_env.ignore_fresh_unbacked_symbols()`.

Differential Revision: [D71298926](https://our.internmc.facebook.com/intern/diff/D71298926)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149297
Approved by: https://github.com/zou3519
ghstack dependencies: #149296
2025-03-25 16:42:58 +00:00
8be1bf1dbb [export] Add mark_compiled_region support (#149296)
Differential Revision: [D71298930](https://our.internmc.facebook.com/intern/diff/D71298930)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149296
Approved by: https://github.com/zou3519
2025-03-25 16:42:58 +00:00
5c19952c83 cd: Restore windows release builds for libtorch (#149863)
These were accidentally deleted in the refactor of DEVTOOLSET +
cxx11abi.

This happened because the `build_environment` variable wasn't aware of the `build_variant` for libtorch and subsequently overwrote the original file twice, leaving the last written as the actual workflow (which in this case was the debug builds).

One thing this has made me curious about is whether we actually need `debug` builds for Windows at all. We don't release them for Linux, and I'd bet that they have low download numbers anyway, so maybe it makes sense to cut them.

Adds a build_variant parameter to the dataclass so that we can extend
these easily in the future if we want.
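
A minimal sketch of how such an overwrite can happen, and how a variant field avoids it, is shown below. All names here (`BinaryWorkflow`, the `filename` format, the field names) are hypothetical and are not taken from the actual workflow generation script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryWorkflow:
    os: str
    package_type: str
    build_variant: str = ""  # e.g. "release" or "debug"; hypothetical field

    @property
    def build_environment(self) -> str:
        parts = [self.os, "binary", self.package_type]
        if self.build_variant:
            parts.append(self.build_variant)
        return "-".join(parts)

    @property
    def filename(self) -> str:
        # One generated workflow file per distinct build environment.
        return f"generated-{self.build_environment}.yml"
```

Without the variant in the name, a release and a debug configuration map to the same file, so whichever is written last wins; with it, each variant gets its own file.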

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149863
Approved by: https://github.com/malfet, https://github.com/atalman
2025-03-25 16:23:59 +00:00
f0ca0d45a6 [CI] Add MacOS-M2-15 as MPS test target on trunk (#149900)
Now that we have runners allocated by AWS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149900
Approved by: https://github.com/ZainRizvi, https://github.com/seemethere
2025-03-25 16:19:35 +00:00
2cc3f5030a Add XPU and SYCL Merge Patterns (#149933)
As the title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149933
Approved by: https://github.com/atalman
2025-03-25 16:03:29 +00:00
43ee67e8dc Removing doc references to PRE_CXX11_ABI. (#149756)
Fixes #149550

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149756
Approved by: https://github.com/svekars, https://github.com/atalman
2025-03-25 16:01:59 +00:00
5dca832257 Add smoke test to validate pypi env version vs torch compiled and installed versions of nccl and cudnn (#149885)
Followup after nccl update to validate both cudnn and nccl versions in nightly and release pipelines.

Tested on local dev machine, output.
Success:
```
Found matching cudnn. Torch: 9.5.1 PyPI 9.5.1.17
Found matching nccl. Torch: 2.25.1 PyPI 2.25.1
```

Failure:
```
Traceback (most recent call last):
  File "test1.py", line 29, in <module>
    compare_pypi_to_torch_versions("nccl", find_pypi_package_version("nvidia-nccl"), torch_nccl_version)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/test1.py", line 24, in compare_pypi_to_torch_versions
    raise RuntimeError(
        f"Wrong {package} version. Torch: {torch_version} PyPI: {pypi_version}"
    )
RuntimeError: Wrong nccl version. Torch: 2.25.1 PyPI: 2.26.2
```
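
For readers who want the gist of the check, a sketch reconstructed from the output above might look like this. The function name and message strings come from the traceback and the success output; the matching rule (the PyPI version must start with the version torch reports) is an assumption inferred from `9.5.1` matching `9.5.1.17`, and the real smoke test may differ.

```python
def compare_pypi_to_torch_versions(
    package: str, pypi_version: str, torch_version: str
) -> None:
    # Assumed rule: PyPI versions may carry an extra build component,
    # so compare by prefix rather than by strict equality.
    if pypi_version.startswith(torch_version):
        print(f"Found matching {package}. Torch: {torch_version} PyPI {pypi_version}")
    else:
        raise RuntimeError(
            f"Wrong {package} version. Torch: {torch_version} PyPI: {pypi_version}"
        )
```

A plain prefix test would also accept `2.25.10` for `2.25.1`, so a stricter comparison of dotted components may be preferable in practice.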
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149885
Approved by: https://github.com/malfet, https://github.com/ZainRizvi, https://github.com/d4l3k
2025-03-25 15:57:53 +00:00
d90d83c484 [torch] Fix unsafe concurrent access to autocast_enabled (#148281)
Summary: Making autocast_enabled atomic, as it can be accessed from multiple threads

Differential Revision: D70456813

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148281
Approved by: https://github.com/davidberard98
2025-03-25 14:46:12 +00:00
a2bba53f87 Improve error message when view of intermediate is returned from autograd.Function and marked dirty (#149543)
Fixes https://github.com/pytorch/pytorch/issues/149252
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149543
Approved by: https://github.com/zou3519
ghstack dependencies: #149220
2025-03-25 14:44:11 +00:00
7b218ca874 Revert "[BE] Replace XPU support packages installation to offline mode in Linux CI/CD (#149843)"
This reverts commit 86dcdf9c8bb8f69c5d28184b31ee6d7f19127d67.

Reverted https://github.com/pytorch/pytorch/pull/149843 on behalf of https://github.com/malfet due to This breaks XPU builds, see 23183fef7e/1 ([comment](https://github.com/pytorch/pytorch/pull/149843#issuecomment-2751482412))
2025-03-25 14:39:10 +00:00
29b3f409c2 [BE][CI] Update actionlint to 1.7.7 (#149919)
- fix anti-pattern started by https://github.com/pytorch/pytorch/pull/81922 when x86 actionlint binaries were placed in Linux-arm64 folder
- Fix remaining lint violations, namely
```
>>> Lint for .github/workflows/_linux-test.yml:

  Error (ACTIONLINT) [expression]
    property "workspace" is not defined in object type {arch: string; debug:
    string; environment: string; name: string; os: string; temp: string;
    tool_cache: string}

        446  |        if: failure() && steps.install-nvidia-driver.outcome && steps.install-nvidia-driver.outcome != 'skipped'
        447  |        shell: bash
        448  |        env:
    >>> 449  |          RUNNER_WORKSPACE: ${{ runner.workspace }}
        450  |        run: |
        451  |          set +e
        452  |          set -x

>>> Lint for .github/workflows/create_release.yml:

  Error (ACTIONLINT) [deprecated-commands]
    workflow command "set-output" was deprecated. use `echo "{name}={value}"
    >> $GITHUB_OUTPUT` instead: https://docs.github.com/en/actions/using-
    workflows/workflow-commands-for-github-actions

         80  |          path: ${{ env.PT_RELEASE_FILE }}
         81  |      - name: Set output
         82  |        id: release_name
    >>>  83  |        run: echo "::set-output name=pt_release_name::${{ env.PT_RELEASE_NAME }}.tar.gz"
         84  |
         85  |  upload_source_code_to_s3:
         86  |    if: ${{ github.repository == 'pytorch/pytorch' && github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v') && contains(github.ref, 'rc') }}

>>> Lint for .github/workflows/target-determination-indexer.yml:

  Error (ACTIONLINT) [shellcheck]
    shellcheck reported issue in this script: SC2086:info:3:3: Double quote to
    prevent globbing and word splitting

         98  |          DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
         99  |          GITHUB_RUN_ID: ${{ github.run_id }}
        100  |          AWS_DEFAULT_REGION: us-east-1
    >>> 101  |        run: |
        102  |          # detached container should get cleaned up by teardown_ec2_linux
        103  |          container_name=$(docker run \
        104  |            ${GPU_FLAG:-} \
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149919
Approved by: https://github.com/jeanschmidt, https://github.com/atalman, https://github.com/Skylion007
ghstack dependencies: #149917, #149918, #149922
2025-03-25 14:37:10 +00:00
6c7f9f7e7d [CI][BE] Update other actions (#149922)
Discovered by actionlint-1.7.7:
- `actions/checkout@v3`->`actions/checkout@v4`
- `actions/setup-python@v4` -> `actions/setup-python@v5`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149922
Approved by: https://github.com/Skylion007
ghstack dependencies: #149917, #149918
2025-03-25 14:37:10 +00:00
535885dc8d [BE][CI] Update configure-aws-credential to v4 (#149918)
Prerequisite for update to actionlint-1.7.7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149918
Approved by: https://github.com/Skylion007
ghstack dependencies: #149917
2025-03-25 14:37:02 +00:00
f63b03e9fc [BE] Add Mac ARM64 actionlint binary (#149917)
Downloaded from https://github.com/rhysd/actionlint/releases/tag/v1.6.21
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149917
Approved by: https://github.com/Skylion007
2025-03-25 14:36:54 +00:00
23183fef7e [Test] Add simple MPS op benchmarks (#149914)
Lots of benchmark tests have been posted in PRs, but they might get lost over time
So let's create a benchmark and populate it with results (preferably from the run on CI machine)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149914
Approved by: https://github.com/dcci, https://github.com/cyyever
2025-03-25 11:31:27 +00:00
86dcdf9c8b [BE] Replace XPU support packages installation to offline mode in Linux CI/CD (#149843)
To ensure the build environment is stable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149843
Approved by: https://github.com/EikanWang
2025-03-25 09:11:35 +00:00
86fbbe44cc Improve error message for CUDAGuardImpl, MPSGuardImpl, XPUGuardImpl (#149838)
Fixes #149822

Will get:

```
RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED at "/home/jyh/workspace/pytorch/c10/cuda/impl/CUDAGuardImpl.h":28, please report a bug to PyTorch. CUDAGuardImpl initialized with non-CUDA DeviceType: cpu
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149838
Approved by: https://github.com/Skylion007, https://github.com/guangyey
2025-03-25 07:29:53 +00:00
a89bdc0565 [Hierarchical Compilation] Handle origin nodes without children (#149685)
Bug discovered running Hierarchical Compilation on HF.

I don't have a smaller repro for this unfortunately.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149685
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
2025-03-25 07:27:11 +00:00
5a7588f183 [Build] Remove pre-CXX11 ABI logic from build script (#149888)
Only keep one in check_binary_symbols to make sure there are no pre-CXX11 ABI symbols in the library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149888
Approved by: https://github.com/atalman, https://github.com/seemethere
ghstack dependencies: #149887
2025-03-25 03:17:16 +00:00
280e48739a [ONNX] Set is_in_onnx_export for dynamo=True (#149678)
Fixes #149141

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149678
Approved by: https://github.com/justinchuby
2025-03-25 03:16:23 +00:00
27657a00d9 Demote logger of runtime_asserts_frozen to be fired only on debug mode (#149832)
Differential Revision: [D71702305](https://our.internmc.facebook.com/intern/diff/D71702305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149832
Approved by: https://github.com/malfet
2025-03-25 02:29:13 +00:00
FEI
59d5cf083b update torch.nn.ReplicationPad{1,2,3}d deterministic documentation (#148633)
https://github.com/pytorch/pytorch/issues/115395
This issue mentioned that, when deterministic mode is turned on, a decomp was added for replication_pad_{1,2,3}d
to make the backward function deterministic.
@malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148633
Approved by: https://github.com/isuruf
2025-03-25 02:01:31 +00:00
d4c578082a [DCP] Cache save plan metadata to reduce the collective overhead (#149785)
Summary:
Cache save plan metadata to reduce the collective overhead.

Global plan dedupe and metadata creation are the main overheads on Rank 0. This change saves all this cost for subsequent saves if the plans do not change. In a quick experiment with a 256-rank job, global step overhead drops by ~99%, from 90s+ to a mere 1.5s. The 1.5s was mostly spent on creating the checkpoint module directories and a near-empty collective.
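
The caching idea can be sketched as follows. This is a toy model under stated assumptions, not the DCP planner: `CachingPlanner`, its attributes, and the list-of-strings plan representation are all hypothetical.

```python
class CachingPlanner:
    def __init__(self):
        self._cached_local_plans = None
        self._cached_global_plan = None
        self.global_steps = 0  # counts how often the expensive path runs

    def create_global_plan(self, local_plans):
        # Fast path: plans unchanged since the last save, reuse the result.
        if local_plans == self._cached_local_plans:
            return self._cached_global_plan
        self.global_steps += 1
        # Expensive path (stand-in): dedupe items so each is saved by one rank.
        seen, global_plan = set(), []
        for plan in local_plans:
            kept = [item for item in plan if item not in seen]
            seen.update(kept)
            global_plan.append(kept)
        self._cached_local_plans = [list(p) for p in local_plans]
        self._cached_global_plan = global_plan
        return global_plan
```

Repeated saves with identical local plans hit the fast path, so the dedupe and metadata work is paid only once until a plan changes.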

Differential Revision: D71631441

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149785
Approved by: https://github.com/MeetVadakkanchery
2025-03-25 02:00:15 +00:00
dc39e673e2 Remove aten.elu core ATen decomp because it is now core ATen (#149780)
Per @larryliu0820.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149780
Approved by: https://github.com/larryliu0820
2025-03-25 01:59:57 +00:00
84684e9397 [sigmoid] Fix scalar resolution for Scalar_mode aten ops. (#149755)
Summary: For Scalar variant resolution, we didn't handle a corner case of "Tensor_mode" variant (from aten::div). Adding the missing case to the graph pass.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_operator_aten_tensor_mode_variant_cpp_runtime

Differential Revision: D71638433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149755
Approved by: https://github.com/yushangdi
2025-03-25 01:17:36 +00:00
159e97cbcf ProcessGroupGloo: support reduce_scatter + update support chart (#149869)
This adds a `reduce_scatter` implementation for ProcessGroupGloo. This is a pretty naive implementation as it does 1 allreduce per rank but may be useful for testing in FSDP etc. There was an existing implementation of reduce_scatter_tensor/reduce_scatter_tensor_coalesced that is very similar but requires a fixed tensor size per rank.
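
The "one allreduce per rank" strategy can be simulated in a single process. The sketch below is an assumption-based model with plain lists standing in for tensors and ranks; `allreduce` and `reduce_scatter` here are hypothetical helpers, not the ProcessGroupGloo code.

```python
def allreduce(values_per_rank):
    # Every rank ends up with the elementwise sum of all contributions.
    total = [sum(col) for col in zip(*values_per_rank)]
    return [list(total) for _ in values_per_rank]


def reduce_scatter(inputs_per_rank):
    """inputs_per_rank[r][i] is rank r's contribution destined for rank i."""
    world_size = len(inputs_per_rank)
    outputs = [None] * world_size
    for dst in range(world_size):
        # One allreduce per destination rank over that rank's chunk.
        reduced = allreduce([inputs_per_rank[r][dst] for r in range(world_size)])
        outputs[dst] = reduced[dst]  # only `dst` keeps this result
    return outputs
```

This does `world_size` allreduces where a dedicated algorithm would move less data, which is why the approach is described as naive. Since each chunk is reduced independently, chunks for different ranks may have different sizes.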

If users find these functions to be too slow we can address them as issues arise.

Gloo now supports all major distributed operations. Quite a few of these were added by @rohan-varma and @yifuwang but they didn't update the support chart. We also have `CUDAWork` variants of most operations so those were also added to the chart.

Test plan:

```
pytest -v test/distributed/test_c10d_gloo.py -k reduce_scatter
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149869
Approved by: https://github.com/fduwjj
2025-03-25 01:16:12 +00:00
5af9cb12b7 [ROCm] Extend vectorized elementwise kernel to more heterogenous tensor types. (#149738)
This patch extends the initial support for "vectorized templated" kernels to the following input tensor types:
- (BFloat16, float)
- (float, float16)
- (float16, float)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149738
Approved by: https://github.com/jeffdaily
2025-03-25 01:10:01 +00:00
2a9e737839 [caffe2] Do not use --no-as-needed on macOS (#149421)
Summary:
`--no-as-needed` is not available in ld64.lld

Applying this on all macos is potentially too broad? I am not sure if `fbcode//mode/mac` uses a different linker, but arvr mode for sure uses ld64.lld.

Test Plan: CI / used for a macOS build on top of the stack.

Differential Revision: D71315125

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149421
Approved by: https://github.com/colesbury
2025-03-25 00:41:09 +00:00
1cee6c37cc add bobren and laithsakka as ds owners (#149873)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149873
Approved by: https://github.com/laithsakka
2025-03-25 00:14:04 +00:00
23855391f1 Add regression tests for 3 missing PR-time benchmarks (#149423)
Uses values from the latest PR-time benchmark run on viable/strict. See https://github.com/pytorch/pytorch/actions/runs/13898520615/job/38900894469 for a job showing why this is needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149423
Approved by: https://github.com/laithsakka
2025-03-24 23:39:36 +00:00
ba46643df1 [MPS] tril op not handling infs correctly (#149866)
Fixes #149813

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149866
Approved by: https://github.com/malfet
2025-03-24 23:38:41 +00:00
51f91e3428 [CD] Check that nightly x86 binaries are built with gcc-11 (#149887)
Though they should have been with gcc-14, per https://github.com/pypa/manylinux?tab=readme-ov-file#manylinux_2_28-almalinux-8-based
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149887
Approved by: https://github.com/atalman, https://github.com/seemethere
2025-03-24 23:22:19 +00:00
f320c7b766 Rename README.txt to README.md (#149811)
I am 99% sure this is meant to be a .md file rather than a .txt file

Fixes an issue with viewing the README on github, idk what else this accomplishes but it's been bothering me

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149811
Approved by: https://github.com/colesbury
2025-03-24 22:33:33 +00:00
490ce7e67c [sigmoid] Support _operator.neg/truediv (#149754)
Summary: adding operator.truediv and operator.neg support to the runtime

Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_sym_float_operators_cpp_runtime_nonstrict

Differential Revision: D71637267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149754
Approved by: https://github.com/pianpwk
2025-03-24 22:15:25 +00:00
e77ca19999 [Inductor-CPU] Fix int8 WoQ AMX micro-kernel when block_n is 16 or 48 (#149359)
### Summary

When the block-size for `N` dimension is `48` for the AMX GEMM micro-kernel for int8 WoQ (BF16 activation, int8 statically quantized weights), the logic for handling the tail is incorrect - we can't always dequantize 32 elements of weights at a time because we may need to dequantize `32` followed by `16` when `block_n` is `48` (for each `K`).

This PR fixes that logic, which was initially exposed with `M=17, N=1024, K=1024`.
This PR also fixes the case of `block_n` being 16.
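
The tail handling described above amounts to splitting `block_n` into chunks of 32 with an optional trailing chunk of 16. The helper below is a hypothetical Python sketch of that arithmetic only; the actual fix lives in the generated C++ micro-kernel.

```python
def dequant_chunks(block_n: int) -> list[int]:
    """Chunk widths to dequantize per K row: 32-wide chunks, then a 16-wide tail."""
    if block_n <= 0 or block_n % 16 != 0:
        raise ValueError("block_n is assumed to be a positive multiple of 16")
    chunks = [32] * (block_n // 32)
    if block_n % 32:
        chunks.append(16)  # tail that the always-32 logic overlooked
    return chunks
```

For example, `dequant_chunks(48)` gives `[32, 16]` and `dequant_chunks(16)` gives `[16]`, the two cases the old always-32 assumption handled incorrectly.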

I had introduced [this bug](ca9813ea14) after misreading GEMM blockings as `["block_m", "block_k", "block_n"]` instead of `["block_m", "block_n", "block_k"]` (so I had wrongly assumed that `block_n` was always 32).

### Future work

While this PR simply fixes a bug, it's possible to optimize the code pertaining to dequantizing & caching the B buffer - for `block_n` being `16` or `48`, `K` would always be a multiple of 2, so `K * block_n` will always be a multiple of 32. Since `dequantized_B_buf` stores rows contiguously, when `block_n` would be `16` or `48`, we could store 32 BF16 elements at a time instead of storing `16` at a time (when `block_n` is 16), or `32` followed by `16` at a time (when `block_n` is 48). Such an optimization would lower `register -> memory` data movements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149359
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2025-03-24 21:27:46 +00:00
49f86a939c [AOTAutogradCache] Allow Custom Autograd functions behind a flag (#149751)
This adds a new env var and flag,

autograd_cache_allow_custom_autograd_functions, (env var: `TORCHINDUCTOR_AUTOGRAD_CACHE_ALLOW_CUSTOM_AUTOGRAD`) which allows custom autograd functions into AOTAutogradCache.

@hirsheybar and I worked together to verify that the higher order op AutogradFunctionApply is pure with respect to the dynamo input being passed in, so this *should* be safe. I'm still putting it behind a flag and turning it on slowly, first on an internal model, though. Once we verify that it is correct on the internal model we can work to enable the flag by default.

Differential Revision: [D71633184](https://our.internmc.facebook.com/intern/diff/D71633184/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149751
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
2025-03-24 21:12:11 +00:00
ae6158500a [dynamo] fix calling torch function on newly constructed tensor subclass (#149481)
This patch updates existing `test_return_..._subclass` tests in
`test/dynamo/test_subclasses.py`, so that they end up invoking the
`__torch_function__` method of the newly constructed tensor subclass
instances.

This exposes a bug in `TensorVariable.method_as_subclass`, where it
forgot to grab the `__func__` out of `__torch_function__`, which led to
an error down the line.

This patch fixes `TensorVariable.method_as_subclass` by centralizing how
we extract and wrap torch function, in `build_torch_function_fn`.
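
The role of `__func__` can be shown with plain Python, independent of Dynamo. This is a minimal sketch of the language mechanism only: `Base`, `Sub`, and `handler` are hypothetical stand-ins for a tensor subclass and its `__torch_function__` classmethod, and the code is not the logic in `build_torch_function_fn`.

```python
class Base:
    @classmethod
    def handler(cls, tag):
        # Report which class the call was dispatched on.
        return f"{cls.__name__}:{tag}"


class Sub(Base):
    pass


# Accessing a classmethod through a class yields a method already bound to
# that class, so `cls` cannot be supplied by the caller.
bound = Sub.handler
print(bound("x"))

# `__func__` is the underlying plain function; the caller passes `cls`
# explicitly, which is what a tracer needs when it wants to call the
# function with its own arguments.
raw = Sub.handler.__func__
print(raw(Base, "x"))
```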

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149481
Approved by: https://github.com/jansel
2025-03-24 21:07:41 +00:00
f12969421e [DYNAMO] [BUG FIX] correct casting to boolean for TORCH_COMPILE_DISABLE (#149852)
Fixes #149840
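
The commit message does not show the change itself. As a general illustration of the kind of pitfall the title refers to, the sketch below shows why calling `bool()` on an environment-variable string is unreliable. `env_flag` and the set of accepted spellings are assumptions made for this example, not the code from the PR.

```python
def env_flag(value, default=False):
    """Interpret an environment-variable string as a boolean.

    Hypothetical helper: the accepted spellings are an assumption.
    """
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


# Any non-empty string is truthy, so bool() treats "0" as enabled.
print(bool("0"))
print(env_flag("0"))
print(env_flag("1"))
```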

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149852
Approved by: https://github.com/jingsh
2025-03-24 20:50:44 +00:00
b248edd7cc ProcessGroupGloo: support ReduceOp::AVG (#149781)
This adds AVG support to ProcessGroupGloo to better support FSDP on CPU. I expect there will be more issues but this is easy enough to support in a naive fashion.

This applies to both reduce and allreduce.

This is a simple SUM + division and may not be the most numerically stable but that's expected. FSDP for low precision data types implements pre/post divide and uses SUM instead.
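
The "SUM + division" approach can be sketched without any distributed backend. The functions below simulate the ranks as a list of per-rank buffers; `allreduce_sum` and `allreduce_avg` are hypothetical names for this illustration and are not the Gloo implementation.

```python
def allreduce_sum(per_rank):
    """Simulated allreduce: every rank ends up with the elementwise sum."""
    total = [sum(values) for values in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def allreduce_avg(per_rank):
    """AVG expressed as SUM followed by division by the world size."""
    world_size = len(per_rank)
    return [[v / world_size for v in buf] for buf in allreduce_sum(per_rank)]


buffers = [[1.0, 2.0], [3.0, 6.0]]  # one buffer per simulated rank
print(allreduce_avg(buffers))
```

Summing first and dividing afterwards means a low-precision sum can lose accuracy or overflow before the division is applied, which is the stability caveat mentioned above.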

Test plan:

```
pytest -v test/distributed/test_c10d_gloo.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149781
Approved by: https://github.com/fduwjj
2025-03-24 20:29:30 +00:00
40ec9d2bfa avoid allocation when tensor_new from storage (#149797)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149797
Approved by: https://github.com/Skylion007
2025-03-24 20:02:45 +00:00
112f983056 [MPS] Replace indexed with strided flavor (#149730)
This makes non-contiguous operations much faster for larger tensors. For example, `fmax` of 1000x1000 strided tensors takes 270ms with the new algorithm and 430ms with the old one, which needed an additional tensor of 3e6 elements to function.

TODO: Add 64-bit indexing logic, as current implementation has the same limitation as `generateKernelDataOffsets`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149730
Approved by: https://github.com/dcci, https://github.com/manuelcandales
2025-03-24 19:37:51 +00:00
9179178728 [MPS] Add support for chebyshev_polynomial_t in eager. (#149816)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149816
Approved by: https://github.com/malfet
2025-03-24 19:19:55 +00:00
1e5a561c13 [ca] fix accumulate grad polyfill when different strides between param and grad (#149651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149651
Approved by: https://github.com/jansel
ghstack dependencies: #149647, #149709
2025-03-24 19:06:45 +00:00
754875e237 [ca] API comments and support dynamic shapes via configs (#149709)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149709
Approved by: https://github.com/jansel
ghstack dependencies: #149647
2025-03-24 19:06:45 +00:00
86ee3bf3d5 [ca] use torch.compile ca API for benchmarks (#149647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149647
Approved by: https://github.com/jansel
2025-03-24 19:06:45 +00:00
71145059c8 Allow rebuild of triton on workflow_dispatch (#149865)
Allows rebuilding triton from main.
The latest triton build failed: https://github.com/pytorch/pytorch/actions/runs/13984299781/job/39298288914
The PR that caused the failure was reverted: https://github.com/pytorch/pytorch/pull/148419
We now need to rebuild triton.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149865
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-03-24 18:17:47 +00:00
bada898f5e Revert "Extend vec backend with BF16 SVE intrinsics (#143666)"
This reverts commit d072254eaea325a507c1498431e4c8294205fe2d.

Reverted https://github.com/pytorch/pytorch/pull/143666 on behalf of https://github.com/malfet due to I'm unsure why this PR got merged, as it doesn't have a valid review ([comment](https://github.com/pytorch/pytorch/pull/143666#issuecomment-2749013169))
2025-03-24 18:13:50 +00:00
5beb5b7e47 [torch/c10d] change class variable from private to protected (#149579) (#149645)
Summary:

Change class variable from private to protected in ProcessGroupNCCL

Test Plan: Existing UT Pass.

Reviewed By: kingchc, kwen2501

Differential Revision: D71373067

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149645
Approved by: https://github.com/kwen2501
2025-03-24 17:58:54 +00:00
d0c06c4533 [ROCm] Update libamd_comgr.so file in triton wheel build (#149855)
When building Triton in the Triton-ROCm wheel build flow, ROCm 6.4 and newer no longer ship **libamd_comgr.so.2**; the .so file has been updated to **libamd_comgr.so.3**. We conditionalize on which ROCm version the wheel build is for, and choose the .so accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149855
Approved by: https://github.com/Skylion007, https://github.com/jeffdaily
2025-03-24 17:51:14 +00:00
60f31f551e Only print dde partial fx graph for export (#149831)
Lazos correctly pointed out this doesn't make sense for compile since
we graph break in compile. This results in tons of unwanted user log
spew. We do want this in export though since it's drastically reduced
the support load for DDEs. This PR does the refactor to keep it in
export but remove it from compile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149831
Approved by: https://github.com/mlazos
2025-03-24 17:46:18 +00:00
42e7bda53e Revert "[export] Save unflattened gm (#149717)"
This reverts commit 1e159db57c611b98a531341927b2d01f39383f7a.

Reverted https://github.com/pytorch/pytorch/pull/149717 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/149717#issuecomment-2748924563))
2025-03-24 17:41:01 +00:00
6608d4e3e9 [dynamo] keep chained exceptions in user-facing tracebacks (#149676)
This preserves graph breaks in the case that one graph break directly causes another, e.g. graph breaks in generic context managers.

```python
import torch

class CtxMgr:
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        pass

@torch.compile(backend="eager", fullgraph=True)
def fn():
    with CtxMgr():
        with CtxMgr():
            pass
        with CtxMgr():
            with CtxMgr():
                pass
            torch._dynamo.graph_break()

fn()
```

Output:
```
torch._dynamo.exc.Unsupported: Call to `torch._dynamo.graph_break()`
  Explanation: User-inserted graph break. Message: None
  Hint: Remove the `torch._dynamo.graph_break()` call.

  Developer debug context: Called `torch._dynamo.graph_break()` with args `[]`, kwargs `{}`

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/users/williamwen/pytorch/playground.py", line 23, in <module>
    fn()
  File "/data/users/williamwen/pytorch/torch/_dynamo/eval_frame.py", line 664, in _fn
    raise e.with_traceback(None) from e.__cause__
torch._dynamo.exc.Unsupported: Graph break under GenericContextWrappingVariable
  Explanation: Attempted to graph break in an active context manager(s) that doesn't support graph breaking.
  Hint: Move the offending context manager(s) to outside the compiled region.
  Hint: This graph break may have been caused by an earlier graph break. Resolving the earlier graph break may resolve this one.

  Developer debug context: Active generic context managers: [GenericContextWrappingVariable(CtxMgr), GenericContextWrappingVariable(CtxMgr)]

from user code:
   File "/data/users/williamwen/pytorch/playground.py", line 20, in fn
    torch._dynamo.graph_break()

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
```

Note in particular that both graph breaks (torch._dynamo.graph_break and graph break in context manager) are present in the logs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149676
Approved by: https://github.com/jansel, https://github.com/zou3519, https://github.com/anijain2305
2025-03-24 17:36:13 +00:00
1e159db57c [export] Save unflattened gm (#149717)
Test Plan: CI

Differential Revision: D71082652

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149717
Approved by: https://github.com/pianpwk
2025-03-24 17:25:25 +00:00
0a0a73a9a9 [cond] don't trace fw and bw graph in autograd key (#148930)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148930
Approved by: https://github.com/zou3519
2025-03-24 17:07:29 +00:00
9bae904cb4 [inductor] fix combo_kernel logging #2 (#149772)
Summary:
fix another combo kernel logging error:

  File "/home/guorachel/local/fbsource/buck-out/v2/gen/fbcode/4bcbfa3ef39dbd6f/caffe2/test/inductor/__combo_kernels__/combo_kernels#link-tree/torch/_inductor/scheduler.py", line 2036, in _init
    self.create_combo_kernel_nodes(num_ck_nodes=None)
  File "/home/guorachel/local/fbsource/buck-out/v2/gen/fbcode/4bcbfa3ef39dbd6f/caffe2/test/inductor/__combo_kernels__/combo_kernels#link-tree/torch/_inductor/scheduler.py", line 3068, in create_combo_kernel_nodes
    log.debug("ComboKernels: Generating with num_ck_nodes = %d...", num_ck_nodes)
Message: 'ComboKernels: Generating with num_ck_nodes = %d...'
Arguments: (None,)
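
The failure above can be reproduced with ordinary `%`-formatting, which is what the `logging` module applies to the message and its arguments. The sketch below is illustrative only: `format_count` is a hypothetical helper, and switching to `%s` is one possible remedy, not necessarily the change made in this PR.

```python
def format_count(template, value):
    """Apply %-formatting the way logging does; return None on failure."""
    try:
        return template % (value,)
    except TypeError:
        # "%d" requires a number, so None cannot be formatted.
        return None


print(format_count("num_ck_nodes = %d", None))  # formatting fails
print(format_count("num_ck_nodes = %s", None))  # "%s" accepts any object
print(format_count("num_ck_nodes = %d", 3))
```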

Test Plan:
Verified in test_combo_kernel.py

the logging error went away.

Differential Revision: D71655949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149772
Approved by: https://github.com/ColinPeppler, https://github.com/Skylion007
2025-03-24 16:57:45 +00:00
453da423d4 Revert "ci: Add sccache to manylinux images (#148419)"
This reverts commit 1099c371505a6a3e3cab69e5afca1e747f2215a4.

Reverted https://github.com/pytorch/pytorch/pull/148419 on behalf of https://github.com/atalman due to Breaks triton build ([comment](https://github.com/pytorch/pytorch/pull/148419#issuecomment-2748759515))
2025-03-24 16:43:26 +00:00
a439524be6 [inductor] Add the largest matmul tile size to default tuning set (#149790)
While we probably don't want to expand the set of default matmul tunings too much, this is the largest tile size usable by H100 and A100, and is usually the top performing tile size for large matmuls.  E.g. on H100 adding this tile size improves perf of multiplying 8192-square matrices from 600->700 tflops.  (cuBLAS 12.6 gets 780, so Triton still isn't SOTA, but closer)
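
To put the throughput figures in terms of wall-clock time, here is a small worked calculation. It assumes the usual convention of counting `2 * n**3` floating-point operations for an `n`-square matmul (one multiply and one add per inner-product term); the helper names are hypothetical.

```python
def matmul_flops(n: int) -> int:
    """FLOPs for an n x n by n x n matmul, counting multiply and add separately."""
    return 2 * n**3


def time_ms(n: int, tflops: float) -> float:
    """Time in milliseconds to run the matmul at a sustained rate in TFLOP/s."""
    return matmul_flops(n) / (tflops * 1e12) * 1e3


for rate in (600, 700, 780):  # rates quoted in the commit message
    print(rate, round(time_ms(8192, rate), 3))
```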

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149790
Approved by: https://github.com/jansel
2025-03-24 16:32:53 +00:00
db92d0f388 A bunch of typos (#149404)
Improves readability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149404
Approved by: https://github.com/soulitzer
2025-03-24 16:16:04 +00:00
ddc0fe903f ci/docker: use NCCL 2.26.2-1 (#149778)
Related to #149153

This updates some build scripts to hopefully fix the nightly builds which are somehow building against nccl 2.25.1 and using 2.26.2 from pip.

Test plan:

After merging rerun nightly linux jobs and validate that nccl version matches
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149778
Approved by: https://github.com/Skylion007, https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-03-24 16:14:54 +00:00
0a60a0cad4 Let pointwise sharding take arg with largest number of dims in case of ties (#149721)
Before, we would take the first argument with the largest number of shards, regardless of whether it had fewer dims than another arg with the same number of shards but more dimensions. This could lead to fewer sharding options.
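
The tie-breaking rule can be sketched as a lexicographic comparison. This is a minimal illustration under assumed names: `ArgInfo` and `pick_arg_to_follow` are hypothetical and do not mirror the actual DTensor sharding-strategy code.

```python
from dataclasses import dataclass


@dataclass
class ArgInfo:
    name: str
    num_shards: int  # how many mesh dimensions this argument is sharded over
    ndim: int        # number of tensor dimensions


def pick_arg_to_follow(args):
    # Compare by shard count first, then by number of dims to break ties.
    # max() returns the first maximal element, so earlier arguments still
    # win when both values are equal.
    return max(args, key=lambda a: (a.num_shards, a.ndim))


candidates = [ArgInfo("a", 2, 1), ArgInfo("b", 2, 3), ArgInfo("c", 1, 4)]
print(pick_arg_to_follow(candidates).name)
```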

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149721
Approved by: https://github.com/tianyu-l
2025-03-24 15:39:39 +00:00
2c13a07002 [CI] Fix xpu linux test permission issue and add ci docker image pull (#149053)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149053
Approved by: https://github.com/atalman
2025-03-24 15:19:24 +00:00
db9b031b00 Add default XPU toolkit path to CMake (#149270)
# Motivation
Add default XPU runtime path to CMake to mitigate https://github.com/pytorch/pytorch/issues/149075
This ensures proper linking with `libtorch` when a user does not source the Torch XPU toolkit while working on a C++ library or executable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149270
Approved by: https://github.com/dvrogozh, https://github.com/EikanWang, https://github.com/atalman
2025-03-24 14:41:24 +00:00
66b0a0b61a [inductor] support dilation in max_pool2d lowering (#148209)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148209
Approved by: https://github.com/eellison
2025-03-24 13:00:12 +00:00
dfdc28ea67 Update slow tests (#149844)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149844
Approved by: https://github.com/pytorchbot
2025-03-24 12:12:56 +00:00
248487f455 [MPS] nanmedian with dims (#149680)
Third most voted op from #77764

The deleted tests are covered by the regular test_output_match tests and were therefore redundant; they were added in the previous PR, before the nanmedian dim version was implemented.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149680
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-24 03:49:16 +00:00
d5ce5c9509 Reuse format_size utils (#149383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149383
Approved by: https://github.com/malfet
2025-03-24 03:06:27 +00:00
de3aca3311 [StaticCudaLauncher] Support any number of kernel arguments (#149442)
Fixes #149450

This PR adds fallback support on StaticCudaLauncher for any number of kernel arguments. Above MAX_ARGS, we can do a heap allocation/malloc instead.

For 0 arguments, triton technically invokes undefined behavior by allocating a 0 byte array and passing it to cuLaunchKernel. In reality, cuLaunchKernel never accesses the pointer if the signature of the cubin has no parameters, so we can just pass nullptr directly.

We could technically use `alloca` to stack allocate instead of heap allocate, though in my tests it didn't seem to affect runtime performance on benchmarks particularly impressively, and alloca has portability issues, so I'd rather just stick with something simpler for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149442
Approved by: https://github.com/jansel
2025-03-23 22:43:47 +00:00
2dccd70ef0 [ONNX] Clean up legacy dynamo export code (#149745)
Clean up code that is unused and obsolete. The public `torch.onnx.dynamo_export` is kept for now but the legacy implementation is removed.

Remove public option classes and OnnxRegistry that have been deprecated.

Users: use torch.onnx.export(…, dynamo=True).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149745
Approved by: https://github.com/titaiwangms, https://github.com/cyyever
2025-03-23 19:35:16 +00:00
8bece88655 [BE] Eliminate TODO for 2022 (#149557)
Need to think a bit more about what types.h includes

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149557
Approved by: https://github.com/albanD
2025-03-23 05:35:54 +00:00
c201d4dbea elif is not a cmake keyword (#149655)
Test for pocketfft_header not in its place is wrong
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149655
Approved by: https://github.com/Skylion007
2025-03-23 03:28:53 +00:00
85027ef74a Super tiny fix typo (#149109)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149109
Approved by: https://github.com/malfet
2025-03-23 03:02:53 +00:00
fe954cdcbf Use correct boxed_forward_device_index when running CompiledFxGraph.post_compile (#148130)
This PR threads through the correct boxed_forward_device_index from graph_kwargs to CompiledFXGraph.post_compile. This allows us to correctly update BoxedDeviceIndex from cache hits.

We don't actually need to save `boxed_forward_device_index` in CompiledFXGraph because its value is in the cache key, so it always matches to the ambient one anyway. On forward with cudagraphs enabled, derive `boxed_forward_device_index`'s value from `device_idxs`.

Testing:

```
python benchmarks/dynamo/cachebench.py --mode training --benchmark torchbench --model BERT_pytorch --device cuda --repeat 1 --dynamic --output="dynamic.json"
```

Now cache hits properly on FXGraphCache. AOTAutogradCache has a guard failure. Will look into that as a followup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148130
Approved by: https://github.com/eellison
2025-03-23 02:57:58 +00:00
539db4af4b load_inline no_implicit_headers mode (#149480)
In the kernelBot leaderboard we support people competing with custom cuda extensions via `load_inline()`, however even on toy kernels this can result in cold starts of up to 90s - this feature is primarily responsible for us having to double our timeout values

I performed an investigation here https://github.com/msaroufim/load_inline_slow and the primary cause was that torch/extension.h and torch/types.h add in about 5,000 header files https://github.com/msaroufim/load_inline_slow/blob/main/header-analysis

So we introduce a mode `no_implicit_headers` which forces users to be explicit about exactly what they want to add. There's a proper test meant to be used in a CLI and a pytest test that's not terribly helpful

Then there's still an open question around what's the most minimal example implementation we can provide. For the baseline kernel we're showing here, it takes about 1 min to compile
1. There's using TensorBase.h (finicky to get right but can get compilation times down to 7s)
2. Just using Tensor.h (down to 15s)
3. Using Shim.h (did not try yet since the syntax is verbose relative to cuda)

This is my take so far https://gist.github.com/msaroufim/079a8d08ffebd0f91a1c2247eb0ce9e0 for a minimal implementation at 15s but @malfet has a simpler one at only 5s

There's more things I'd like to try moving forward like nvrtc and fancier compilation flags. Typical advice around using precompiled headers does not apply to us because we are mostly interested in cold starts where we tear down the machine after running a kernel

Also, in a future PR I'd like to fix some issues I've noticed with load_inline:
1. It needs a force recompilation mode, I was using this quite a bit myself
2. The cache does not take into account changes in environment so the best way to force a recompilation is to change some string in the file
3. Instead of relying on pybind, can we use TORCH_LIBRARY instead

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149480
Approved by: https://github.com/malfet
2025-03-22 19:21:29 +00:00
cyy
9367f8f6f1 Remove outdated instructions from CI scripts (#149795)
Some instructions about Python 3.8 and CUDA 11.3 are removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149795
Approved by: https://github.com/malfet
2025-03-22 18:37:07 +00:00
2b848ab192 [MPS/inductor] Add support for modified_scaled_bessel_k{0,1} (#149794)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149794
Approved by: https://github.com/malfet
2025-03-22 15:41:40 +00:00
6bbe8dbd63 [dynamo][hooks] config to wrap the top frame in a wrapper (#149758)
This should be done by default but there are too many issues. This PR is a
workaround.

https://github.com/pytorch/pytorch/issues/117584

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149758
Approved by: https://github.com/yf225
ghstack dependencies: #149712
2025-03-22 07:17:01 +00:00
621c801f78 fix dynamic float when dynamic=True (#149564)
Fixes https://github.com/pytorch/pytorch/issues/149406#issuecomment-2738111733. Basically previously we would only make floats dynamic via automatic dynamic, now if you set dynamic=True, we will make the floats dynamic on the first compile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149564
Approved by: https://github.com/laithsakka
2025-03-22 05:58:59 +00:00
eqy
8f7fbe3d7d [cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)
As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels.

This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:

+ caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`)
+ "free" workspace size bump for `cuBLASLt`: `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default, which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs, see also #120925
+ fixes broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work, here the cached/reused `cuBLAS` workspace seems to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130
Approved by: https://github.com/ngimel
2025-03-22 05:50:11 +00:00
51fa8fb0ff [executorch hash update] update the pinned executorch hash (#149585)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149585
Approved by: https://github.com/pytorchbot
2025-03-22 05:14:19 +00:00
01b1d1f91b [ROCm][TunableOp] Fix offline tuning for ScaledGEMM. (#149677)
The main purpose of this PR is to fix offline tuning for ScaledGEMM. The previous UT passed because it was not strict enough. Additionally:
- All the offline tuning tests now do a comparison with the online results to ensure that ParamSignature match.
- We raise an error if submatrices are encountered as this is only supported in online tuning mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149677
Approved by: https://github.com/jeffdaily
2025-03-22 02:22:13 +00:00
b9a5e1d038 [MPS] Add support for scaled_modified_bessel_k1 to eager. (#149783)
Another day another op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149783
Approved by: https://github.com/malfet
2025-03-22 02:13:41 +00:00
021b3e23ec Fix is_nonzero for more than one elem tensors (#149637)
Differential Revision: [D71560442](https://our.internmc.facebook.com/intern/diff/D71560442)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149637
Approved by: https://github.com/pianpwk
2025-03-22 02:08:28 +00:00
9d02b3993f [PT2] Port use_triton_lce to PT2 pre_grad passes (#149702)
Summary:
`use_triton_lce_replace_simple_LCE` and `use_triton_lce_replace_normal_LCE`

code is mostly the same, some minor changes to support aten IR

Test Plan:
```
scripts/aetk/aetk -L
%run ~/fbsource/fbcode/caffe2/test/inductor/fb/test_customized_triton_kernel_passes.py
```

will verify the qps after everything done in the stack

Reviewed By: frank-wei

Differential Revision: D68909857

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149702
Approved by: https://github.com/frank-wei
2025-03-22 00:36:58 +00:00
c73a526599 Extract reusable portions of elu_kernel into header (#149673)
Similar to #140425, we are making the implementation usable via header-only code sharing.

Review note: #62546 by @yanbing-j removed expm1 usage from this path. I don't know why and expm1 should be more efficient, so I've put it back. Please let me know if there is a good reason I shouldn't.

Testing: existing correctness tests should cover.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149673
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-03-21 23:54:26 +00:00
b238e36fd9 Revert "[BE][Ez]: Update CU126 to CUDNN 12.8 too (#149254)"
This reverts commit b0a5d55c584792a504ec18600180e3d1200dfea6.

Reverted https://github.com/pytorch/pytorch/pull/149254 on behalf of https://github.com/izaitsevfb due to seems to be causing multiple test failures ([comment](https://github.com/pytorch/pytorch/pull/149254#issuecomment-2744686862))
2025-03-21 23:44:09 +00:00
27370998b2 [MPS][BE] Move polar/complex to stubs (#149752)
No need to have an in-place MPS kernel, as it is just a copy-n-paste of code
from TensorFactories.cpp into Binarykernel.mm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149752
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #149727, #149728, #149729
2025-03-21 22:36:05 +00:00
d320af0663 [dynamo] Ensure placeholder name is not an intermediate node name (#149712)
Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1615671879071017/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149712
Approved by: https://github.com/zou3519
2025-03-21 22:24:45 +00:00
7f836b747f partitioner: ensure collectives saved by SAC that are actually unused in the bw are properly not saved (#149652)
This PR fixes one of the issues described here: https://github.com/pytorch/torchtitan/issues/866#issuecomment-2726015248

I spent some time trying to write a unit test and ultimately failed. If folks are interested I can spend more time trying to, but otherwise I have an E2E test with torchtitan. command:
```
CUDA_VISIBLE_DEVICES=1,2,3,4 NGPU=4 CONFIG_FILE="./torchtitan/models/llama/train_configs/llama3_8b.toml" tlp ./run_train.sh --training.steps=30  --training.tensor_parallel_degree=2 --training.compile --experimental.enable_async_tensor_parallel
```

here's the backward graph generated prior to the PR: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/f7d17388-42c2-4d7e-8a55-a00387341ecb/custom/rank_0/-_0_0_0/aot_backward_graph_9.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

and new backward graph with the PR: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/ab8576fc-98c1-4915-af47-699aa8e2557e/custom/rank_0/-_0_0_0/aot_backward_graph_9.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

The main difference is that the input arg `reduce_scatter_tensor_1` is dead code in the bw graph, causing us to unnecessarily save a giant `reduce_scatter` for bw. With the PR, we properly ensure that it is not saved for backward.

More comments in the PR, but the main thing going on is that:

(1) We have some existing logic that checks for activations that are actually dead code in the backward, and removes them

(2) collectives are not properly handled by this code. Why? Collectives are **always** followed by a `wait_tensor()` call. So we need to go one node further and check if the "dead" code has a wait_tensor user that is also dead
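
The "go one node further" check can be sketched on a toy graph. This is a simplified illustration, not the partitioner code: `Node` is a hypothetical stand-in for an FX node, and `is_dead_in_backward` is an assumed name.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    op: str
    users: list = field(default_factory=list)


def is_dead_in_backward(node):
    """A node is dead if nothing uses it, or if its only users are
    wait_tensor calls whose results are themselves unused."""
    if not node.users:
        return True
    return all(u.op == "wait_tensor" and not u.users for u in node.users)


# A collective is always followed by wait_tensor, so it always has a user.
rs = Node("reduce_scatter_tensor_1", "reduce_scatter_tensor")
wait = Node("wait_tensor_1", "wait_tensor")
rs.users.append(wait)
print(is_dead_in_backward(rs))  # nothing consumes the wait_tensor result

# Once the waited result is consumed, the collective must be kept.
wait.users.append(Node("mul_1", "mul"))
print(is_dead_in_backward(rs))
```

A check that only looks at direct users would never classify the collective as dead, because the `wait_tensor` node always counts as a user.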

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149652
Approved by: https://github.com/zou3519
ghstack dependencies: #149514
2025-03-21 22:09:19 +00:00
1c6b517e19 DTensor: more generically support CompositeImplicitAutograd ops under inference mode (#149514)
Today, if you run DTensor (or any tensor subclass) under __torch_dispatch__, you will start seeing `CompositeImplicitAutograd` ops show up in the torch_dispatch.

"handling" these ops is trivial: you can just tell them to decompose into their constituent ops. Normally this decomposing happens in autograd, above DTensor, but inference_mode turns autograd off, forcing the subclass to handle the op directly.

It looks like previously we manually added a few CompositeImplicitAutograd entries to DTensor (e.g. linear), but this PR tries to support these ops a bit more generically.

The main difference is that DTensor now needs to check if a given op is `CompositeImplicitAutograd` before attempting to run sharding prop. I ran a quick microbenchmark for the below code with `timeit`, which gave me overhead on the order of ~1us, which is hopefully not too bad for eager mode:

```
        def fast_function():
            return torch._C._dispatch_has_kernel_for_dispatch_key(op_call.name(), torch._C.DispatchKey.CompositeImplicitAutograd)
        import timeit
        time_taken = timeit.timeit(fast_function, number=1000)
        # printed 0.12..., aka 1.2us
        print(f'func={str(op_call)}, time={str(time_taken)}')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149514
Approved by: https://github.com/kwen2501, https://github.com/albanD, https://github.com/wanchaol
2025-03-21 22:09:19 +00:00
d46c16fca6 [FSDP2] warning that reshard_after_forward=1 and True are different (#149750)
People complain about spending time debugging reshard_after_forward=1 when what they actually want is reshard_after_forward=True. Because 1 and True can generally be used interchangeably in programming, add a one-time warning to remind users that they are different here:
* reshard_after_forward=1 means resharding parameters to world size 1, by keeping unsharded parameters from forward to backward
* reshard_after_forward=True means reshard parameters to FSDP mesh
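
The reason the two values are easy to confuse is that `bool` is a subclass of `int` in Python, so `1 == True`. A minimal sketch of telling them apart is shown below; `describe_reshard_after_forward` is a hypothetical helper written for this illustration, not FSDP2 code, and the returned strings only paraphrase the two bullets above.

```python
def describe_reshard_after_forward(value):
    # bool must be checked before int, because isinstance(True, int) is True.
    if isinstance(value, bool):
        return "reshard to the FSDP mesh" if value else "do not reshard"
    if isinstance(value, int):
        return f"reshard to world size {value}"
    raise TypeError("expected bool or int")


print(1 == True)  # equal as values
print(describe_reshard_after_forward(True))
print(describe_reshard_after_forward(1))
```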

from FSDP2 perspective, our docstring is clear about int vs bool https://pytorch.org/docs/main/distributed.fsdp.fully_shard.html

<img width="764" alt="Screenshot 2025-03-21 at 11 02 55 AM" src="https://github.com/user-attachments/assets/6675f7a4-95a0-4421-8dbf-f47e9fdeca26" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149750
Approved by: https://github.com/mori360, https://github.com/msaroufim, https://github.com/wconstab
2025-03-21 22:05:20 +00:00
ff020d32b6 [export] Patch dynamo configs when nonstrict tracing (#149295)
Differential Revision: [D71298929](https://our.internmc.facebook.com/intern/diff/D71298929)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149295
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2025-03-21 21:44:54 +00:00
fb07fe6f36 pretty print graph signature (#149710)
Fixes #141243

Differential Revision: [D71604218](https://our.internmc.facebook.com/intern/diff/D71604218/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149710
Approved by: https://github.com/angelayi
2025-03-21 21:31:58 +00:00
5757aa8773 Cudagraph fix + comment cleanup (#149741)
Cudagraphs is careful to not allow any memory recorded to escape globally without having a reference to the tensor. This is because we may later reclaim that memory for a cudagraph recording and we need to mark the tensor as erroring on access. Very occasionally, a stray tensor will have been allocated locally but not yet cleaned up. In this case, we enter the slow path and try to gc.collect() to deallocate it. From a hard to repro internal use case, this was fixed by an additional `cuda.synchronize()`.

I also snuck in the removal of an outdated comment and a duplicate line.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149741
Approved by: https://github.com/BoyuanFeng, https://github.com/Skylion007
2025-03-21 21:12:36 +00:00
842d51500b Parallelize sort (#149505)
PR #142391 erroneously used `USE_OMP` instead of `USE_OPENMP`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149505
Approved by: https://github.com/fadara01, https://github.com/Skylion007
2025-03-21 20:54:40 +00:00
85f6d61421 [BE] format test/inductor/s429861_repro.py (#148554)
Split from #148186

The diff can be re-generated with the following code in the repo root directory on main branch:

```python
import re
from pathlib import Path

def replace(m: re.Match) -> str:
    s = m.group()
    if '\n' not in s:
        return s
    indent = m.group("indent")
    varnames = s.removesuffix("None").replace("=", "").replace("(", "").replace(")", "").split()
    return "\n".join(
        [
            f"{indent}(",
            *(f"{indent}    {varname}," for varname in varnames),
            f"{indent}) = (None,) * {len(varnames)}",
        ]
    )

file = Path('test/inductor/s429861_repro.py')
content = file.read_text(encoding='utf-8')

new_content = re.sub(
    r"^(?P<indent> *)\w+ *=(\s*(\(\s*\w+\s*\)|\w+)\s*=\s*)+None$",
    replace,
    content,
    flags=re.MULTILINE,
)

file.write_text(new_content, encoding='utf-8')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148554
Approved by: https://github.com/jansel
2025-03-21 20:39:28 +00:00
c5deacc27a Fix subclass access custom op bug (#149698)
Summary: When we call torch.inference_mode, we seem to skip the Autograd key, so the custom op that export uses is not decomposed properly before subclass dispatching starts. We fix this by force-desugaring this op at the Python key.

Test Plan: test

Differential Revision: D71599541

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149698
Approved by: https://github.com/bdhirsh
2025-03-21 19:42:56 +00:00
09aa63ea2c preserve custom meta in placeholders (#149661)
Fixes #147338

Differential Revision: [D71573533](https://our.internmc.facebook.com/intern/diff/D71573533/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149661
Approved by: https://github.com/junpeiz, https://github.com/angelayi
2025-03-21 19:09:38 +00:00
0eb3ac9349 Make sure to write to caches atomically (#149654)
This is an attempt to fix #119698

I was unable to reproduce the original described problem on the latest trunk but the proposed fix makes sense. Instead of adding locks like the original (unlanded) fix I changed a few of the cache writes to be atomic file swaps (write to temp file, rename file) which should have the same effect without blocking reads.
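
A minimal sketch of the "write to temp file, rename file" pattern, assuming a helper named `atomic_write` (a hypothetical name, not the function used in the PR). It relies on `os.replace`, which swaps the file in a single step when source and destination are on the same filesystem, so readers see either the old content or the new content, never a partial write.

```python
import os
import tempfile


def atomic_write(path, data: bytes) -> None:
    """Write `data` to `path` via a temp file plus rename (illustrative helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the destination directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # Swap the finished file into place in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temp file behind on failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Unlike a lock, this does not block concurrent readers; the trade-off is that two concurrent writers simply race, and the last rename wins.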

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149654
Approved by: https://github.com/eellison
2025-03-21 18:59:41 +00:00
46dd226702 Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind (#149529)
Summary:
We need to properly fakify torchbind objects, including the ones in graph module attributes, so the registered fake implementation works properly.

- _fakify_script_objects in `compile_fx`
- Allow fake torchbind objects in `torchbind_constants`

Remove `node.meta["unbacked_bindings"]` for `aot_compile` in `compile_fx`. Otherwise `ShapeProp` will fail when trying to resolve the `unbacked_bindings` of `with_effect` tokens.

Update `sigrid_transforms_test` to use the latest `torch._inductor.aot_compile` API.

Add a test for `Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind` in `e2e_test`.

Test Plan:
```
buck run //caffe2/torch/fb/sparsenn:sigrid_test -- -r test_transform_torch_bind

buck run //sigmoid/inference/test:e2e_test_cpu -- -r SigridTransforms

buck2 run mode/dev-nosan sigmoid/inference/ts_migration:pt2i_readiness_main -- --model_id 545017754 --test_suite ads_all --mode test_preproc

```

Differential Revision: D70013257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149529
Approved by: https://github.com/angelayi
2025-03-21 18:58:28 +00:00
19b763def1 Skip test if torchvision is not available (#149494)
The test unconditionally imports torchvision and fails if it isn't installed.
Skip it in this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149494
Approved by: https://github.com/janeyx99
2025-03-21 18:57:13 +00:00
b0a5d55c58 [BE][Ez]: Update CU126 to CUDNN 12.8 too (#149254)
Have CUDNN have the same version for 12.6 and 12.8 for better performance and consistency. We can't do CU12.1 because it's not supported and CU12.4 isn't updated due to manywheel Linux compatibility reasons and dropping support for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149254
Approved by: https://github.com/jansel, https://github.com/atalman, https://github.com/tinglvv
2025-03-21 18:20:44 +00:00
1b08aaeafe Supporting non-tensor-data write_size in planner write items. (#149699)
Summary:
1\ The current write item structure does not contain the amount of data that needs to be written.
2\ The planner.item already has a size primitive, 'tensor_storage_size' (https://fburl.com/code/7a0gsmw7), but only for tensors.
3\ Right now, the only way the writer layer can get hold of this property (for non-tensor data) is to
first do a lookup into the actual tensor/bytes,
then calculate the nbytes.
This change introduces a way to capture non-tensor data size within a write-plan item.

Test Plan: Existing UT.

Differential Revision: D71599725

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149699
Approved by: https://github.com/MeetVadakkanchery
2025-03-21 18:09:14 +00:00
f7d1b966c2 [Inductor] Unify the data type propagation between Triton and CPP Backend (#146970)
Fixes #144246

Use `DtypePropagationOpsHandler` for CSE variables of CPP backend. In addition, add static type checking for the generated CPP code similar to the `config.test_configs.runtime_triton_dtype_assert`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146970
Approved by: https://github.com/jgong5, https://github.com/eellison, https://github.com/leslie-fang-intel
2025-03-21 17:52:51 +00:00
99a4fc5a2f Add elu as core ATen (#149684)
Differential Revision: [D71590420](https://our.internmc.facebook.com/intern/diff/D71590420/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149684
Approved by: https://github.com/larryliu0820
2025-03-21 16:56:10 +00:00
fa5f556f88 [CI] enable operator benchmark on CPU (#143733)
This enables the operator benchmark on CPU to track op-level performance. This PR is motivated by https://github.com/pytorch/pytorch/issues/120982, with feasibility investigated in https://github.com/pytorch/pytorch/pull/127216

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143733
Approved by: https://github.com/leslie-fang-intel, https://github.com/atalman, https://github.com/huydhn, https://github.com/malfet

Co-authored-by: diwei sun <diwei.sun@intel.com>
Co-authored-by: chuanqiw <chuanqi.wang@intel.com>
2025-03-21 16:46:03 +00:00
700260f166 [MPS][BE] Get rid of supports_dense flag (#149729)
As all binary ops now support dense
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149729
Approved by: https://github.com/dcci
ghstack dependencies: #149727, #149728
2025-03-21 16:37:03 +00:00
64d22b9fad [MPS][BE] Migrate complex_mul to tensor iterator (#149728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149728
Approved by: https://github.com/dcci
ghstack dependencies: #149727
2025-03-21 16:37:03 +00:00
e35ef61066 [MPS][BE] Migrate torch.complex to binary_functor (#149727)
As it's very similar in nature to `torch.polar`
Though rename kernel from `complex_kernel` to `make_complex`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149727
Approved by: https://github.com/dcci
2025-03-21 16:36:56 +00:00
bdc132d0e1 [MPS] Add support for scaled_modified_bessel_k0 for eager. (#149705)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149705
Approved by: https://github.com/malfet
2025-03-21 16:14:29 +00:00
1eab841185 Add release branch push triggers to inductor-rocm-mi300.yml (#149672)
In similar vein as https://github.com/pytorch/pytorch/pull/149517

When we added the rocm-mi300.yml earlier this year, we had lower capacity and we were just pipecleaning the workflow, so we set the trigger to only respond to pushes to main branch. But now we have more stability as well as capacity, and we would really like to ensure that the release branch is being tested on MI300s as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149672
Approved by: https://github.com/jeffdaily
2025-03-21 16:02:03 +00:00
5d4b5ee315 [MPS] Add inline to function definition. (#149704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149704
Approved by: https://github.com/malfet
2025-03-21 14:53:09 +00:00
d072254eae Extend vec backend with BF16 SVE intrinsics (#143666)
- Following the work in https://github.com/pytorch/pytorch/pull/119571, BF16 SVE intrinsics are added to the Vectorized class, providing ~1.7x speedup on `silu` and `softmax`.
- Added bf16 detection in CMake
- Added a guard for native NEON code to prevent compilation errors

@aditew01 @maajidkhann please have a look

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143666
Approved by: https://github.com/swolchok, https://github.com/aditew01

Co-authored-by: Aditya Tewari <aditya.tewari@arm.com>
2025-03-21 10:55:11 +00:00
68dfd44e50 Do not depend on numpy during the import (#149683)
But a good followup would be to use torch primitives instead of numpy here
Fixes https://github.com/pytorch/pytorch/issues/149681

Test plan: Monkey-patch 2.7.0-rc and run `python -c "import torch;print(torch.compile(lambda x:x.sin() + x.cos())(torch.rand(32)))"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149683
Approved by: https://github.com/seemethere
2025-03-21 08:14:57 +00:00
34743678b9 [Dynamo] Cleanup state management for ctx managers (#149689)
Removes state indirection for ctx managers. This isn't needed anymore since VTs are mutable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149689
Approved by: https://github.com/StrongerXi
2025-03-21 07:18:33 +00:00
cfc08caea9 [ROCm] NLLLoss (torch.nll_loss) Performance Tuning by Dynamically Selecting # of GPU threads (#149548)
Instead of fixing the number of GPU threads to 32 regardless of input size, this PR dynamically selects the number of threads based on the formula: clamp(2^round(log2(dim0/16)), min = 32, max = 1024). The experiments below were done on an MI300 machine for data type float32:
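
The thread-count formula can be written out as a small Python function to check a few values by hand. This is a sketch of the formula as stated above, not the kernel launch code; `nll_loss_num_threads` is a hypothetical name, and Python's `round` uses round-half-to-even, which may differ from the rounding used in the actual implementation at exact .5 values.

```python
import math


def nll_loss_num_threads(dim0: int) -> int:
    """clamp(2^round(log2(dim0/16)), min=32, max=1024), per the formula above."""
    if dim0 <= 0:
        # log2 is undefined here; fall back to the minimum (an assumption).
        return 32
    exponent = round(math.log2(dim0 / 16))
    # Use a float power so negative exponents are handled, then clamp.
    return int(max(32, min(1024, 2.0 ** exponent)))
```

For example, `dim0 = 4096` gives `2^round(log2(256)) = 256` threads, small inputs stay at the previous fixed value of 32, and very large inputs cap at 1024.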

![nll_loss_threads_bests](https://github.com/user-attachments/assets/3be3d465-e3db-44ed-991a-fdfcab03baae)
![nll_loss_heauristic](https://github.com/user-attachments/assets/e82b9788-9b4d-4862-a180-8df7ad298182)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149548
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-03-21 07:16:37 +00:00
0ed34210b2 [MPS] Add support for modified_bessel_k1 to eager and inductor. (#149687)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149687
Approved by: https://github.com/malfet
2025-03-21 04:59:06 +00:00
0a396a8160 [Docs] Make torch.Library's kind have no default value to be consistent with the code (#149390)
Fixes #149389

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149390
Approved by: https://github.com/janeyx99
2025-03-21 04:42:10 +00:00
4ea580568a update aotinductor doc for XPU support (#149299)
As title. Since AOTInductor works on Intel GPUs starting from 2.7, add the related content to its doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149299
Approved by: https://github.com/guangyey, https://github.com/desertfire
2025-03-21 04:40:31 +00:00
ccd5d811e8 [aoti] follow up to use new api in test_provenance_tracing.py (#149387)
Summary:
As title. Follow-up to D71181284, plus some minor refactoring.

Context : D69609685 (update test runner to use new api) / https://github.com/pytorch/pytorch/pull/147105

Test Plan:
```
buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_triton_kernel_to_post_grad_tracing_cpu
```

Differential Revision: D71375725

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149387
Approved by: https://github.com/yushangdi
2025-03-21 04:37:50 +00:00
5327894812 [BE] Introduce lapack_work_to_int function (#149682)
That could be used to safely cast floating values to int by adding an ULP, which is a followup after https://github.com/pytorch/pytorch/pull/146456
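
A rough Python sketch of the "add an ULP, then convert" idea follows. It is an illustration under assumptions, not a port of the C++ helper: `work_to_int` and `to_float32` are hypothetical names, and because Python floats are doubles, the ULP used here is a double-precision ULP. The first helper shows why the bump is useful: a size stored as float32 can round down to a smaller integer than the one intended.

```python
import math
import struct


def to_float32(x: float) -> float:
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def work_to_int(work: float) -> int:
    """Step one ULP toward +inf, round up, and never return less than 1."""
    bumped = math.nextafter(work, math.inf)  # requires Python 3.9+
    return max(1, math.ceil(bumped))
```

`to_float32(16777217.0)` comes back as `16777216.0`, one element short of the requested size; `work_to_int(16777216.0)` returns `16777217`, so a buffer sized from the bumped value is not too small.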

Fixes https://github.com/pytorch/pytorch/issues/149591

(Not adding unittest as it's just going to be too slow)
Test plan:
```
% python3 -c "import torch; torch.pinverse(torch.rand(50000, 8193))"
```

Before the change errored out with
```
RuntimeError: false INTERNAL ASSERT FAILED at "pytorch/pytorch/aten/src/ATen/native/BatchLinearAlgebra.cpp":1605, please report a bug to PyTorch. linalg.svd: Argument 12 has illegal value. Most certainly there is a bug in the implementation calling the backend library.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149682
Approved by: https://github.com/wdvr
2025-03-21 04:08:07 +00:00
bf6621d08f [Distributed] Add repr methods for ParallelStyles (#149478)
Fixes #149470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149478
Approved by: https://github.com/wanchaol
2025-03-21 03:59:25 +00:00
ee6a029165 [XPU] Update triton commit to fix level_zero not found by env var LEVEL_ZERO_V1_SDK_PATH. (#149511)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149511
Approved by: https://github.com/EikanWang
2025-03-21 03:56:00 +00:00
732f9d7435 Optimize torch.equal description (#149618)
Fixes #149222

## Test Result

![image](https://github.com/user-attachments/assets/559a376f-2dd0-4474-bbd5-9299d9df51e3)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149618
Approved by: https://github.com/zou3519
2025-03-21 03:44:49 +00:00
64bd889660 [Inductor][CPP] rename shim_mkldnn.h/.cpp to shim_cpu.h/.cpp (#149372)
**Summary**
Previous discussion is here: https://github.com/pytorch/pytorch/pull/148907#issuecomment-2712795600
Rename these files because
- they may hold mkldnn-unrelated code for CPU
- filenames are aligned with files for CUDA and XPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149372
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
2025-03-21 03:42:12 +00:00
a39bf846f5 [ONNX] Add draft_export as a strategy (#147529)
Create draft_export strategy.

The strategy is added before jit and after strict=True, as the third fallback. Since it is specializing tensors it should not be less robust than the jit trace strategy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147529
Approved by: https://github.com/titaiwangms
2025-03-21 03:05:17 +00:00
0692301e25 Catch OSError in general when writing files (#149464)
Redundant exception types in `except (PermissionError, OSError):`.  Write `except OSError:`, which catches exactly the same exceptions.
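
The redundancy follows from the built-in exception hierarchy: `PermissionError` is a subclass of `OSError`, as are `FileNotFoundError` and the read-only-filesystem error shown in the log below. A small sketch, with `try_write` as a hypothetical helper name:

```python
# PermissionError is already covered by OSError.
assert issubclass(PermissionError, OSError)
assert issubclass(FileNotFoundError, OSError)


def try_write(path, text: str) -> bool:
    """Return True if the file was written, False on any OS-level failure."""
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        return True
    except OSError:
        # Catches PermissionError, FileNotFoundError, and errors such as
        # "[Errno 30] Read-only file system", which is raised as a plain OSError.
        return False
```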

https://github.com/pytorch/pytorch/actions/runs/13935844871/job/39141062991

When hipifying files or writing cProfile files, catching PermissionError is not enough if the file is located somewhere that is not writable at all, or if other OS errors occur while writing.

This fix makes the code more robust.

Example error log:
```log
  File "deepspeed/ops/adam/fused_adam.py", line 94, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "deepspeed/ops/op_builder/builder.py", line 540, in load
    return self.jit_load(verbose)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "deepspeed/ops/op_builder/builder.py", line 587, in jit_load
    op_module = load(name=self.name,
                ^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/cpp_extension.py", line 1597, in load
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "torch/utils/cpp_extension.py", line 2031, in _jit_compile
    hipify_result = hipify_python.hipify(
                    ^^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/hipify/hipify_python.py", line 1167, in hipify
    preprocess_file_and_save_result(output_directory, filepath, all_files, header_include_dirs,
  File "torch/utils/hipify/hipify_python.py", line 213, in preprocess_file_and_save_result
    result = preprocessor(output_directory, filepath, all_files, header_include_dirs, stats,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/hipify/hipify_python.py", line 940, in preprocessor
    output_source = RE_QUOTE_HEADER.sub(mk_repl('#include "{0}"', True), output_source)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/hipify/hipify_python.py", line 919, in repl
    preprocess_file_and_save_result(output_directory,
  File "torch/utils/hipify/hipify_python.py", line 213, in preprocess_file_and_save_result
    result = preprocessor(output_directory, filepath, all_files, header_include_dirs, stats,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/hipify/hipify_python.py", line 986, in preprocessor
    with clean_ctx.open(fout_path, 'w', encoding='utf-8') as fout:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/hipify/hipify_python.py", line 123, in open
    return open(fn, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 30] Read-only file system: 'deepspeed/ops/csrc/adam/multi_tensor_apply_hip.cuh'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149464
Approved by: https://github.com/janeyx99
2025-03-21 02:42:50 +00:00
362b40939d [ONNX] Improve docstring of onnx symbolic ops (#149668)
Better examples
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149668
Approved by: https://github.com/titaiwangms
2025-03-21 01:57:39 +00:00
66dd00fca0 Fix clang-tidy errors (#149581)
Summary: Cleanup clang-tidy complaints in `EmbeddingBag.cpp`: Avoid shadowed variables and unused parameters.

Test Plan: sandcastle

Differential Revision: D71512594

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149581
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-03-21 01:53:57 +00:00
e481615bc7 [aot] always lower the backward with a deepcopy (#149229)
FIXES https://github.com/pytorch/pytorch/issues/149105

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149229
Approved by: https://github.com/bdhirsh
2025-03-21 01:47:13 +00:00
5ebc283f2c [PT2] Port use_triton_dot_compress to PT2 pre_grad passes (#148517)
Summary: add use_triton_dot_compress in pre_grad

Test Plan:
```
scripts/aetk/aetk -L

%run ~/fbsource/fbcode/caffe2/test/inductor/fb/test_customized_triton_kernel_passes.py
```

Reviewed By: frank-wei

Differential Revision: D68909838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148517
Approved by: https://github.com/frank-wei
2025-03-21 01:42:32 +00:00
c2ada9d77b [easy] Do not logspam if static cuda launcher is disabled (#149669)
No need to log.info every time someone runs with StaticCudaLauncher disabled.

Test plan: Run any benchmark and see that we don't spam the bypass message in logs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149669
Approved by: https://github.com/oulgen, https://github.com/jansel
ghstack dependencies: #148890
2025-03-21 01:22:26 +00:00
1099c37150 ci: Add sccache to manylinux images (#148419)
Adds sccache to our manylinux images. These are purposefully built
without the sccache-dist binary since we're not expecting to use that.

Another caveat of these builds is that they are built with the vendored
version of openssl.

This is to set the stage for us to be able to build binaries
sequentially.

Signed-off-by: Eli Uriegas <github@terriblecode.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148419
Approved by: https://github.com/atalman
2025-03-21 01:15:34 +00:00
2975664fb0 add python root bin to windows load path. (#146573)
This PR extends the DLL load list with the Python root bin path.
It makes PyTorch more robust and compatible with more dependency libraries, such as `intel-pti`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146573
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-03-21 00:48:43 +00:00
90543e90a0 Fix broken dynamo_timed test due to python_version field (#149659)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149659
Approved by: https://github.com/ppanchalia
2025-03-21 00:27:28 +00:00
f47aa08130 [export] Support python assertion with symints. (#149444)
Summary: This diff ports a technique from torch.fx symbolic tracing to trace through Python asserts when we run into data-dependent symbolic shape assertions, so that we achieve the same effect as torch dynamo and automatically turn asserts into torch.check() calls.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_python_asserts_with_sym_int
Differential Revision: D71425360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149444
Approved by: https://github.com/tugsbayasgalan
2025-03-20 23:07:45 +00:00
bf34e228c5 [export] Beef up guard_added logs (#149465)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149465
Approved by: https://github.com/pianpwk
2025-03-20 23:02:07 +00:00
1d3c50fcc5 [Dynamo] Support the torch._C.DisableTorchFunction ctx manager (#149491)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149491
Approved by: https://github.com/StrongerXi
ghstack dependencies: #149489, #149490
2025-03-20 22:19:55 +00:00
ce5adc5c05 [Dynamo] add support for torch._C._is_torch_function_all_disabled (#149490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149490
Approved by: https://github.com/StrongerXi
ghstack dependencies: #149489
2025-03-20 22:19:55 +00:00
f64c361860 [Dynamo] Refactor DisableTorchFunction ctx manager (#149489)
Refactors the DisableTorchFunction ctx manager to properly model the eager code (no args to the context manager).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149489
Approved by: https://github.com/StrongerXi
2025-03-20 22:19:55 +00:00
a268c29b9f [distributed] fix: use group rank instead of global rank when possible (#149488)
Fixes #149200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149488
Approved by: https://github.com/wconstab
2025-03-20 21:47:03 +00:00
b07b819912 [inductor] Add a helper for convert index_dtype to torch dtype (#149531)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149531
Approved by: https://github.com/eellison
2025-03-20 21:33:29 +00:00
a703107f7b [AOTInductor] Fix skip cpp wrapper unit test (#149606)
Summary: as title

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test -- --exact 'deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test - test_cpu_lower_aoti_ep_called (deeplearning.aot_inductor.cpu.test.test_lowering_utils.CPULoweringTest)'
```
```
buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cudagraph_trees_expandable_segments -- --exact 'caffe2/test/inductor:cudagraph_trees_expandable_segments - test_skip_cpp_wrapper (caffe2.test.inductor.test_cudagraph_trees.CudaGraphTreeTests)'
```

https://www.internalfb.com/phabricator/paste/view/P1758059197

Reviewed By: henryoier

Differential Revision: D71528281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149606
Approved by: https://github.com/desertfire
2025-03-20 20:55:33 +00:00
406d464d97 Add is_batchedtensor to dynamo builder (#149541)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149541
Approved by: https://github.com/zou3519
2025-03-20 20:46:15 +00:00
f17ae3f7b7 [Inductor Cutlass backend] Fix imports and compilation of Cutlass SM100 Kernels (#149515)
Summary: Fixes the import and compilation of Cutlass SM100 Kernels.

Test Plan: Cutlass backend unit tests, running benchmarks/inductor_backends/cutlass.py

Differential Revision: D71196747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149515
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
2025-03-20 20:35:18 +00:00
24176f6e32 Revert "[cond] don't trace fw and bw graph in autograd key (#148930)"
This reverts commit 6e843a51dd5743b864fc28601ef06cdc18488b3e.

Reverted https://github.com/pytorch/pytorch/pull/148930 on behalf of https://github.com/ydwu4 due to Test failure is legit ([comment](https://github.com/pytorch/pytorch/pull/148930#issuecomment-2741585315))
2025-03-20 20:28:29 +00:00
4a4a71a73c [inductor]lowering scan to while_loop (#148580)
This PR adds a pass in post_grad that lowers scan to while_loop. See the comment before the pass for how this is implemented.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148580
Approved by: https://github.com/jansel, https://github.com/eellison
2025-03-20 20:21:02 +00:00
6e843a51dd [cond] don't trace fw and bw graph in autograd key (#148930)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148930
Approved by: https://github.com/zou3519
2025-03-20 20:18:29 +00:00
18435945af Set __context__/__cause__ when generator raise StopIteration (#148765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148765
Approved by: https://github.com/zou3519
ghstack dependencies: #146505
2025-03-20 19:59:30 +00:00
44e6464914 Allow setting attribute to NestedUserFunctionVariable (#146505)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146505
Approved by: https://github.com/zou3519
2025-03-20 19:59:30 +00:00
aae4c0729e Fix broken build within xplat/caffe2 (#149403)
Summary:
Following a pull from open source, the build within xplat is broken
due to not finding <autograd/function.h>.

Within the python_function.cpp there seems to be a convention of using the
torch/csrc prefix.

This change includes that prefix to enable the build to proceed.

Test Plan:
Build a binary using torch.

https://www.internalfb.com/buck2/83122485-d3c3-43f4-97b4-81bb90450b3b

Unit tests run too

https://www.internalfb.com/intern/testinfra/testrun/13229323975828416

Further testing in CI and elsewise expected.

Reviewed By: malfet

Differential Revision: D70331539

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149403
Approved by: https://github.com/izaitsevfb

Co-authored-by: Dominic Binks <dbinks@meta.com>
2025-03-20 19:27:55 +00:00
ffa085334c Specify the default PyTorch Distributed backend for MPS (#149538)
Fixes #149537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149538
Approved by: https://github.com/d4l3k, https://github.com/malfet
2025-03-20 18:54:03 +00:00
1d221724fc fix missing field initializer warning (#149597)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149597
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2025-03-20 18:48:05 +00:00
6285a71aba [dynamo] fix bug where non-recursive disable modifies the original function (#148896)
Fixes https://github.com/pytorch/pytorch/issues/148787.

We fix this by:
- Wrapping the original function instead of directly modifying it
- When we detect that the previous frame is the non-recursive disable wrapper, then skip tracing this frame (non-recursive disable wrapper will always be skipped, so that frame will be present in the traceback)
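
The first point, wrapping instead of mutating, can be illustrated in plain Python. This sketch does not model Dynamo's frame skipping; `disable_nonrecursive` and the `_is_disable_wrapper` marker are hypothetical names chosen for the example.

```python
import functools


def disable_nonrecursive(fn):
    """Return a marked wrapper around `fn` without modifying `fn` itself."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)

    # The marker lives on the wrapper only; the original stays untouched.
    wrapper._is_disable_wrapper = True
    return wrapper


def add_one(x):
    return x + 1


wrapped = disable_nonrecursive(add_one)
```

After this, `wrapped` carries the marker and still exposes the original via `wrapped.__wrapped__`, while `add_one` itself has no marker and behaves exactly as before.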

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148896
Approved by: https://github.com/jansel
2025-03-20 18:33:54 +00:00
88a26dbb9d [BE] simplify test_cpp_extensions_aot and .gitignore (#149231)
It is shady to clean up an install mid-test. So don't do that anymore and use .gitignore instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149231
Approved by: https://github.com/albanD, https://github.com/msaroufim
2025-03-20 18:17:19 +00:00
b99fc9d29f [MTIA] Support loading Tensors on mtia:0 for pytorch code (#149327)
Summary: The diff includes updates to the PyTorch code to enable loading tensors to MTIA.

Reviewed By: PatriceVignola

Differential Revision: D71176848

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149327
Approved by: https://github.com/ezyang
2025-03-20 18:05:15 +00:00
7bb9c36784 Hook StaticCudaLauncher up to torch.compile (cold start) (#148890)
This hooks up the previous PR to torch.compile. Will add a config flag to hide this behind in a bit, but for now it's useful for testing purposes to have it on by default.

Inductor will automatically choose to use StaticCudaLauncher to launch triton kernels if:
- The kernel is a cuda kernel and inductor can find a cubin file associated with it
- The kernel takes less than 50 arguments
- The kernel doesn't use any special features (launch hooks, large amounts of shared memory)
- The kernel is not user defined (to be supported in a later PR)

We split CompileResult into TritonCompileResult and StaticTritonCompileResult, but have them share implementations of how they exec a python launcher. StaticTritonCompileResult's python launcher has the benefit of a simpler def_args/call_args setup, since it always filters out all constexprs before running, no matter the triton version.

Some key features of StaticTritonCompileResult:
- It is fully serializable
- It stores the minimum amount of stuff, so that later it can be cached easily
- It does not depend on any triton specific types (though it does have various triton metadata).

For now, both TritonCompileResult and StaticTritonCompileResult still `exec` custom python launchers, and use GridExpr. We can change that in the future to simplify if we'd like. For now though, this custom python codegen is good for flexibility when it comes to supporting removal of constexprs, so using it for static launching is nice to not have to pay the cost of removing constexprs at kernel runtime.

Hooking everything up to torch.compile lets me run every unit test with StaticCudaLauncher to make sure that we still pass (even if we bypass StaticCudaLauncher itself). It also lets me check for compilation/runtime performance with these changes.

Fixes #149448

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148890
Approved by: https://github.com/jansel
2025-03-20 17:32:20 +00:00
c99efc08fb [ROCm] skip test_RNN_dropout_state (#149446)
PR to skip test_nn.py::TestNN::test_RNN_dropout_state
Currently ROCm doesn't support dropout value for RNN

The PR to enable RNN dropout on ROCm is still in review and blocked: pytorch/pytorch#144572

Fixes: https://github.com/pytorch/pytorch/issues/68849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149446
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-03-20 17:22:39 +00:00
1d9401befc ci: Remove mentions and usages of DESIRED_DEVTOOLSET and cxx11 (#149443)
This is a remnant of our migration to manylinux2_28. We should remove
these since all of our binary builds are now built with cxx11_abi

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149443
Approved by: https://github.com/izaitsevfb, https://github.com/atalman
2025-03-20 16:49:46 +00:00
6237495fcf torch.Size input (#149414)
Summary: Support for `torch.Size` inputs was patchy before because `unflatten_fn` for this type returned a tuple. This PR cleans this up.

Fixes #149158

Test Plan: added test

Differential Revision: D71403635

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149414
Approved by: https://github.com/yushangdi
2025-03-20 16:23:13 +00:00
2c4bc65366 [aotd] Guess tangents stride as output strides (#144579)
AOTDispatch, when preparing the AOT backward graph, does not know the real tangents that the user will specify when running backward.

AOTD guesses the tangents. Previously, we guessed that the memory format of each tangent would match the memory format of the corresponding output. If the tangents specified at runtime did not have the memory format guessed during compilation, AOTD coerced (copied) them to the guessed memory_format.

But as Horace found, there are popular use cases where the outputs of the compiled region are in a specific memory_format, e.g. a 4D tensor with dims 1 and 2 transposed.

https://github.com/karpathy/nanoGPT/blob/master/model.py#L57

This PR changes the logic so that AOTD expects tangents to have the same "strideness" as the outputs. As a result, it avoids coercion in the transposed-dims case.
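
To make the "strideness" point concrete, here is a minimal pure-Python sketch of how transposing dims 1 and 2 of a 4D tensor changes its strides. The shape and the helper names (`contiguous_strides`, `transpose`) are assumptions for illustration only; this is not AOTD's code.

```python
def contiguous_strides(shape):
    """Row-major (contiguous) strides for a given shape."""
    strides = []
    running = 1
    for size in reversed(shape):
        strides.append(running)
        running *= size
    return tuple(reversed(strides))


def transpose(shape, strides, dim0, dim1):
    """Swap two dims the way a view-only transpose does: no data is moved."""
    shape, strides = list(shape), list(strides)
    shape[dim0], shape[dim1] = shape[dim1], shape[dim0]
    strides[dim0], strides[dim1] = strides[dim1], strides[dim0]
    return tuple(shape), tuple(strides)


shape = (2, 8, 4, 16)  # hypothetical 4D output shape
out_shape, out_strides = transpose(shape, contiguous_strides(shape), 1, 2)

# A guess of "contiguous" for the tangent would not match the output strides,
# so a tangent laid out like the output would have to be copied.
guess_mismatch = out_strides != contiguous_strides(out_shape)
print(out_shape, out_strides, guess_mismatch)
```

If the guessed tangent layout instead follows the output strides, a tangent laid out like the output needs no copy.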

Limitations:
We keep guessing memory_format for:
1/ Dynamic shapes (needs more changes)
2/ Tensor subclasses (needs more changes)

Other changes:
test_torchinductor was always creating contiguous tangents via `torch.randn()`. This PR changes them to use `torch.randn_like()` so that computation is compared with the same strideness.

(E.g. for cuda float16 strideness affects numerics for fft ops).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144579
Approved by: https://github.com/bdhirsh
2025-03-20 15:41:36 +00:00
9b1127437e Add triton as dependency to CUDA aarch64 build (#149584)
Aarch64 Triton build was added by: https://github.com/pytorch/pytorch/pull/148705
Hence add the proper constraint to the CUDA 12.8 Aarch64 build.

Please note we want to still use:
```platform_system == 'Linux' and platform_machine == 'x86_64'```
For all other builds.

These are prototype binaries used only by the CUDA 12.8 Linux aarch64 build, which we would like to serve from download.pytorch.org.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149584
Approved by: https://github.com/nWEIdia, https://github.com/tinglvv, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-20 15:39:45 +00:00
80dfce2cc3 [export] Handle non OpNamespace type during decomposition. (#149431)
Summary:
Turns out we can have non-OpNamespace objects in torch.ops._dir.

We should just throw away those during iteration.

Test Plan: eyes

Differential Revision: D71417992

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149431
Approved by: https://github.com/tugsbayasgalan
2025-03-20 15:36:15 +00:00
d67c1a027e [Intel GPU][PT2E] bugfix: use zero-point to decide conv src zp mask (#149473)
# Motivation
This PR fixes a bug in how the zero-point mask setting is decided. Specifically, the zero-point was always deemed non-zero because the scale was used for the check. Fortunately, the bug only affects performance; accuracy is not affected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149473
Approved by: https://github.com/EikanWang, https://github.com/guangyey
2025-03-20 14:46:07 +00:00
496bbf38be add grad_output shape check for adaptive_avg_pool2d_backward (#145241)
Fix https://github.com/pytorch/pytorch/issues/145070.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145241
Approved by: https://github.com/malfet, https://github.com/eqy
2025-03-20 14:10:31 +00:00
00a2c68f67 Fix a typo "trochrec" to "torchrec" (#149542)
Summary: As titled, the path is incorrect due to the typo

Test Plan: CI

Differential Revision: D71490709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149542
Approved by: https://github.com/williamwen42
2025-03-20 10:14:23 +00:00
a66a9581da [dynamo] support Python 3.13t (#149549)
A few bug fixes to get Dynamo mostly working with 3.13 nogil. Dynamo encounters internal CPython assert errors in older versions of 3.13. The fix has been landed on [CPython's 3.13 branch](https://github.com/python/cpython/tree/3.13) and will be included in 3.13.3 (https://peps.python.org/pep-0719/ - april 8). If you wish to try `torch.compile` on the latest 3.13 branch, you can comment out the error checking (i.e. 70b6cd4e11/torch/__init__.py (L2535) and 70b6cd4e11/torch/_dynamo/eval_frame.py (L899)).

We will work on getting PyTorch CI up for Dynamo/dynamo-wrapped/inductor once 3.13.3 is available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149549
Approved by: https://github.com/jansel
2025-03-20 09:49:27 +00:00
970ac2d907 [Inductor] Improve memory locality by iterating over y dimension before x (#149339)
# Feature

Fixes https://github.com/pytorch/pytorch/issues/148718 by reordering the tensor dims to `(z, y, x)`.

As a bonus refactor, block pointers no longer need the `reorder=True` argument to `self.active_range_trees()`. Since this argument is no longer used anywhere, this PR simply deletes it as opposed to updating the logic for the new iteration order.

# Perf impact

It looks like there's a decent perf bump on A100, with cudagraphs enabled. Granted, perf runs seem to have some noise between commits. ([Workflow run](https://github.com/pytorch/pytorch/actions/runs/13914815576).)

Training (all neutral or positive):
![image](https://github.com/user-attachments/assets/57f1ef1d-60b4-446f-baf3-aca87a26b81b)

Inference (one positive, one very small negative):
![image](https://github.com/user-attachments/assets/679aa057-af23-47f1-8d8e-8520daf1bd92)

As reported in https://github.com/pytorch/pytorch/issues/148718, this PR makes consecutive threads access consecutive memory addresses. This should theoretically give the GPU more opportunities to coalesce loads and stores. From Nvidia's [kernel profiling guide](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html):

> Local memory is private storage for an executing thread and is not visible outside of that thread. It is intended for thread-local data like thread stacks and register spills. Local memory addresses are translated to global virtual addresses by the AGU unit. Local memory has the same latency as global memory. One difference between global and local memory is that local memory is arranged such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address (e.g., same index in an array variable, same member in a structure variable, etc.).

I couldn't find any information on how coalescing works for other kinds of memory, but the guide mentions it is also supported for accesses to the L2 cache.

> The L2 Request Coalescer (LRC) processes incoming requests for L2 and tries to coalesce read requests before forwarding them to the L2 cache. It also serves programmatic multicast requests from the SM and supports compression for writes.

The [answer to this Stack Overflow post](https://stackoverflow.com/a/5044424) also explains coalescing in a straightforward way. Inductor's current iteration order corresponds to the first (uncoalesced) example in that answer, while the order after this PR corresponds to the second (coalesced) example.

Besides GPUs, this order of accessing data is highly advantageous for systems relying on DMAs, as those are designed to access contiguous spans of memory. This change improves the performance of an elementwise add kernel on an internal model, using internal hardware, by 1.76x. I will share the details with reviewers who are Meta employees via a private channel.
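
As a rough illustration of why the iteration order matters, the sketch below lists the memory offsets visited in a contiguous row-major `(z, y, x)` array under two loop nestings. The shape and the `visit_offsets` helper are assumptions for illustration; the mapping to Inductor's actual thread indexing is simplified.

```python
from itertools import product


def visit_offsets(shape, loop_order):
    """Offsets of a contiguous row-major array, visited with `loop_order` nesting.

    `shape` holds the (z, y, x) sizes; `loop_order` lists axis indices from the
    outermost loop to the innermost one.
    """
    strides = [1] * len(shape)
    for axis in range(len(shape) - 2, -1, -1):
        strides[axis] = strides[axis + 1] * shape[axis + 1]
    offsets = []
    for idx in product(*(range(shape[a]) for a in loop_order)):
        offsets.append(sum(i * strides[a] for i, a in zip(idx, loop_order)))
    return offsets


shape = (2, 3, 4)  # (z, y, x), chosen only for illustration
x_fastest = visit_offsets(shape, loop_order=(0, 1, 2))  # x is the innermost loop
y_fastest = visit_offsets(shape, loop_order=(0, 2, 1))  # y is the innermost loop
print(x_fastest[:4])
print(y_fastest[:4])
```

With x innermost, consecutive iterations touch consecutive addresses; with y innermost, each step jumps by a full row.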

# Test plan
 - Updated expected code on CI tests.
 - Added a new test checking the {x,y,z}indices and block pointers on a 3D pointwise kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149339
Approved by: https://github.com/jansel
2025-03-20 08:12:00 +00:00
3647711a89 [AOTI][refactor] Remove dead code (#149287)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149287
Approved by: https://github.com/cyyever, https://github.com/yushangdi
2025-03-20 07:29:27 +00:00
90ef7a9561 Revert "Supporting non-tensor-data write_size in planner write items. (#149434)"
This reverts commit 1442230a267f0ce4f0bb540fca775faa71e7cfd5.

Reverted https://github.com/pytorch/pytorch/pull/149434 on behalf of https://github.com/izaitsevfb due to breaking docs build ([comment](https://github.com/pytorch/pytorch/pull/149434#issuecomment-2739378287))
2025-03-20 06:52:02 +00:00
00333c4548 [Inductor] Set prop_kind to forward_inference when grad is not needed for mkldnn_linear_pointwise and mkldnn_convolution_pointwise (#147072)
Summary:
The `prop_kind` of `mkldnn._linear_pointwise`, `mkldnn._linear_pointwise.binary`, `mkldnn._convolution_pointwise.binary` and `mkldnn._convolution_pointwise_.binary` are always `dnnl_forward`, i.e., `dnnl_forward_training` , regardless of whether `grad` is needed. Setting `prop_kind` to `dnnl_forward_inference` for these ops when `grad` is not needed could have better performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147072
Approved by: https://github.com/leslie-fang-intel, https://github.com/CaoE, https://github.com/jansel
2025-03-20 06:21:31 +00:00
c4d59e6279 [Inductor] Fix combo_kernel logging error (#149575)
Summary:
Fix logging error like:
```
in combinable_nodes
    log.debug(
Message: 'ComboKernels: %d template nodes are filtered'
Arguments: (OrderedSet([8]),)
--- Logging error ---
Traceback (most recent call last):
  File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 1100, in emit
    msg = self.format(record)
  File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 943, in format
    return fmt.format(record)
  File "/data/users/guorachel/fbsource/buck-out/v2/gen/fbcode/854b9ed00d28c5c5/caffe2/torch/fb/model_transform/experimental/benchmark/__mts_gpu_benchmark__/mts_gpu_benchmark#link-tree/torch/_logging/_internal.py", line 818, in format
    record.message = record.getMessage()
  File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 368, in getMessage
    msg = msg % self.args
TypeError: %d format: a real number is required, not OrderedSet
```

Encountered when running a prod model with the combo kernel feature enabled.
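
The failure mode in the traceback can be reproduced with the standard library alone. The sketch below uses a plain `set` as a stand-in for `OrderedSet`, and formatting the count via `len()` is only one plausible fix; the actual change in the PR may differ.

```python
import logging

log = logging.getLogger("combo_kernels_demo")  # hypothetical logger name


def describe_filtered(template_nodes):
    """Format the message with the number of nodes, which `%d` accepts."""
    return "ComboKernels: %d template nodes are filtered" % len(template_nodes)


def buggy_describe(template_nodes):
    """Reproduce the failure: a collection is passed where `%d` needs a number."""
    return "ComboKernels: %d template nodes are filtered" % (template_nodes,)


filtered = {8}  # stand-in for the OrderedSet([8]) shown in the traceback
log.debug("ComboKernels: %d template nodes are filtered", len(filtered))
print(describe_filtered(filtered))
```

Because `logging` defers `%` formatting until a handler emits the record, the mismatch only surfaces when the debug level is enabled, and then as a "Logging error" on stderr rather than an exception in the caller.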

Test Plan: CI

Differential Revision: D71512220

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149575
Approved by: https://github.com/ColinPeppler
2025-03-20 06:09:44 +00:00
595293316d [MPS/Inductor] Add support for modified_bessel_k0. (#149593)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149593
Approved by: https://github.com/jansel
2025-03-20 04:51:44 +00:00
9a184b1074 Monkeypatch fake mode so it errors on invalid custom ops (#149410)
Internal version: [D71294776](https://www.internalfb.com/diff/D71294776)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149410
Approved by: https://github.com/gmagogsfm
2025-03-20 04:50:57 +00:00
fe94d7da1a [Inductor][Optimus] Add move view after cat aten pattern (#149178)
Summary:
Add aten pattern to move the view/reshape out of split cat, further reduce the number of kernels.

context: https://docs.google.com/document/d/1G2qFcQu1K7VXbz2uPe0CS2aBirnwtwI_B8lxmlBlAPQ/edit?tab=t.0

Test Plan:
### how to enable
Add the following patterns to the post grad
```
        post_grad_fusion_options={
            "normalization_aten_pass": {},
            "move_view_after_cat_aten_pass": {},
        },
```

### unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_move_view_after_cat_aten
```

Buck UI: https://www.internalfb.com/buck2/3c5451be-c63a-4794-8d6b-103ecac78905
Test UI: https://www.internalfb.com/intern/testinfra/testrun/6192449704507267

### local reproduce

```
buck2 run mode/opt scripts/shuaiyang:test -- --flow_id 691990503 --use_synthetic_data --optimus
```
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/mengluy/2025-03-13-20-59-34/trace.json.gz&bucket=gpu_traces

### E2E

baseline

f691990503

proposal

Differential Revision: D71177004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149178
Approved by: https://github.com/Yuzhen11
2025-03-20 04:07:25 +00:00
95e71765f2 [MPS] nanmedian implementation (#149407)
Implements nanmedian on MPS. This implementation only implements `torch.nanmedian(tensor)` without `keepdim` and `dim`
Will implement nanmedian with dim and keepdim in a followup
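
For reference, the semantics of the no-`dim` form can be sketched in pure Python. This is not the MPS kernel; it assumes that, for an even number of non-NaN values, the lower of the two middle values is returned, as documented for `torch.median`.

```python
import math


def nanmedian(values):
    """Median of the non-NaN entries; NaN if there are none."""
    kept = sorted(v for v in values if not math.isnan(v))
    if not kept:
        return math.nan
    # For an even count, take the lower of the two middle values.
    return kept[(len(kept) - 1) // 2]


print(nanmedian([3.0, math.nan, 1.0, 2.0]))
print(nanmedian([4.0, 1.0, math.nan, 3.0, 2.0]))
```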

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149407
Approved by: https://github.com/malfet
2025-03-20 03:50:26 +00:00
cca46a0b6f Fix score_mod.py dynamic max autotune (#148991)
python benchmarks/transformer/score_mod.py --dynamic --max-autotune

previously would crash with

```
"/home/bobren/local/a/pytorch/torch/_inductor/select_algorithm.py", line 2306, in key_of
    node.get_device().type,

```

but with this change no longer does

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148991
Approved by: https://github.com/drisspg
2025-03-20 03:28:51 +00:00
bc1b8730a4 [Windows][inductor] fix blank space break windows file path (#149388)
Fixes #149310

From the original error message:
```cmd
Command:
cl /I C:/Program Files/Python310/Include /I c:/code/.env/lib/site-packages/torch/include /I c:/code/.env/lib/site-packages/torch/include/torch/csrc/api/include /I c:/code/.env/lib/site-packages/torch/include/TH /I c:/code/.env/lib/site-packages/torch/include/THC /D TORCH_INDUCTOR_CPP_WRAPPER /D STANDALONE_TORCH_HEADER /D C10_USING_CUSTOM_GENERATED_MACROS /DLL /MD /O2 /std:c++20 /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc /openmp /openmp:experimental C:/Users/user/AppData/Local/Temp/torchinductor_user/ou/coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.cpp /LD /FeC:/Users/user/AppData/Local/Temp/torchinductor_user/ou/coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.pyd /link /LIBPATH:c:/code/.env/Scripts/libs /LIBPATH:c:/code/.env/lib/site-packages/torch/lib torch.lib torch_cpu.lib torch_python.lib sleef.lib

Output:
Microsoft (R) C/C++ Optimizing Compiler Version 19.43.34809 for x86
Copyright (C) Microsoft Corporation.  All rights reserved.

cl : Command line warning D9025 : overriding '/openmp' with '/openmp:experimental'
cl : Command line warning D9024 : unrecognized source file type 'Files/Python310/Include', object file assumed
coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.cpp
C:/Users/user/AppData/Local/Temp/torchinductor_user/ou/coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.cpp(21): fatal error C1083: Cannot open include file: 'Python.h': No such file or directory
```
Python is installed in the `C:/Program Files/Python310` path, and the blank space breaks the file path.

Solution:
Add quotes around Windows file paths. After that:
```cmd
cl /I "C:/Users/Xuhan/.conda/envs/new_build/Include" /I "C:/Users/Xuhan/.conda/envs/new_build/lib/site-packages/torch/include" /I "C:/Users/Xuhan/.conda/envs/new_build/lib/site-packages/torch/include/torch/csrc/api/include"  /D TORCH_INDUCTOR_CPP_WRAPPER /D STANDALONE_TORCH_HEADER /D  C10_USING_CUSTOM_GENERATED_MACROS /D CPU_CAPABILITY_AVX512  /DLL /MD /O2 /std:c++20 /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc /openmp /openmp:experimental  C:/Users/Xuhan/AppData/Local/Temp/tmp1wsj0m8r/za/czarp3ly5c22ge3hydvnzvad4cjimyr3hkwvofodxqffgil7frfd.cpp  /arch:AVX512  /FeC:/Users/Xuhan/AppData/Local/Temp/tmp1wsj0m8r/za/czarp3ly5c22ge3hydvnzvad4cjimyr3hkwvofodxqffgil7frfd.pyd /LD /link /LIBPATH:"C:/Users/Xuhan/.conda/envs/new_build/libs" /LIBPATH:"C:/Users/Xuhan/.conda/envs/new_build/lib/site-packages/torch/lib"  "torch.lib" "torch_cpu.lib" "torch_python.lib" "sleef.lib"
```
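
A minimal sketch of the quoting idea follows. The helper names `quote_path` and `include_flags` are hypothetical and are not the functions Inductor's C++ builder uses.

```python
def quote_path(path: str) -> str:
    """Wrap a path in double quotes so embedded spaces stay in one argument."""
    return f'"{path}"'


def include_flags(include_dirs):
    """Build `/I "<dir>"` flags for an MSVC-style command line."""
    return " ".join(f"/I {quote_path(d)}" for d in include_dirs)


flags = include_flags(
    ["C:/Program Files/Python310/Include", "c:/code/.env/lib/site-packages/torch/include"]
)
print(flags)
```

Passing the arguments to `subprocess` as a list instead of a single string is another way to avoid the problem, since each list element is delivered as one argument.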

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149388
Approved by: https://github.com/jansel
2025-03-20 03:10:30 +00:00
45a879e55b xpu: improve error handling and reporting in XPU cmake files (#149353)
For #149075

* Add a graceful cmake error instead of cryptic one if SYCL runtime is not found:
```
The link interface of target "c10_xpu" contains:

    torch::xpurt

  but the target was not found.
```
* Suppress unclear cmake error if SYCL compiler is not available and further version query fails:
```
CMake Error at /home/dvrogozh/pytorch/torch/share/cmake/Caffe2/FindSYCLToolkit.cmake:37 (string):
  string sub-command REGEX, mode REPLACE needs at least 6 arguments total to
  command.
```

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149353
Approved by: https://github.com/guangyey, https://github.com/malfet
2025-03-20 02:00:39 +00:00
3b7bd6c63d Fix dynamic shapes repordering bug (#149528)
When we create constraints, we look at the ordering of kwargs according to the model signature. But when we trace, we use the ordering in which the user passes their kwargs. As a result, constraints and dynamic shapes end up in a different order, which causes issues when they have different dynamic tensor specs.
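
The mismatch can be illustrated with `inspect.signature`. This sketch only shows the two orderings and one way to normalize them; the `forward` function and the `kwargs_in_signature_order` helper are hypothetical and are not the code changed by this PR.

```python
import inspect


def kwargs_in_signature_order(fn, kwargs):
    """Return kwargs reordered to follow the parameter order of `fn`."""
    params = inspect.signature(fn).parameters
    ordered = {name: kwargs[name] for name in params if name in kwargs}
    # Keep any extra keys (e.g. consumed by **kwargs) in their original order.
    ordered.update({k: v for k, v in kwargs.items() if k not in ordered})
    return ordered


def forward(x, y, z=None):  # hypothetical model signature
    return x, y, z


user_kwargs = {"z": 3, "x": 1, "y": 2}  # order chosen by the caller
print(list(user_kwargs))
print(list(kwargs_in_signature_order(forward, user_kwargs)))
```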

Differential Revision: [D71478578](https://our.internmc.facebook.com/intern/diff/D71478578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149528
Approved by: https://github.com/ydwu4
2025-03-20 01:57:44 +00:00
1e30192b19 [logging] Add python version to dynamo_compile table (#149419)
Summary: This adds a version field like the following: `3.10.9+fb (3.10:1dd9be6, May  4 2022, 01:23:45) [Clang 15.0.7 (mononoke://mononoke.internal.tfbnw.net/fbsource 5d1601b0eed7426ac`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149419
Approved by: https://github.com/c00w
2025-03-20 01:48:34 +00:00
1442230a26 Supporting non-tensor-data write_size in planner write items. (#149434)
Summary:
1\ The current write item structure does not contain the amount of data that needs to be written.
2\ The planner.item already has a size primitive 'tensor_storage_size' (https://fburl.com/code/7a0gsmw7), but only for tensors.
3\ Right now, the only way the writer layer can get hold of this property (for non-tensor data) is to:

- first do a lookup into the actual tensor/bytes
- then calculate the nbytes.

This change introduces a way to capture non-tensor data size within a write-plan item.

Reviewed By: daulet-askarov

Differential Revision: D70497442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149434
Approved by: https://github.com/MeetVadakkanchery
2025-03-20 01:22:05 +00:00
02e21c7854 Fix spelling (#149277)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149277
Approved by: https://github.com/zou3519
2025-03-20 01:02:32 +00:00
826e790696 Revert "ci: Remove mentions and usages of DESIRED_DEVTOOLSET (#149443)"
This reverts commit 95a633c45304755ebdbc08396d9948d34243ddb3.

Reverted https://github.com/pytorch/pytorch/pull/149443 on behalf of https://github.com/izaitsevfb due to fails lint ([comment](https://github.com/pytorch/pytorch/pull/149443#issuecomment-2738709561))
2025-03-20 00:59:41 +00:00
95a633c453 ci: Remove mentions and usages of DESIRED_DEVTOOLSET (#149443)
This is a remnant of our migration to manylinux2_28. We should remove
these, since all of our binary builds are now built with cxx11_abi.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149443
Approved by: https://github.com/izaitsevfb, https://github.com/atalman
2025-03-20 00:39:02 +00:00
cyy
29c4f2c07a Remove Ubuntu 18.04 scripts (#149479)
Ubuntu 18.04 reached end of life on May 31, 2023. This code isn't used now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149479
Approved by: https://github.com/malfet
2025-03-20 00:13:40 +00:00
6cbf97ede8 [ROCm] enable HIPMallocAsyncAllocator (#149145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145
Approved by: https://github.com/izaitsevfb

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-19 23:42:35 +00:00
2be97c7257 Update nightly s390x builds (#149337)
This change should fix new nightly build failures for s390x.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149337
Approved by: https://github.com/malfet
2025-03-19 23:27:14 +00:00
c9de76a1e4 Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker (#149540)
1. Use NCCL_VERSION=v2.26.2-1 . Fixes nccl cuda aarch64 related failure we see here: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443 . After landing: https://github.com/pytorch/pytorch/pull/149351
TODO: Followup required to unify NCCL definitions across the x86 and aarch64 builds

3. Cleanup: remove older CUDA versions for aarch64 builds. CUDA 12.6 was removed by: https://github.com/pytorch/pytorch/pull/148895
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149540
Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/nWEIdia
2025-03-19 23:20:05 +00:00
5005e1bc47 support multinomial for dynamic num_samples (#149463)
Test Plan: added test

Fixes #149048

Differential Revision: D71434914

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149463
Approved by: https://github.com/pianpwk
2025-03-19 23:15:29 +00:00
cc469aaf3b [CI][docker] Remove vulkan and swiftshader from docker builds (#149530)
Probably should have been removed with https://github.com/pytorch/pytorch/pull/139354/files?

Should I also remove mentions of them from build.sh and test.sh?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149530
Approved by: https://github.com/malfet
2025-03-19 23:13:27 +00:00
88c2fe533f [MPS] Add modified_bessel_k0 support to eager. (#149563)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149563
Approved by: https://github.com/malfet
2025-03-19 23:10:55 +00:00
bc86b6c55a Update ExecuTorch pin update (#149539)
Latest commit in https://hud.pytorch.org/hud/pytorch/executorch/viable%2Fstrict/1?per_page=50

Follow-up to https://github.com/pytorch/pytorch/issues/144480#issuecomment-2731150636

Also, need to incorporate change from https://github.com/pytorch/executorch/pull/8817

Test Plan:

Monitor  linux-jammy-py3-clang12-executorch test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149539
Approved by: https://github.com/larryliu0820
2025-03-19 22:29:59 +00:00
6974ba84f6 [ci][anaconda] Remove conda from linter docker images (#147789)
Remove conda usage from the linter docker images

Handles part of https://github.com/pytorch/pytorch/issues/148110
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147789
Approved by: https://github.com/atalman
2025-03-19 21:56:44 +00:00
a11538aa46 [GPU Snapshot] Add Clear History Flag (#149352)
Summary:
Oftentimes, users complain that a bunch of extra events are prepended to their desired GPU snapshot. This is because they usually attach an OOM logger without knowing and when they go to collect the actual snapshot, it adds all the OOM logger contents. Since OOM and regular snapshot use the same backend, we currently don't have the infra in place to split these snapshots.

As a solution we add a flag to the snapshot frontend to clear out the history when starting the auto-trace record memory history.

A more thorough solution would be to have a user pass in a handle and to have snapshots per handle to separate the events. However, this would likely be complicated and more work than it is worth, as we would have to change the callbacks in the caching allocator and pass these objects between python and cpp.

Test Plan:
See diff below

Differential Revision: D71159720

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149352
Approved by: https://github.com/eqy, https://github.com/aaronenyeshi
2025-03-19 21:44:20 +00:00
e1d143cb7b Revert "[ROCm] enable HIPMallocAsyncAllocator (#149145)"
This reverts commit ee1a2b7810126258ce64d1e22b59fae81a3f7bcb.

Reverted https://github.com/pytorch/pytorch/pull/149145 on behalf of https://github.com/izaitsevfb due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/149145#issuecomment-2738115728))
2025-03-19 21:12:13 +00:00
37bb7f79c6 [ROCm][TunableOp] Unit test for TunableOp BLAS logging. (#148982)
Add unit test for new TunableOp BLAS logging feature.

Requires this PR to be merged in first: https://github.com/pytorch/pytorch/pull/148979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148982
Approved by: https://github.com/jeffdaily
2025-03-19 20:57:19 +00:00
71daeddde2 [MTIA] Ensure correct stream behavior for input_buffer add autograd on MTIA (#149433)
Test Plan: CI

Differential Revision: D71414498

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149433
Approved by: https://github.com/albanD
2025-03-19 20:19:18 +00:00
fae79e91a0 Remove torch.export.export_for_inference (#149078)
Summary: Remove torch.export.export_for_inference, it is redundant and can always be replaced with torch.export.export_for_training() + run_decompositions()

Test Plan: unit tests

Differential Revision: D71069057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149078
Approved by: https://github.com/tugsbayasgalan
2025-03-19 19:57:18 +00:00
05fee772e5 Fix with effect lowering for list return type (#149510)
Summary: - For `torch.ops.higher_order.with_effects`'s lowering, we should not extract the items out of a list (i.e. `*result` vs `result`). The `get_attr` nodes consider the result to be in the list format.
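
The difference between `*result` and `result` can be shown with plain Python values. The function names and placeholder strings below are assumptions for illustration and do not reproduce the actual lowering.

```python
def pack_unpacked(token, result):
    """Splice the items of `result` next to the token (the `*result` form)."""
    return (token, *result)


def pack_as_list(token, result):
    """Keep `result` as a single list element (the `result` form)."""
    return (token, result)


token = "effect_token"      # placeholder for the effect token
result = ["out0", "out1"]   # placeholder for an op that returns a list

unpacked = pack_unpacked(token, result)
as_list = pack_as_list(token, result)
print(unpacked[1])  # a single item of the list
print(as_list[1])   # the whole list, which is what a list-typed consumer expects
```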

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r test_torchbind_aot_compile

buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r list_return

buck run //caffe2/torch/fb/sparsenn:sigrid_test -- -r test_transform_torch_bind # tested together with D70013257

buck run fbcode//mode/dev-nosan //caffe2/test:test_export  -- -r test_custom_obj
```

Reviewed By: angelayi

Differential Revision: D71346024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149510
Approved by: https://github.com/zou3519
2025-03-19 19:35:08 +00:00
842a072fd3 [codemod] Fix clang-tidy command line doc comments (#149524)
Summary:
Fixes the comments to match the latest updates to the checked-in tools.

Search/replace applied in this order:
* `# /fbsource/tools/lint/clangtidy/clang-tidy-platform010 -list-checks` -> `# ~/fbsource/tools/lint/clangtidy/clang-tidy-platform010-clang-17 -list-checks`
* `# ~/fbsource/tools/lint/clangtidy/clang-tidy-platform010 -list-checks` -> `# ~/fbsource/tools/lint/clangtidy/clang-tidy-platform010-clang-17 -list-checks`
* `fbsource/tools/lint/clangtidy/clang-tidy-platform010 -list-checks` -> `fbsource/tools/lint/clangtidy/clang-tidy-platform010-clang-17 -list-checks`

Test Plan: CI

Reviewed By: johnkearney

Differential Revision: D71431516

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149524
Approved by: https://github.com/janeyx99
2025-03-19 19:22:11 +00:00
96828a2155 [export] refactor DimHints for type errors (#149424)
Differential Revision: D71414367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149424
Approved by: https://github.com/justinchuby, https://github.com/avikchaudhuri
2025-03-19 18:51:07 +00:00
9ec9f4740c [export] fix stft decomp and making it consistent with cpp impl. (#149232)
Summary: We change the fake impl of stft to follow more closely with its cpp implementation [here](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/SpectralOps.cpp#L951-L963)

where `n_frames = 1 + (len - n_fft) / hop_length;` is also an integer division.
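
A small sketch of the frame-count arithmetic, assuming `length` is the signal length at the point where the quoted C++ expression runs. The function name `stft_n_frames` and the example numbers are hypothetical. For non-negative operands, Python's `//` matches C++ integer division.

```python
def stft_n_frames(length: int, n_fft: int, hop_length: int) -> int:
    """Frame count using integer division, mirroring the quoted C++ expression."""
    if length < n_fft:
        raise ValueError("length must be at least n_fft")
    return 1 + (length - n_fft) // hop_length


def stft_n_frames_true_div(length: int, n_fft: int, hop_length: int) -> float:
    """The same expression with true division, which can yield a non-integer."""
    return 1 + (length - n_fft) / hop_length


print(stft_n_frames(1000, 400, 160))
print(stft_n_frames_true_div(1000, 400, 160))
```

When `length - n_fft` is not a multiple of `hop_length`, the two forms disagree, so a fake implementation using true division can predict a different shape than the C++ kernel produces.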

Test Plan: Existing tests and buck2 build --flagfile fbcode//mode/dev fbcode//executorch/examples/models/fb/llama4:speech_transform.pte

Differential Revision: D71209142

edit: we kept the original path un-changed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149232
Approved by: https://github.com/jackzhxng
2025-03-19 18:40:35 +00:00
94d761fbf0 [AOTI][reland] Update test runner to use the new APIs (#149412)
Summary: Reland https://github.com/pytorch/pytorch/pull/147105. Switch to the newer aoti_compile_and_package APIs. Some tests still kept using legacy APIs, and will follow up with internal test refactoring.

Differential Revision: [D71470265](https://our.internmc.facebook.com/intern/diff/D71470265)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149412
Approved by: https://github.com/yushangdi
2025-03-19 17:56:44 +00:00
d686d04c2f [custom_ops][perf] Move expensive pytree traversals of tensors to C++ (#148555)
(benchmark for 1 call)

Before:
```
└─ $ python ~/task_custom_ops_perf/test_custom_ops_perf_repro.py
DO_BENCH mutate: 77.72445678710938 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/mutate.json
DO_BENCH no_mutate: 64.61143493652344 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/no_mutate.json
DO_BENCH direct_mutate: 11.682510375976562 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_mutate.json
DO_BENCH direct_no_mutate: 18.596649169921875 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_no_mutate.json
```

After:
```
└─ $ python ~/task_custom_ops_perf/test_custom_ops_perf_repro.py
DO_BENCH mutate: 47.6837158203125 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/mutate.json
DO_BENCH no_mutate: 31.709671020507812 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/no_mutate.json
DO_BENCH direct_mutate: 10.967254638671875 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_mutate.json
DO_BENCH direct_no_mutate: 10.728836059570312 us PROFILE:/home/ivankobzarev/task_custom_ops_perf/direct_no_mutate.json
```
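
The per-case speedup implied by the two runs above can be computed directly from the reported timings. The numbers are copied from the logs; the script itself is only an illustrative calculation.

```python
# Timings in microseconds, copied from the "Before" and "After" logs above.
before_us = {
    "mutate": 77.72445678710938,
    "no_mutate": 64.61143493652344,
    "direct_mutate": 11.682510375976562,
    "direct_no_mutate": 18.596649169921875,
}
after_us = {
    "mutate": 47.6837158203125,
    "no_mutate": 31.709671020507812,
    "direct_mutate": 10.967254638671875,
    "direct_no_mutate": 10.728836059570312,
}

speedup = {name: before_us[name] / after_us[name] for name in before_us}
for name, ratio in speedup.items():
    print(f"{name}: {ratio:.2f}x")
```

These are single-call measurements, so the ratios are indicative rather than statistically robust.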

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148555
Approved by: https://github.com/zou3519
2025-03-19 17:16:57 +00:00
518563d6ef Add release branch push triggers to rocm-mi300.yml (#149517)
When we added the rocm-mi300.yml earlier this year, we had lower capacity and we were just pipecleaning the workflow, so we set the trigger to only respond to pushes to main branch. But now we have more stability as well as capacity, and we would really like to ensure that the release branch is being tested on MI300s as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149517
Approved by: https://github.com/atalman
2025-03-19 16:14:09 +00:00
e98afa0f89 [Sigmoid] Remove magic method in CapabilityBasedPartitioner (#149400)
Summary: As title.

Test Plan: CI

Differential Revision: D70575197

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149400
Approved by: https://github.com/jfix71
2025-03-19 16:02:43 +00:00
4df66e0b7f Pin auditwheel to 6.2.0 (#149471)
Observing aarch64 failure in nightly:
https://github.com/pytorch/pytorch/actions/runs/13917778961/job/38943911228

Similar to: https://github.com/pytorch/vision/pull/8982

```
2025-03-18T08:44:58.4128744Z Repairing Wheel with AuditWheel
2025-03-18T08:44:58.5440988Z INFO:auditwheel.main_repair:Repairing torch-2.8.0.dev20250318+cpu-cp39-cp39-linux_aarch64.whl
2025-03-18T08:45:20.3393288Z Traceback (most recent call last):
2025-03-18T08:45:20.3393732Z   File "/opt/python/cp39-cp39/bin/auditwheel", line 8, in <module>
2025-03-18T08:45:20.3394115Z     sys.exit(main())
2025-03-18T08:45:20.3394559Z   File "/opt/_internal/cpython-3.9.21/lib/python3.9/site-packages/auditwheel/main.py", line 53, in main
2025-03-18T08:45:20.3395064Z     result: int | None = args.func(args, p)
2025-03-18T08:45:20.3395626Z   File "/opt/_internal/cpython-3.9.21/lib/python3.9/site-packages/auditwheel/main_repair.py", line 203, in execute
2025-03-18T08:45:20.3396163Z     out_wheel = repair_wheel(
2025-03-18T08:45:20.3396657Z   File "/opt/_internal/cpython-3.9.21/lib/python3.9/site-packages/auditwheel/repair.py", line 84, in repair_wheel
2025-03-18T08:45:20.3397184Z     raise ValueError(msg)
2025-03-18T08:45:20.3397620Z ValueError: Cannot repair wheel, because required library "libarm_compute.so" could not be located
2025-03-18T08:45:20.3678843Z Traceback (most recent call last):
2025-03-18T08:45:20.3679267Z   File "/pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py", line 236, in <module>
2025-03-18T08:45:20.3680988Z     pytorch_wheel_name = complete_wheel("/pytorch/")
2025-03-18T08:45:20.3681449Z   File "/pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py", line 141, in complete_wheel
2025-03-18T08:45:20.3681976Z     check_call(["auditwheel", "repair", f"dist/{wheel_name}"], cwd=folder)
2025-03-18T08:45:20.3682860Z   File "/opt/python/cp39-cp39/lib/python3.9/subprocess.py", line 373, in check_call
2025-03-18T08:45:20.3683308Z     raise CalledProcessError(retcode, cmd)
2025-03-18T08:45:20.3684034Z subprocess.CalledProcessError: Command '['auditwheel', 'repair', 'dist/torch-2.8.0.dev20250318+cpu-cp39-cp39-linux_aarch64.whl']' returned non-zero exit status 1.
2025-03-18T08:45:20.3790063Z ##[error]Process completed with exit code 1.
2025-03-18T08:45:20.3862012Z ##[group]Run pytorch/test-infra/.github/actions/teardown-linux@main
2025-03-18T08:45:20.3862448Z with:
```

Please note aarch64 CUDA failures are related to: https://github.com/pytorch/pytorch/pull/149351
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149471
Approved by: https://github.com/malfet
2025-03-19 15:55:05 +00:00
1bf443e2f2 [aoti x with_effect token] Unbacked symint and register lowering (#147656)
Differential Revision: D70022208

- When resolving unbacked symints in ExternKernel for with_effect, we need to ignore the first item in the binding path, because the `example_output` doesn't contain the effect token, but the binding paths do.
- Similarly, `node.meta["val"]` contains the effect token, so when we compute_unbacked_bindings, we need to remove that effect token

- For `torch.ops.higher_order.with_effects`'s lowering, we should not extract the items out of a list (i.e. `*result` vs `result`). The `get_attr` nodes consider the result to be in the list format.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147656
Approved by: https://github.com/angelayi, https://github.com/zou3519
2025-03-19 14:38:30 +00:00
2fcfae72b4 async fx compile (#146135)
Adds the ability to run the selected out-of-process fx compile scheme in async mode - where we kick off the compile and then run eagerly until the compile is finished.

Added a test which runs a tiny model in a loop making sure that we execute it both eagerly and then compiled.
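
The eager-then-compiled handoff described above can be sketched with the standard library alone. This is a toy illustration, not the PR's implementation: `fake_compile`, `AsyncCompiled`, and the `gate` event are hypothetical names, and a thread pool stands in for the out-of-process compile scheme.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def fake_compile(fn, gate):
    """Hypothetical stand-in for a slow compile; blocks until `gate` is set."""
    gate.wait()
    return lambda x: ("compiled", fn(x))


class AsyncCompiled:
    """Runs `fn` eagerly until the background compile has finished."""

    def __init__(self, fn, pool, gate):
        self._eager = fn
        self._future = pool.submit(fake_compile, fn, gate)

    def __call__(self, x):
        if self._future.done():
            # Compile finished: switch to the compiled callable.
            return self._future.result()(x)
        # Compile still running: fall back to eager execution.
        return ("eager", self._eager(x))

    def wait(self):
        self._future.result()


def demo():
    gate = threading.Event()
    modes = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        model = AsyncCompiled(lambda x: x * 2, pool, gate)
        for i in range(3):
            mode, value = model(i)
            assert value == i * 2  # eager result must be correct
            modes.append(mode)
        gate.set()    # let the "compile" finish
        model.wait()  # make sure the future is done before the next call
        mode, value = model(3)
        assert value == 6  # compiled result must agree with eager
        modes.append(mode)
    return modes
```

Calling `demo()` runs the tiny model in a loop and records which path served each call, which is roughly the shape of the test described above.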

Differential Revision: [D71135546](https://our.internmc.facebook.com/intern/diff/D71135546)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146135
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2025-03-19 14:07:51 +00:00
1dce65a82c Fix the invalid link for FX (#149289)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149289
Approved by: https://github.com/zou3519
2025-03-19 14:03:18 +00:00
97910b6c00 Update s390x docker image (#148444)
New releases of ml_dtypes successfully build on s390x, skip building patched old release.
Unpin grpcio version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148444
Approved by: https://github.com/seemethere
2025-03-19 12:25:10 +00:00
7ca296f564 Document patched podman build for s390x runners (#147618)
Podman patches from upstream are needed to resolve a couple of issues hit when using it.
Document the automated build of podman with applied patches fixing those issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147618
Approved by: https://github.com/seemethere
2025-03-19 12:25:05 +00:00
cfbeaf7b7e Improve docker build cleanup on s390x runners (#149316)
Currently it sometimes still leaves a couple of processes running.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149316
Approved by: https://github.com/seemethere
2025-03-19 10:10:44 +00:00
466d5295c1 Fixed abnormal behavior of LazyLinear when using LazyLinear and load_state together (#147599)
Update Points:
- Update the logic of ``initialize_parameters``
- Add new testcases

Related issue:
https://github.com/pytorch/pytorch/issues/147389
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147599
Approved by: https://github.com/mikaylagawarecki
2025-03-19 10:01:12 +00:00
8bf3f3fc43 [c10d] Add a collective time estimator for NCCL comms (#149343)
We want to upstream the feature from new NCCL that lets users estimate comm time.

Resolves #147753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149343
Approved by: https://github.com/kwen2501
2025-03-19 07:54:02 +00:00
b963d96bad [Torchscript] Add a flag to use mangled names instead of demangled (#148906)
Summary: Optionally keep mangled names when expanding torchscript stacks

Test Plan:
```
buck2 build mode/opt //scripts/rihams/LearnPyTorch:torch_script_generate --show-full-output

/data/users/rihams/fbsource/buck-out/v2/gen/fbcode/0bd9d136228ad8a7/scripts/rihams/LearnPyTorch/__torch_script_generate__/torch_script_generate.par

buck2 build mode/opt //scripts/rihams/LearnPyTorch:torch_script_execute --show-full-output
```

- With `--torch_jit_expanded_stacks_mangled` Flag:

/data/users/rihams/fbsource/buck-out/v2/gen/fbcode/ef35e45045e8164c/scripts/rihams/LearnPyTorch/__torch_script_execute__/torch_script_execute fbcode/model.pt  --torch_jit_expanded_stacks_mangled --torch_jit_enable_expanded_stacks

https://fburl.com/scuba/strobelight_function_tracer/8die4rvm

{F1975933247}

Without Flag:

/data/users/rihams/fbsource/buck-out/v2/gen/fbcode/ef35e45045e8164c/scripts/rihams/LearnPyTorch/__torch_script_execute__/torch_script_execute ./model.pt   --torch_jit_enable_expanded_stacks

https://fburl.com/scuba/strobelight_function_tracer/x3nladpf

 {F1975933268}

Reviewed By: bbus

Differential Revision: D70905872

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148906
Approved by: https://github.com/zdevito
2025-03-19 07:53:02 +00:00
3e78c9e967 [ROCm][Windows] Disable hipSPARSE and CK declarations and remove references for Windows (#149195)
This PR removes references to `hipSPARSE` and `ck` functions and disables declarations which are not supported on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149195
Approved by: https://github.com/jeffdaily

Co-authored-by: Michal Gallus <Michal.Gallus@amd.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-19 07:30:53 +00:00
2cb42f26c1 Remove test_get_model_state_dict_del_memory (#149460)
test_get_model_state_dict_del_memory sees unexpected memory usage, leading to test failures.
Remove the test for now to avoid blocking the others.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149460
Approved by: https://github.com/fegin
2025-03-19 07:06:46 +00:00
e8a35eb7da Add Missing Communication collectives (#147379)
----

- reduce_add_coalesced
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147379
Approved by: https://github.com/mikaylagawarecki
2025-03-19 06:59:04 +00:00
981807cfcb [Inductor][Optimus] split cat aten pass (#149027)
Summary:
We add an aten pattern to optimize big cat nodes with arbitrary input order, to support APS jobs.

context: https://docs.google.com/document/d/1G2qFcQu1K7VXbz2uPe0CS2aBirnwtwI_B8lxmlBlAPQ/edit?tab=t.0

Test Plan:
### how to enable
Add the following patterns to the post grad
```
        post_grad_fusion_options={
            "normalization_aten_pass": {},
            "split_cat_aten_pass": {"threshold_to_cat": 10},
        },
```
You can tune threshold_to_cat to achieve the best performance. If no value is given, the default value of 10 will be used.

### unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_cat_post_grad
```

Buck UI: https://www.internalfb.com/buck2/9e52168d-c107-4be8-a46b-b9d239f5c50d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/17732923605061752
Network: Up: 112KiB  Down: 132KiB  (reSessionID-915796e0-4a8f-486a-9f63-afb1e191d24a)
Executing actions. Remaining     0/3                                                                                   1.0s exec time total
Command: test.     Finished 2 local
Time elapsed: 4:57.9s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

### E2E

baseline

f691990503

proposal

Differential Revision: D71017436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149027
Approved by: https://github.com/Yuzhen11
2025-03-19 06:01:05 +00:00
f123f2c077 [ca] fix dce for side-effects (#149336)
The AOT backward could have contained side effectful ops, so we can't DCE them. Have CA also call the default fx.Node.is_impure which will cover some of the existing cases
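
As a rough sketch of why impure nodes must survive DCE, the toy pass below keeps everything reachable from the outputs plus every node flagged as having side effects. The `Node` class, the `impure` flag, and the node names are hypothetical stand-ins for illustration; the real check is `fx.Node.is_impure`, which this code does not call.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    inputs: list = field(default_factory=list)
    impure: bool = False  # True for ops with side effects (hypothetical flag)


def dead_code_eliminate(nodes, outputs):
    """Keep nodes reachable from `outputs`, plus impure nodes and their inputs."""
    by_name = {n.name: n for n in nodes}
    live = set()
    # Impure nodes are treated as roots, exactly like graph outputs.
    stack = list(outputs) + [n.name for n in nodes if n.impure]
    while stack:
        name = stack.pop()
        if name in live:
            continue
        live.add(name)
        stack.extend(by_name[name].inputs)
    return [n for n in nodes if n.name in live]


graph = [
    Node("grad"),
    Node("scaled", ["grad"]),
    Node("unused", ["grad"]),  # pure and unused: safe to remove
    Node("copy_into_buffer", ["scaled"], impure=True),  # side effect: must stay
    Node("out", ["grad"]),
]
kept = [n.name for n in dead_code_eliminate(graph, outputs=["out"])]
```

Here `unused` is dropped, while `copy_into_buffer` and its input `scaled` are kept even though no output depends on them.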

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149336
Approved by: https://github.com/jansel
2025-03-19 05:56:47 +00:00
ddb076591d [executorch hash update] update the pinned executorch hash (#147422)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147422
Approved by: https://github.com/pytorchbot
2025-03-19 05:22:35 +00:00
42bd4a09a3 [MTIA] Add _mtia_getCurrentRawStream to MTIA module (#149436)
Summary: The FlexAttention path generates code that uses this function. Although streams are not used yet in Triton-MTIA, adding this now allows us to not branch out just for MTIA and generate different code.

Test Plan: CI

Reviewed By: chaos5958

Differential Revision: D70072057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149436
Approved by: https://github.com/chaos5958
2025-03-19 05:17:51 +00:00
ef93cdfb8a [audio hash update] update the pinned audio hash (#149467)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149467
Approved by: https://github.com/pytorchbot
2025-03-19 04:28:57 +00:00
ee1a2b7810 [ROCm] enable HIPMallocAsyncAllocator (#149145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-19 03:59:55 +00:00
20874a1f46 debug ival swap (#149206)
Summary:
Recall that we use "ivals" to track intermediate values of mutations during unflattening. Previously, for each such intermediate value, we would create a hidden shared attribute that would be updated / read by respective submodules.

Unfortunately this scheme doesn't work when some but not all of those submodules are swapped out. This is because the swapped in submodules have no knowledge of these hidden attributes. Thus the submodules that are not swapped out end up reading / updating dangling state.

This PR does away with these hidden attributes. Instead, we directly read the underlying buffer or placeholder that was updated, and update those underlying buffers and placeholders in place. This makes the graphs look much closer to their eager origins.

Test Plan: added some tests, ensured existing tests pass

Differential Revision: D71203469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149206
Approved by: https://github.com/tugsbayasgalan
2025-03-19 03:43:30 +00:00
14dc6e732d Cache the get_device_module result (#149207)
Summary: As title.

Test Plan: OSS CIs.

Reviewed By: chaos5958

Differential Revision: D71084180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149207
Approved by: https://github.com/jansel
2025-03-19 03:20:38 +00:00
01a57981aa [export] Add TracingContext (#149294)
TracingContext is added to all tracing locations -- in torch.export this is where we call make_fx (for training IR) and aot_export_module (for inference IR), and in run_decompositions where we call aot_export_module

Differential Revision: [D71298927](https://our.internmc.facebook.com/intern/diff/D71298927)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149294
Approved by: https://github.com/ydwu4
2025-03-19 03:11:08 +00:00
a3c286677b [compile] Switch off inference mode during compilation (#149321)
This PR does the following:
* Turns `inference_mode` to False and `no_grad` for `convert_frame`, if the inference_mode is on globally.
* Turns off inference_mode for fake tensor prop. This ensures that converting from real inference tensor to a fake tensor removes the inference-ness.
* Graph breaks on is_inference and is_inference_mode_enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149321
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-03-19 02:45:27 +00:00
04e251a7dd [AOTI] Add num_runners to AOTIModelPackageLoader (#149364)
Summary: AOTIModelContainerRunner takes a num_runners argument for multi-threaded inference, but AOTIModelPackageLoader forgot to take the same parameter, although its run() API already expects to take an optional cudaStream_t parameter for multi-threaded inference.

Differential Revision: [D71357418](https://our.internmc.facebook.com/intern/diff/D71357418)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149364
Approved by: https://github.com/angelayi
2025-03-19 02:28:06 +00:00
536c0c7a47 [codemod][lowrisk] Remove unused exception parameter from caffe2/aten/src/ATen/cuda/CUDABlas.cpp (#149328)
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: dtolnay

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149328
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-03-19 02:05:33 +00:00
919d54b7b1 Fix format string in ck_gemm_template.h for int64_t variables (#149438)
Summary:
Change %d to %ld in printf format specifier to correctly handle int64_t variables n, m, k.
This fixes compilation errors in HIP builds where the format string didn't match the argument type.

forward fix for D71412006

```
In file included from fbcode/caffe2/aten/src/ATen/native/hip/ck_gemm_bfloat16.hip:4:
fbcode/caffe2/aten/src/ATen/native/hip/ck_gemm_template.h:386:28: error: format specifies type 'int' but the argument has type 'int64_t' (aka 'long') [-Werror,-Wformat]
  385 |         printf("error shape = %d %d %d TRANSA=%d TRANSB=%d \n",
      |                                  ~~
      |                                  %ld
  386 |                         n, m, k,TRANSA, TRANSB);
      |                            ^
fbcode/caffe2/aten/src/ATen/native/hip/ck_gemm_template.h:386:31: error: format specifies type 'int' but the argument has type 'int64_t' (aka 'long') [-Werror,-Wformat]
  385 |         printf("error shape = %d %d %d TRANSA=%d TRANSB=%d \n",
      |                                     ~~
      |                                     %ld
  386 |                         n, m, k,TRANSA, TRANSB);
      |                               ^
fbcode/caffe2/aten/src/ATen/native/hip/ck_gemm_template.h:386:25: error: format specifies type 'int' but the argument has type 'int64_t' (aka 'long') [-Werror,-Wformat]
  385 |         printf("error shape = %d %d %d TRANSA=%d TRANSB=%d \n",
      |                               ~~
      |                               %ld
  386 |                         n, m, k,TRANSA, TRANSB);
      |                         ^
```

Test Plan:
```
buck2 build --flagfile fbcode//mode/opt-amd-gpu fbcode//torchrec/sparse/tests:test_jagged_tensor_gpu
```

Differential Revision: D71418611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149438
Approved by: https://github.com/ZainRizvi
2025-03-19 01:46:34 +00:00
6bcf9c6ce3 [xnnpack] Expose subgraph symbols (#149397)
Summary: The main XNNPack target code uses symbols from subgraph, so they need to be exported. This was uncovered on macOS, where the symbols were not visible after linking.

Test Plan: CI / used for a macOS build on top of the stack.

Differential Revision: D71315023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149397
Approved by: https://github.com/digantdesai
2025-03-19 01:14:46 +00:00
11d4438a5f [ROCm][TunableOp] More TF32 support. (#149088)
This PR includes additional enhancements to TF32 support in TunableOp.
- OpSignature now differentiates between float32 and tf32 data types.
- Offline tuning now supports TF32.
- Unit tests for online and offline tuning of TF32.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149088
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-19 00:26:20 +00:00
268de64005 [ROCm][Windows] Enable torchvision build with ROCm on Windows (#147382)
- Updated HIP flags for Windows (removed non Windows flags on Windows case, added runtime library)
- Set hipcc call for Windows case
- Removed CUDA flags (not used in ROCm) on Windows
- Updated Windows compiler (added case when using ROCm on Windows)
- Fixed path issue in hipify_python

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147382
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-18 23:37:05 +00:00
61a64c20c4 [MPSInductor] Move threadfence at the right location (#149437)
Not sure how it worked in the past, but fence should be before first read from the shared memory, not after it.
This bug was exposed by https://github.com/pytorch/pytorch/pull/148969 which removed unnecessary barrier before calling `threadgroup_reduce` functions
Test plan:
```
% python3 generate.py --checkpoint_path checkpoints/stories15M/model.pth --prompt "Once upon a time" --device mps --compile
```
Before that it produced gibberish, now it works fine
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149437
Approved by: https://github.com/manuelcandales, https://github.com/dcci
2025-03-18 23:27:19 +00:00
ea02aac2ca [export] Update remove runtime asserts pass (#149198)
Test Plan: CI -- Removing asserts should be a noop

Differential Revision: D69566851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149198
Approved by: https://github.com/pianpwk
2025-03-18 23:07:25 +00:00
5db3a4ac88 [Build] Guard per-op headers in ACLUtils.cpp (#149417)
To fix internal build failures, where per-op headers are not generated.
We really should have lint for something like that.

Test Plan: CI

Reviewed By: izaitsevfb

Differential Revision: D71406882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149417
Approved by: https://github.com/Skylion007, https://github.com/izaitsevfb
2025-03-18 22:56:29 +00:00
45fec7843d Fix local compilation and hipification (#149384)
Summary:
As title, we need to fix the issue introduced from
https://github.com/pytorch/pytorch/pull/148305

Test Plan: CI and e2e https://docs.google.com/document/d/1Bu-MxJCkN7WaRkKJLVBQvnSp8yV0v3Aeb3Y9R5sjeHw/edit?tab=t.0

Differential Revision: D71373001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149384
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/chenyang78
2025-03-18 22:56:02 +00:00
0d804dec0f [Profiler/Easy] Pass Overload Names To Kineto (#149333)
Summary: Right now we get Overload names and forward them to the Event List frontend for profiler but we do not forward anything to kineto. This diff checks if there is an overload name for each cpu op and appends it to the name if necessary

Test Plan: Added test in CI

Differential Revision: D71326670

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149333
Approved by: https://github.com/aaronenyeshi
2025-03-18 22:15:51 +00:00
3b48c72141 [export] Minor refactor to trace.py (#149240)
Minor refactor to trace.py
* Removed `_strict_export_lower_to_aten_ir` in favor of just `_strict_export` and `_non_strict_export`
* Matched the APIs of `_strict_export` and `_non_strict_export`
    * Instead of a `lower_to_aten_callback` which is a callable, or `dispatch_tracing_mode`, both functions take in a `_to_aten_func` which can be either `_export_to_aten_ir_make_fx` or `_export_to_aten_ir`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149240
Approved by: https://github.com/pianpwk
2025-03-18 21:40:30 +00:00
010963032c [ONNX] Create onnx_symbolic (#148905)
In the old exporter we allow users to define a symbolic() method to bypass JIT tracing for a block of logic. We can allow users to do similar things by creating symbolic ops at export.

This PR implements `torch.onnx.ops.symbolic` and `torch.onnx.ops.symbolic_multi_out` to allow users to create onnx nodes symbolically with pt2 & fx. The custom pytorch ops were designed such that the attributes are encoded to be part of a valid fx op. Users provide shape and dtype for the meta function to produce the correct fake tensor during export.

An example is

![image](https://github.com/user-attachments/assets/c62f5f21-e038-456e-a71d-b9a5d0a7cd9d)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148905
Approved by: https://github.com/titaiwangms
2025-03-18 21:32:06 +00:00
d80a70b58a Avoid unnecessary clone in torch.cuda.set_rng_state (#149283)
Clone has a performance issue according to f49c3eb6e6/megatron/core/tensor_parallel/random.py (L77-L80)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149283
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-03-18 20:47:57 +00:00
cd5c13d8f0 [hop] Rework the check of Metadata in the functionalization key (#148789)
This PR is a more cosmetic rework of the metadata check performed by some HOPs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148789
Approved by: https://github.com/ydwu4
2025-03-18 20:30:59 +00:00
f06e366532 partitioner: treat inputs with static indices as free to save (#148922)
Fixes https://github.com/pytorch/pytorch/issues/141881

internal xref: https://fb.workplace.com/groups/1075192433118967/posts/1538435030128036/?comment_id=1556782068293332

I tried to make a test case out of the code linked in that github issue. The setup + bad outcome today was as follows:

(1) you have a graph where one of its inputs is a model weight

(2) in the backward, you do some downstream compute on `weight`, `tmp = f(weight)`, where (a) `tmp` is of a smaller size than `weight`, and (b) the compute is trivially fusible into other kernels (so the partitioner thinks it is "free" to recompute)

(3) since `sizeof(tmp) < sizeof(weight)` and the recompute is free, the partitioner decides that it would be strictly better to save `tmp` for backward instead of weight

(4) this is bad: `weight` is a static tensor that sits in GPU memory for the duration of your entire training loop, so saving it for backward has no negative impact on peak memory.  Since we're saving `tmp` instead, we end up unnecessarily increasing peak memory. In particular - the repro involves an autograd.Function in eager that saves the weight for bw, so we end up hitting higher peak memory in compile

The fix I'm trying out in this PR is to tell the partitioner that graph inputs that we know have static addresses (aka parameters) are "free" to save.
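
The decision in steps (1) to (4) can be illustrated with a toy cost model. This is only a sketch under simplifying assumptions, not the partitioner's actual algorithm: `choose_saved_tensor` is a hypothetical helper, every candidate is assumed free to recompute from the others, and the sizes mirror a bf16 `[64, 64]` weight versus an fp8 `[64, 64]` `tmp`.

```python
def choose_saved_tensor(candidates, static_inputs):
    """Pick the cheapest tensor to save for backward.

    candidates: dict mapping tensor name -> size in bytes
    static_inputs: names known to have static addresses (e.g. parameters)
    """
    def cost(name):
        # A static input already sits in memory for the whole training loop,
        # so saving it adds nothing to peak memory.
        return 0 if name in static_inputs else candidates[name]

    return min(candidates, key=cost)


# bf16 weight (2 bytes/elem) vs. fp8 tmp (1 byte/elem), both 64x64.
sizes = {"weight": 64 * 64 * 2, "tmp": 64 * 64 * 1}

before = choose_saved_tensor(sizes, static_inputs=set())        # picks "tmp"
after = choose_saved_tensor(sizes, static_inputs={"weight"})    # picks "weight"
```

Without the static-input rule the smaller `tmp` wins on size alone; once `weight` is marked as free to save, it is chosen instead and no extra tensor is kept alive.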

Below is the fw/bw graph before my change, where you can see that instead of `primals_2` being saved for backward, we save `t_8` (which involves some low precision downstream compute on `primals_2`, that is only needed in the backward).

```
 ===== Forward graph 0 =====
 /data/users/hirsheybar/checkout2/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "bf16[64, 64][64, 1]cuda:0", primals_2: "bf16[64, 64][64, 1]cuda:0", primals_3: "bf16[64][1]cuda:0"):
         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6943 in forward, code: out = Fp8LinearFn.apply(
        abs_1: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(primals_1)
        view: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_1, [64, 1, 64]);  abs_1 = None
        amax: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view, [-1]);  view = None
        abs_2: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(primals_2)
        view_1: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_2, [64, 1, 64]);  abs_2 = None
        amax_1: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_1, [-1]);  view_1 = None
        _to_copy: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax, dtype = torch.float32);  amax = None
        clamp: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy, 1e-12);  _to_copy = None
        div: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp, 448.0);  clamp = None
        reciprocal: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div)
        view_2: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(primals_1, [64, 1, 64])
        view_3: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_2, [64, 1, 1, 64]);  view_2 = None
        slice_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal, 0, 0, 9223372036854775807);  reciprocal = None
        unsqueeze: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_1, 1);  slice_1 = None
        slice_2: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze, 2, 0, 9223372036854775807);  unsqueeze = None
        unsqueeze_1: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_2, 3);  slice_2 = None
        mul: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_3, unsqueeze_1);  view_3 = unsqueeze_1 = None
        view_4: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul, [64, 1, 64]);  mul = None
        view_5: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_4, [64, 64]);  view_4 = None
        _to_copy_1: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_5, dtype = torch.float8_e4m3fn);  view_5 = None
        _to_copy_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax_1, dtype = torch.float32)
        clamp_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_2, 1e-12);  _to_copy_2 = None
        div_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_1, 448.0);  clamp_1 = None
        reciprocal_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_1)
        view_6: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(primals_2, [64, 1, 64])
        view_7: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_6, [64, 1, 1, 64]);  view_6 = None
        slice_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_1, 0, 0, 9223372036854775807);  reciprocal_1 = None
        unsqueeze_2: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_3, 1);  slice_3 = None
        slice_4: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_2, 2, 0, 9223372036854775807);  unsqueeze_2 = None
        unsqueeze_3: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_4, 3);  slice_4 = None
        mul_1: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_7, unsqueeze_3);  view_7 = unsqueeze_3 = None
        view_8: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_1, [64, 1, 64]);  mul_1 = None
        view_9: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_8, [64, 64]);  view_8 = None
        _to_copy_3: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_9, dtype = torch.float8_e4m3fn);  view_9 = None
        t: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(div_1);  div_1 = None
        new_ones: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(div, [1, 1], pin_memory = False)
        new_ones_1: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(t, [1, 1], pin_memory = False)
        t_2: "f8e4m3fn[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(_to_copy_3);  _to_copy_3 = None
        t_3: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.t.default(new_ones_1);  new_ones_1 = None
        _scaled_mm: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._scaled_mm.default(_to_copy_1, t_2, new_ones, t_3, None, None, torch.bfloat16);  _to_copy_1 = t_2 = new_ones = t_3 = None
        view_10: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(_scaled_mm, [64, 1, 64]);  _scaled_mm = None
        view_11: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_10, [64, 1, 1, 64]);  view_10 = None
        slice_5: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(div, 0, 0, 9223372036854775807);  div = None
        unsqueeze_4: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_5, 1);  slice_5 = None
        slice_6: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_4, 2, 0, 9223372036854775807);  unsqueeze_4 = None
        unsqueeze_5: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_6, 3);  slice_6 = None
        mul_2: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_11, unsqueeze_5);  view_11 = unsqueeze_5 = None
        view_12: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_2, [64, 1, 64]);  mul_2 = None
        view_13: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_12, [64, 64]);  view_12 = None
        view_14: "f32[1, 64, 64][4096, 64, 1]cuda:0" = torch.ops.aten.view.default(view_13, [1, 64, 64]);  view_13 = None
        view_15: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.view.default(view_14, [1, 64, 64, 1]);  view_14 = None
        slice_7: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(t, 0, 0, 9223372036854775807);  t = None
        unsqueeze_6: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_7, 1);  slice_7 = None
        slice_8: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_6, 2, 0, 9223372036854775807);  unsqueeze_6 = None
        unsqueeze_7: "f32[1, 1, 64, 1][1, 64, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_8, 3);  slice_8 = None
        mul_3: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_15, unsqueeze_7);  view_15 = unsqueeze_7 = None
        view_16: "f32[64, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.view.default(mul_3, [64, 64, 1]);  mul_3 = None
        view_17: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_16, [64, 64]);  view_16 = None
        _to_copy_4: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_17, dtype = torch.bfloat16);  view_17 = None
        add: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.add.Tensor(_to_copy_4, primals_3);  _to_copy_4 = primals_3 = None
        t_4: "bf16[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(primals_2);  primals_2 = None
        clone: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.clone.default(t_4, memory_format = torch.contiguous_format);  t_4 = None
        t_5: "bf16[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(amax_1);  amax_1 = None
        view_21: "bf16[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.view.default(t_5, [1, 1, 64]);  t_5 = None
        amax_3: "bf16[1, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_21, [-1]);  view_21 = None
        unsqueeze_8: "bf16[1, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(amax_3, 1);  amax_3 = None
        expand: "bf16[1, 64, 1][1, 0, 1]cuda:0" = torch.ops.aten.expand.default(unsqueeze_8, [1, 64, 1])
        clone_1: "bf16[1, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
        view_22: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.view.default(clone_1, [64, 1]);  clone_1 = None
        _to_copy_7: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(view_22, dtype = torch.float32);  view_22 = None
        clamp_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_7, 1e-12);  _to_copy_7 = None
        div_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_3, 448.0);  clamp_3 = None
        reciprocal_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_3);  div_3 = None
        view_27: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(clone, [64, 1, 64]);  clone = None
        view_28: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_27, [64, 1, 1, 64]);  view_27 = None
        slice_11: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_3, 0, 0, 9223372036854775807);  reciprocal_3 = None
        unsqueeze_11: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_11, 1);  slice_11 = None
        slice_12: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_11, 2, 0, 9223372036854775807);  unsqueeze_11 = None
        unsqueeze_12: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_12, 3);  slice_12 = None
        mul_5: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_28, unsqueeze_12);  view_28 = unsqueeze_12 = None
        view_29: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_5, [64, 1, 64]);  mul_5 = None
        view_30: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_29, [64, 64]);  view_29 = None
        _to_copy_8: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_30, dtype = torch.float8_e4m3fn);  view_30 = None
        t_8: "f8e4m3fn[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(_to_copy_8);  _to_copy_8 = None

        # No stacktrace found for following nodes
        view_39: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(add, [64, 64]);  add = None
        return (view_39, primals_1, unsqueeze_8, t_8)

INFO: TRACED GRAPH
 ===== Backward graph 0 =====
 <eval_with_key>.1 class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "bf16[64, 64][64, 1]cuda:0", unsqueeze_8: "bf16[1, 1, 1][1, 1, 1]cuda:0", t_8: "f8e4m3fn[64, 64][1, 64]cuda:0", tangents_1: "bf16[64, 64][64, 1]cuda:0"):
         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6946 in forward, code: out = out.unflatten(0, input.shape[:-1])
        view_19: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(tangents_1, [64, 64]);  tangents_1 = None

         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6943 in forward, code: out = Fp8LinearFn.apply(
        abs_3: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(view_19)
        view_20: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_3, [64, 1, 64]);  abs_3 = None
        amax_2: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_20, [-1]);  view_20 = None
        expand: "bf16[1, 64, 1][1, 0, 1]cuda:0" = torch.ops.aten.expand.default(unsqueeze_8, [1, 64, 1]);  unsqueeze_8 = None
        clone_1: "bf16[1, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
        view_22: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.view.default(clone_1, [64, 1]);  clone_1 = None
        _to_copy_5: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax_2, dtype = torch.float32);  amax_2 = None
        clamp_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_5, 1e-12);  _to_copy_5 = None
        div_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_2, 448.0);  clamp_2 = None
        reciprocal_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_2)
        view_23: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_19, [64, 1, 64])
        view_24: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_23, [64, 1, 1, 64]);  view_23 = None
        slice_9: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_2, 0, 0, 9223372036854775807);  reciprocal_2 = None
        unsqueeze_9: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_9, 1);  slice_9 = None
        slice_10: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_9, 2, 0, 9223372036854775807);  unsqueeze_9 = None
        unsqueeze_10: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_10, 3);  slice_10 = None
        mul_4: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_24, unsqueeze_10);  view_24 = unsqueeze_10 = None
        view_25: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_4, [64, 1, 64]);  mul_4 = None
        view_26: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_25, [64, 64]);  view_25 = None
        _to_copy_6: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_26, dtype = torch.float8_e4m3fn);  view_26 = None
        _to_copy_7: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(view_22, dtype = torch.float32);  view_22 = None
        clamp_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_7, 1e-12);  _to_copy_7 = None
        div_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_3, 448.0);  clamp_3 = None
        t_6: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(div_3);  div_3 = None
        new_ones_2: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(div_2, [1, 1], pin_memory = False)
        new_ones_3: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(t_6, [1, 1], pin_memory = False)
        t_9: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.t.default(new_ones_3);  new_ones_3 = None
        _scaled_mm_1: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._scaled_mm.default(_to_copy_6, t_8, new_ones_2, t_9, None, None, torch.bfloat16);  _to_copy_6 = t_8 = new_ones_2 = t_9 = None
        view_31: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(_scaled_mm_1, [64, 1, 64]);  _scaled_mm_1 = None
        view_32: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_31, [64, 1, 1, 64]);  view_31 = None
        slice_13: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(div_2, 0, 0, 9223372036854775807);  div_2 = None
        unsqueeze_13: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_13, 1);  slice_13 = None
        slice_14: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_13, 2, 0, 9223372036854775807);  unsqueeze_13 = None
        unsqueeze_14: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_14, 3);  slice_14 = None
        mul_6: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_32, unsqueeze_14);  view_32 = unsqueeze_14 = None
        view_33: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_6, [64, 1, 64]);  mul_6 = None
        view_34: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_33, [64, 64]);  view_33 = None
        view_35: "f32[1, 64, 64][4096, 64, 1]cuda:0" = torch.ops.aten.view.default(view_34, [1, 64, 64]);  view_34 = None
        view_36: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.view.default(view_35, [1, 64, 64, 1]);  view_35 = None
        slice_15: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(t_6, 0, 0, 9223372036854775807);  t_6 = None
        unsqueeze_15: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_15, 1);  slice_15 = None
        slice_16: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_15, 2, 0, 9223372036854775807);  unsqueeze_15 = None
        unsqueeze_16: "f32[1, 1, 64, 1][1, 64, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_16, 3);  slice_16 = None
        mul_7: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_36, unsqueeze_16);  view_36 = unsqueeze_16 = None
        view_37: "f32[64, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.view.default(mul_7, [64, 64, 1]);  mul_7 = None
        view_38: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_37, [64, 64]);  view_37 = None
        _to_copy_9: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_38, dtype = torch.bfloat16);  view_38 = None
        t_10: "bf16[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(view_19)
        mm: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.mm.default(t_10, primals_1);  t_10 = primals_1 = None
        sum_1: "bf16[64][1]cuda:0" = torch.ops.aten.sum.dim_IntList(view_19, [0]);  view_19 = None
        return (_to_copy_9, mm, sum_1)

```

With the change, we save `primals_2` for backward instead of `t_8`, and the transpose and fp8 cast are recomputed in the backward graph.

```
 ===== Forward graph 0 =====
 /data/users/hirsheybar/checkout2/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "bf16[64, 64][64, 1]cuda:0", primals_2: "bf16[64, 64][64, 1]cuda:0", primals_3: "bf16[64][1]cuda:0"):
         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6943 in forward, code: out = Fp8LinearFn.apply(
        abs_1: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(primals_1)
        view: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_1, [64, 1, 64]);  abs_1 = None
        amax: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view, [-1]);  view = None
        abs_2: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(primals_2)
        view_1: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_2, [64, 1, 64]);  abs_2 = None
        amax_1: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_1, [-1]);  view_1 = None
        _to_copy: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax, dtype = torch.float32);  amax = None
        clamp: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy, 1e-12);  _to_copy = None
        div: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp, 448.0);  clamp = None
        reciprocal: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div)
        view_2: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(primals_1, [64, 1, 64])
        view_3: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_2, [64, 1, 1, 64]);  view_2 = None
        slice_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal, 0, 0, 9223372036854775807);  reciprocal = None
        unsqueeze: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_1, 1);  slice_1 = None
        slice_2: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze, 2, 0, 9223372036854775807);  unsqueeze = None
        unsqueeze_1: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_2, 3);  slice_2 = None
        mul: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_3, unsqueeze_1);  view_3 = unsqueeze_1 = None
        view_4: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul, [64, 1, 64]);  mul = None
        view_5: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_4, [64, 64]);  view_4 = None
        _to_copy_1: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_5, dtype = torch.float8_e4m3fn);  view_5 = None
        _to_copy_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax_1, dtype = torch.float32)
        clamp_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_2, 1e-12);  _to_copy_2 = None
        div_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_1, 448.0);  clamp_1 = None
        reciprocal_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_1)
        view_6: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(primals_2, [64, 1, 64])
        view_7: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_6, [64, 1, 1, 64]);  view_6 = None
        slice_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_1, 0, 0, 9223372036854775807);  reciprocal_1 = None
        unsqueeze_2: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_3, 1);  slice_3 = None
        slice_4: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_2, 2, 0, 9223372036854775807);  unsqueeze_2 = None
        unsqueeze_3: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_4, 3);  slice_4 = None
        mul_1: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_7, unsqueeze_3);  view_7 = unsqueeze_3 = None
        view_8: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_1, [64, 1, 64]);  mul_1 = None
        view_9: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_8, [64, 64]);  view_8 = None
        _to_copy_3: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_9, dtype = torch.float8_e4m3fn);  view_9 = None
        t: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(div_1);  div_1 = None
        new_ones: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(div, [1, 1], pin_memory = False)
        new_ones_1: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(t, [1, 1], pin_memory = False)
        t_2: "f8e4m3fn[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(_to_copy_3);  _to_copy_3 = None
        t_3: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.t.default(new_ones_1);  new_ones_1 = None
        _scaled_mm: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._scaled_mm.default(_to_copy_1, t_2, new_ones, t_3, None, None, torch.bfloat16);  _to_copy_1 = t_2 = new_ones = t_3 = None
        view_10: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(_scaled_mm, [64, 1, 64]);  _scaled_mm = None
        view_11: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_10, [64, 1, 1, 64]);  view_10 = None
        slice_5: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(div, 0, 0, 9223372036854775807);  div = None
        unsqueeze_4: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_5, 1);  slice_5 = None
        slice_6: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_4, 2, 0, 9223372036854775807);  unsqueeze_4 = None
        unsqueeze_5: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_6, 3);  slice_6 = None
        mul_2: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_11, unsqueeze_5);  view_11 = unsqueeze_5 = None
        view_12: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_2, [64, 1, 64]);  mul_2 = None
        view_13: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_12, [64, 64]);  view_12 = None
        view_14: "f32[1, 64, 64][4096, 64, 1]cuda:0" = torch.ops.aten.view.default(view_13, [1, 64, 64]);  view_13 = None
        view_15: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.view.default(view_14, [1, 64, 64, 1]);  view_14 = None
        slice_7: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(t, 0, 0, 9223372036854775807);  t = None
        unsqueeze_6: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_7, 1);  slice_7 = None
        slice_8: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_6, 2, 0, 9223372036854775807);  unsqueeze_6 = None
        unsqueeze_7: "f32[1, 1, 64, 1][1, 64, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_8, 3);  slice_8 = None
        mul_3: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_15, unsqueeze_7);  view_15 = unsqueeze_7 = None
        view_16: "f32[64, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.view.default(mul_3, [64, 64, 1]);  mul_3 = None
        view_17: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_16, [64, 64]);  view_16 = None
        _to_copy_4: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_17, dtype = torch.bfloat16);  view_17 = None
        add: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.add.Tensor(_to_copy_4, primals_3);  _to_copy_4 = primals_3 = None
        t_5: "bf16[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(amax_1);  amax_1 = None
        view_21: "bf16[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.view.default(t_5, [1, 1, 64]);  t_5 = None
        amax_3: "bf16[1, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_21, [-1]);  view_21 = None
        unsqueeze_8: "bf16[1, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(amax_3, 1);  amax_3 = None

        # No stacktrace found for following nodes
        view_39: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(add, [64, 64]);  add = None
        return (view_39, primals_1, primals_2, unsqueeze_8)

INFO: TRACED GRAPH
 ===== Backward graph 0 =====
 <eval_with_key>.1 class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "bf16[64, 64][64, 1]cuda:0", primals_2: "bf16[64, 64][64, 1]cuda:0", unsqueeze_8: "bf16[1, 1, 1][1, 1, 1]cuda:0", tangents_1: "bf16[64, 64][64, 1]cuda:0"):
         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6946 in forward, code: out = out.unflatten(0, input.shape[:-1])
        view_19: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(tangents_1, [64, 64]);  tangents_1 = None

         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6943 in forward, code: out = Fp8LinearFn.apply(
        t_4: "bf16[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(primals_2);  primals_2 = None
        clone: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.clone.default(t_4, memory_format = torch.contiguous_format);  t_4 = None
        abs_3: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(view_19)
        view_20: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_3, [64, 1, 64]);  abs_3 = None
        amax_2: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_20, [-1]);  view_20 = None
        expand: "bf16[1, 64, 1][1, 0, 1]cuda:0" = torch.ops.aten.expand.default(unsqueeze_8, [1, 64, 1]);  unsqueeze_8 = None
        clone_1: "bf16[1, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
        view_22: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.view.default(clone_1, [64, 1]);  clone_1 = None
        _to_copy_5: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax_2, dtype = torch.float32);  amax_2 = None
        clamp_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_5, 1e-12);  _to_copy_5 = None
        div_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_2, 448.0);  clamp_2 = None
        reciprocal_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_2)
        view_23: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_19, [64, 1, 64])
        view_24: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_23, [64, 1, 1, 64]);  view_23 = None
        slice_9: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_2, 0, 0, 9223372036854775807);  reciprocal_2 = None
        unsqueeze_9: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_9, 1);  slice_9 = None
        slice_10: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_9, 2, 0, 9223372036854775807);  unsqueeze_9 = None
        unsqueeze_10: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_10, 3);  slice_10 = None
        mul_4: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_24, unsqueeze_10);  view_24 = unsqueeze_10 = None
        view_25: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_4, [64, 1, 64]);  mul_4 = None
        view_26: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_25, [64, 64]);  view_25 = None
        _to_copy_6: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_26, dtype = torch.float8_e4m3fn);  view_26 = None
        _to_copy_7: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(view_22, dtype = torch.float32);  view_22 = None
        clamp_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_7, 1e-12);  _to_copy_7 = None
        div_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_3, 448.0);  clamp_3 = None
        reciprocal_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_3)
        view_27: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(clone, [64, 1, 64]);  clone = None
        view_28: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_27, [64, 1, 1, 64]);  view_27 = None
        slice_11: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_3, 0, 0, 9223372036854775807);  reciprocal_3 = None
        unsqueeze_11: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_11, 1);  slice_11 = None
        slice_12: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_11, 2, 0, 9223372036854775807);  unsqueeze_11 = None
        unsqueeze_12: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_12, 3);  slice_12 = None
        mul_5: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_28, unsqueeze_12);  view_28 = unsqueeze_12 = None
        view_29: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_5, [64, 1, 64]);  mul_5 = None
        view_30: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_29, [64, 64]);  view_29 = None
        _to_copy_8: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_30, dtype = torch.float8_e4m3fn);  view_30 = None
        t_6: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(div_3);  div_3 = None
        new_ones_2: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(div_2, [1, 1], pin_memory = False)
        new_ones_3: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(t_6, [1, 1], pin_memory = False)
        t_8: "f8e4m3fn[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(_to_copy_8);  _to_copy_8 = None
        t_9: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.t.default(new_ones_3);  new_ones_3 = None
        _scaled_mm_1: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._scaled_mm.default(_to_copy_6, t_8, new_ones_2, t_9, None, None, torch.bfloat16);  _to_copy_6 = t_8 = new_ones_2 = t_9 = None
        view_31: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(_scaled_mm_1, [64, 1, 64]);  _scaled_mm_1 = None
        view_32: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_31, [64, 1, 1, 64]);  view_31 = None
        slice_13: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(div_2, 0, 0, 9223372036854775807);  div_2 = None
        unsqueeze_13: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_13, 1);  slice_13 = None
        slice_14: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_13, 2, 0, 9223372036854775807);  unsqueeze_13 = None
        unsqueeze_14: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_14, 3);  slice_14 = None
        mul_6: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_32, unsqueeze_14);  view_32 = unsqueeze_14 = None
        view_33: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_6, [64, 1, 64]);  mul_6 = None
        view_34: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_33, [64, 64]);  view_33 = None
        view_35: "f32[1, 64, 64][4096, 64, 1]cuda:0" = torch.ops.aten.view.default(view_34, [1, 64, 64]);  view_34 = None
        view_36: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.view.default(view_35, [1, 64, 64, 1]);  view_35 = None
        slice_15: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(t_6, 0, 0, 9223372036854775807);  t_6 = None
        unsqueeze_15: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_15, 1);  slice_15 = None
        slice_16: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_15, 2, 0, 9223372036854775807);  unsqueeze_15 = None
        unsqueeze_16: "f32[1, 1, 64, 1][1, 64, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_16, 3);  slice_16 = None
        mul_7: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_36, unsqueeze_16);  view_36 = unsqueeze_16 = None
        view_37: "f32[64, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.view.default(mul_7, [64, 64, 1]);  mul_7 = None
        view_38: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_37, [64, 64]);  view_37 = None
        _to_copy_9: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_38, dtype = torch.bfloat16);  view_38 = None
        t_10: "bf16[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(view_19)
        mm: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.mm.default(t_10, primals_1);  t_10 = primals_1 = None
        sum_1: "bf16[64][1]cuda:0" = torch.ops.aten.sum.dim_IntList(view_19, [0]);  view_19 = None
        return (_to_copy_9, mm, sum_1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148922
Approved by: https://github.com/zou3519
2025-03-18 20:08:11 +00:00
b8c0c50bbe Release.md readability improvements (#149402)
Improves a bunch of readability/grammatical issues with release.md.

Note: This was a Claude Code experiment, with all changes automatically generated. It turns out that minor edits like this are _not_ a good use of Claude Code, since it asked for approval on every single changed line. It is probably far more efficient to toss the entire thing into a simple LLM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149402
Approved by: https://github.com/atalman
2025-03-18 20:04:56 +00:00
dfdf58f8cb [ROCm] enable CK backend for bf16/fp16 on gfx11 (#143971)
This change enables the CK backend for fp16 on gfx11.
@jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143971
Approved by: https://github.com/jeffdaily
2025-03-18 18:18:22 +00:00
e0e8639a10 [torchbench] fix dynamic_shapes spec for moco (#148772)
Fixes https://github.com/pytorch/pytorch/issues/148333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148772
Approved by: https://github.com/yushangdi, https://github.com/desertfire
2025-03-18 18:16:54 +00:00
dbea13ed45 [ROCm][TunableOp] Minor fix to BLAS logging for ScaledGEMM with no bias vector. (#149357)
Omit the bias type argument for BLAS logging when there is a ScaledGEMM with no bias vector.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149357
Approved by: https://github.com/jeffdaily
2025-03-18 18:14:52 +00:00
c0566e0dbf [ROCm] Fixes and improvements to CUDA->HIP flag conversion for CPP extensions (#149245)
Fixes https://github.com/ROCm/hip/issues/3764.

Fixes and improvements to CUDA->HIP flag conversion for CPP extensions

- Log flag conversion for debugging purposes.
- Fix cases where it should not touch the -I flags or cases where CUDA appears more than once by replacing only the first instance.
- Fix case where nvcc key may not exist
- Fix case where hipify should ignore flag values and only touch the flag itself
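
The rules in the list above can be illustrated with a minimal sketch. This is not the code from the PR: `convert_flag` and `convert_nvcc_flags` are hypothetical helpers, and the assumption that flags arrive as a dict with an optional `nvcc` list is made only for illustration.

```python
def convert_flag(flag: str) -> str:
    """Hypothetical helper: rewrite one compiler flag from CUDA to HIP naming."""
    # Leave include paths alone: a directory name may legitimately contain "CUDA".
    if flag.startswith("-I") or "CUDA" not in flag:
        return flag
    # Split off the value so that only the flag name is rewritten.
    name, sep, value = flag.partition("=")
    # Replace only the first occurrence of "CUDA" in the flag name.
    return name.replace("CUDA", "HIP", 1) + sep + value


def convert_nvcc_flags(extra_compile_args: dict) -> list:
    """Hypothetical helper: convert the nvcc flags, tolerating a missing key."""
    return [convert_flag(f) for f in extra_compile_args.get("nvcc", [])]


flags = {"nvcc": ["-DUSE_CUDA=CUDA_ON", "-I/opt/CUDA/include", "-DCUDA_HAS_CUDA", "-O3"]}
converted = convert_nvcc_flags(flags)
for before, after in zip(flags["nvcc"], converted):
    print(f"{before} -> {after}")  # logging each conversion helps debugging
```

In this sketch the flag value (`CUDA_ON`) and the `-I` path are left untouched, and a flag that mentions CUDA twice has only its first occurrence rewritten.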

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149245
Approved by: https://github.com/jeffdaily

Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
2025-03-18 18:01:07 +00:00
585fd972b8 Iterate over dense dim first in split reduction reindexing (#147229)
Fix for https://github.com/pytorch/pytorch/issues/144431.

Improves perf from 0.29963893827160504 -> 0.0396331632970453.

In split reductions, we view an input tensor as a single dimension, then reduce over it. When we are reducing over a tensor which has a dimension other than the last dimension as the dense dimension, we should iterate over the dense dimension first in our re-indexing.
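
To see why the iteration order matters, here is a small pure-Python sketch (not Inductor code; `flat_offsets` and the example shape and strides are assumptions chosen for illustration). It lists the memory offsets visited when a 2-D tensor whose dense (stride-1) dimension is dim 0 is flattened in two different loop orders.

```python
from itertools import product


def flat_offsets(shape, strides, order):
    """Memory offsets visited when dims are nested in `order` (outermost first)."""
    ranges = [range(shape[d]) for d in order]
    offsets = []
    for idx in product(*ranges):
        offsets.append(sum(i * strides[d] for i, d in zip(idx, order)))
    return offsets


# A 4x3 tensor whose dense dimension is dim 0 rather than the last dimension.
shape, strides = (4, 3), (1, 4)

# Last dimension varies fastest: consecutive accesses jump by 4 elements.
naive = flat_offsets(shape, strides, order=(0, 1))

# Dense dimension varies fastest: consecutive accesses are contiguous.
dense_first = flat_offsets(shape, strides, order=(1, 0))

print(naive)
print(dense_first)
```

With the dense dimension innermost, the flattened index walks memory contiguously, which is the access pattern the re-indexing is meant to produce.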

This PR also gives evidence for the general need for reduction tiling, e.g. so that cooperative reductions can handle this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147229
Approved by: https://github.com/jansel
2025-03-18 17:35:21 +00:00
ee3a2c6ee2 [State_dict] Remove functools.cache and add unit test (#149354)
Fixes https://github.com/pytorch/pytorch/issues/149100

`@functools.cache` keeps `self` alive, leading to unexpected memory usage (e.g. in the linked issue, the model's memory stays occupied after the model is deleted).
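
The effect described above can be reproduced with the standard library alone. The following is a minimal sketch; the classes `Cached` and `Uncached` and the helper `is_freed_after_del` are made up for illustration and are unrelated to the state_dict code.

```python
import functools
import gc
import weakref


class Cached:
    @functools.cache  # the cache lives on the function and stores `self` as a key
    def value(self):
        return 42


class Uncached:
    def value(self):
        return 42


def is_freed_after_del(cls):
    """Return True if an instance is garbage collected once its last name is deleted."""
    obj = cls()
    obj.value()
    ref = weakref.ref(obj)
    del obj
    gc.collect()
    return ref() is None


cached_freed = is_freed_after_del(Cached)
uncached_freed = is_freed_after_del(Uncached)
print(cached_freed, uncached_freed)
```

Because the cache holds a strong reference to every `self` it has seen, the `Cached` instance survives `del`, while the `Uncached` instance is released.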

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149354
Approved by: https://github.com/fegin
2025-03-18 17:30:41 +00:00
5b8cc4709a [FSDP2] Add set_reshard_after_forward (#149103)
Fixes https://github.com/pytorch/pytorch/issues/149029

Add `set_reshard_after_forward`, which sets `post_forward_mesh_info` and thereby determines `_reshard_after_forward`.

Add a unit test similar to `test_fully_shard_communication_count`: after calling `set_reshard_after_forward(True)`, the FSDPModule behaves as if `_reshard_after_forward=True`, and likewise when set to False.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149103
Approved by: https://github.com/awgu
2025-03-18 17:21:54 +00:00
a8df5e5af9 [dynamo] Add mem leak test (#149358)
Test for https://github.com/pytorch/pytorch/pull/148480

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149358
Approved by: https://github.com/malfet
2025-03-18 16:38:28 +00:00
d5b1d99f78 Enable more nightly tests on s390x (#148452)
Also enable some tests which probably were accidentally disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148452
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-03-18 16:09:39 +00:00
381d0cb239 [DCP] Avoid in-place update and deepcopy during dedupe (#149320)
Summary:
Avoid in-place update and deepcopy during dedupe. Deepcopy becomes prohibitively expensive with models that have a huge number of FQNs. This manifested in the Ads 2K experiment as well. Here are the results from the TextRay model in Mitra:

#### Control job with deepcopy regression:
First save is ~24.8s
Global step latency is ~7-8s

#### Test job with the new fix to avoid deepcopy:
First save is ~21s
Global step latency is ~2s
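
As a rough illustration of deduplicating without `deepcopy` or in-place mutation, the sketch below builds new plan objects instead of copying and editing the inputs. It is a simplification under stated assumptions, not the DCP planner code: `Plan` is a hypothetical stand-in for a per-rank save plan, and "first plan wins" is an assumed policy.

```python
import dataclasses
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical stand-in for a per-rank save plan."""
    items: tuple


def dedupe(plans):
    """Keep each key only in the first plan that lists it; inputs are not mutated."""
    seen = set()
    result = []
    for plan in plans:
        kept = tuple(k for k in plan.items if k not in seen)
        seen.update(kept)
        # Build a shallow replacement rather than deep-copying and editing in place.
        result.append(dataclasses.replace(plan, items=kept))
    return result


plans = [Plan(("a", "b")), Plan(("b", "c"))]
deduped = dedupe(plans)
print(deduped)
```

The cost of this approach scales with the number of items, since only the item containers are rebuilt and the items themselves are shared.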

Test Plan:
```
buck test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_planner
```
https://www.internalfb.com/intern/testinfra/testrun/3940649945104822

Differential Revision: D71245218

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149320
Approved by: https://github.com/MeetVadakkanchery
2025-03-18 16:08:40 +00:00
c41196a4d0 [EZ][Docker] Remove install_db.sh (#149360)
It is a vestige of the caffe2 days and has been a no-op since https://github.com/pytorch/pytorch/pull/125092

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149360
Approved by: https://github.com/atalman, https://github.com/cyyever, https://github.com/seemethere, https://github.com/Skylion007
2025-03-18 16:07:47 +00:00
fdacf3c920 [ONNX] Update types in VerificationInfo (#149377)
torch.types.Number was rendered as is in the documentation and can be confusing. We write the original types instead to reduce confusion for users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149377
Approved by: https://github.com/titaiwangms
2025-03-18 15:37:39 +00:00
405025778d Revert "[AOTI] Update test runner to use the new APIs (#147105)"
This reverts commit 9a78513c3cb21a5f506135e2a56f967cf1fddc60.

Reverted https://github.com/pytorch/pytorch/pull/147105 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147105#issuecomment-2733656413))
2025-03-18 15:25:40 +00:00
5ba437fb45 Revert "[AOTI] Forward fix unit test failures (#149401)"
This reverts commit ec9e11145e1a86300aae0fe09a1d8917d21deba1.

Reverted https://github.com/pytorch/pytorch/pull/149401 on behalf of https://github.com/desertfire due to reverting the original PR instead ([comment](https://github.com/pytorch/pytorch/pull/149401#issuecomment-2733633516))
2025-03-18 15:18:48 +00:00
213eea216a [MTIA] Add _mtia_maybeExchangeDevice to MTIA module (#149340)
Summary: The FlexAttention path uses `_maybe_exchange_device`, so it will be needed eventually for MTIA as well.

Test Plan: `buck2 test fbcode//mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- test_maybe_exchange_device`

Reviewed By: chaos5958

Differential Revision: D70072063

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149340
Approved by: https://github.com/chaos5958
2025-03-18 15:15:12 +00:00
ec9e11145e [AOTI] Forward fix unit test failures (#149401)
Summary: There is a land conflict between https://github.com/pytorch/pytorch/pull/149161 and https://github.com/pytorch/pytorch/pull/147105. We just need to update the APIs used in two new unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149401
Approved by: https://github.com/ZainRizvi
2025-03-18 15:02:01 +00:00
6e2b2660b9 Make numpy check optional (#149356)
We may want to skip the numpy smoke tests, so this makes the check optional.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149356
Approved by: https://github.com/ZainRizvi
2025-03-18 15:00:01 +00:00
bc88f6faa1 Use TorchVersion for triton version check (#149136)
Followup after https://github.com/pytorch/pytorch/pull/149092#issuecomment-2721990321
To use TorchVersion for triton version parsing
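
For illustration only, a minimal stdlib sketch of why a version check should go through a parsed version object rather than a raw string comparison. The helpers `parse_release` and `at_least` are hypothetical names for this sketch and are not the TorchVersion API.

```python
import re


def parse_release(version: str) -> tuple[int, ...]:
    """Return the leading numeric release segment as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric release segment in {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def at_least(installed: str, required: str) -> bool:
    """Compare release segments numerically, padding the shorter one with zeros."""
    a, b = parse_release(installed), parse_release(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# A plain string comparison orders "3.10.0" before "3.9";
# the parsed comparison does not.
print("3.10.0" >= "3.9", at_least("3.10.0", "3.9"))
```

Local suffixes such as `+git1234` are ignored by this sketch; a real version class handles them more carefully.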

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149136
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-18 13:48:46 +00:00
b06b5c3e27 [ROCm] Use alternate mirror for drm repo (#149380)
Fixes an issue with building ROCm manywheel and libtorch images, e.g. https://github.com/pytorch/pytorch/actions/runs/13887711267/job/38854659005#step:4:8328

```
#53 2.832 Cloning into 'drm'...
#53 2.849 fatal: unable to access 'https://gitlab.freedesktop.org/mesa/drm.git/': The requested URL returned error: 503
#53 2.851 ./install_rocm_drm.sh: line 29: pushd: drm: No such file or directory
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149380
Approved by: https://github.com/jeffdaily
2025-03-18 13:33:25 +00:00
6055a4f612 refresh benchmarks results. (#149347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149347
Approved by: https://github.com/jamesjwu
2025-03-18 08:53:49 +00:00
9b92828d4b Add batch dim sharding rule to sdpa (#149253)
This is a trivial rule that for most cases isn't needed, but if we want to consider that the input data is actually `Shard(0)` (instead of `Replicated()` as it is currently assumed), then we need this rule.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149253
Approved by: https://github.com/XilunWu
2025-03-18 07:54:02 +00:00
9cd52da45c [MPS/inductor] Add support for modified_bessel_i1. (#149379)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149379
Approved by: https://github.com/malfet
2025-03-18 06:02:33 +00:00
6c2db8fab0 Enable qint8 and quint8 add for AArch64 using ACL directly (#148653)
This enables qint8 and quint8 add for AArch64 through Arm Compute Library (ACL) directly.
Relative performance improvement using OMP_NUM_THREADS=1 is ~15x, using OMP_NUM_THREADS=32 it’s ~5.4x.

Co-authored-by: David Svantesson <david.svantesson-yeung@arm.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148653
Approved by: https://github.com/malfet
ghstack dependencies: #148585
2025-03-18 05:38:39 +00:00
2e0c98ff05 [MPS] Add bicubic2d_aa (#149378)
Which is currently the most frequently requested op in https://github.com/pytorch/pytorch/issues/141287

Mostly done by refactoring `upsample_bilinear2d_aa` to accept Functor as one of the template arguments, which closely follows ideas from eec43cfbc0/src/libImaging/Resample.c as well as
bb42e4d137/aten/src/ATen/native/cuda/UpSampleBilinear2d.cu (L472-L478)

Populate unit tests by copying upsample_bilinear_2d_aa and reusing it as upsample_bicubic2d_aa

At that point, the only differences between upsample_bilinear2d_aa and upsample_bicubic2d_aa are the convolution kernel function and size: for bilinear it's 3x3, for bicubic it's 5x5
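
As a rough illustration of the "kernel function and size" difference, here is a Python sketch of the two 1-D filter functions in their textbook form (triangle filter and Keys cubic convolution). The coefficient `a = -0.5` and the function names are assumptions of this sketch, not taken from the Metal implementation in this PR.

```python
def bilinear_filter(x: float) -> float:
    """Triangle filter: non-zero only for |x| < 1."""
    x = abs(x)
    return 1.0 - x if x < 1.0 else 0.0


def bicubic_filter(x: float, a: float = -0.5) -> float:
    """Keys cubic convolution filter: non-zero only for |x| < 2."""
    x = abs(x)
    if x < 1.0:
        return ((a + 2.0) * x - (a + 3.0)) * x * x + 1.0
    if x < 2.0:
        return (((x - 5.0) * x + 8.0) * x - 4.0) * a
    return 0.0


# The bicubic filter has twice the support radius and negative lobes.
print(bilinear_filter(0.5), bicubic_filter(0.5), bicubic_filter(1.5))
```

Because only the filter function and its support change, passing the filter in as a functor lets both ops share the same sampling loop.
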
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149378
Approved by: https://github.com/dcci
2025-03-18 05:35:41 +00:00
dea7157160 nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort (#149351)
Fixes #149153

Yaml generated from:

```
python .github/scripts/generate_ci_workflows.py
```

Test plan:

Repro in https://gist.github.com/d4l3k/16a19b475952bc40ddd7f2febcc297b7

```
rm -rf third_party/nccl
python setup.py develop
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149351
Approved by: https://github.com/kwen2501, https://github.com/atalman, https://github.com/malfet
2025-03-18 05:23:18 +00:00
b8f91bcb14 [pt2_provenance_tracking] add support for cpp kernel (#149185)
Summary:
As title.

Add inductor cpp kernel to post grad graph node mapping
& UT.

Context:
Raised as a feature request for AOTI CPU case.

https://fb.workplace.com/groups/1028545332188949/permalink/1169020841474730/

Differential Revision: D71181284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149185
Approved by: https://github.com/jingsh
2025-03-18 04:43:07 +00:00
7869196482 Fix torchbind schema str generation (#149239)
Summary: Fix Torchbind HOP schema generation when there's no input

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r schema
```

Differential Revision: D71231164

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149239
Approved by: https://github.com/zou3519
2025-03-18 04:29:56 +00:00
bca75fe97a [MAIA] [Autocast] Enable autocast on MAIA device (#148511)
Fixes #148510.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148511
Approved by: https://github.com/albanD
2025-03-18 03:46:22 +00:00
c43e35d6f7 [MPS] Implement support for modified_bessel_i1 in eager. (#149368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149368
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-18 03:29:10 +00:00
bb42e4d137 [AOTInductor] Add function to free buffer (#149161)
Summary:
We add a function that allows users to free the unused buffer.

Test Plan:
Testing correctness:
    python test/inductor/test_aot_inductor.py -k free_inactive

    Testing memory consumption:
    LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
    /home/$USER/local/pytorch/build/bin/test_aoti_inference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149161
Approved by: https://github.com/chenyang78, https://github.com/desertfire
ghstack dependencies: #149249
2025-03-18 02:43:14 +00:00
cccdf860e2 [BE] Add STABLE_LIBRARY test for multiple returns (#149230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149230
Approved by: https://github.com/albanD, https://github.com/zou3519
ghstack dependencies: #149052
2025-03-18 02:40:54 +00:00
988827cdfb Use schema as source of truth + support ones_like/empty_like (#149052)
This change does 2 important things:
(a) Instead of relying on the IValue type as the source of truth, we use the schema as the source of truth, which is important because IValue types are overloaded and can convert ambiguously or incorrectly. For example, a MemoryFormat will look like an int and get converted to an int64_t instead of a MemoryFormat!

(b) This PR expands support for many more types to encompass way more schemas, e.g., Optional, Device, dtype, etc. The main win from this PR is the ability for aoti_torch_call_dispatcher to call TensorFactory ops like ones_like/empty_like!
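
The ambiguity in (a) can be sketched in plain Python. Everything below is a hypothetical stand-in (the `MemoryFormat` enum, the two converter functions, and the type-name strings); it is not the shim code, only an illustration of value-driven versus schema-driven conversion.

```python
from enum import IntEnum


class MemoryFormat(IntEnum):
    """Hypothetical stand-in for an enum that is stored as an integer."""
    CONTIGUOUS = 0
    CHANNELS_LAST = 2


def convert_by_value(value):
    """Value-driven: an integer-backed enum passes the int check, so its identity is lost."""
    if isinstance(value, int):
        return ("int64", int(value))
    return ("unknown", value)


def convert_by_schema(value, declared_type: str):
    """Schema-driven: the declared argument type decides the conversion."""
    if declared_type == "MemoryFormat":
        return ("MemoryFormat", MemoryFormat(value))
    if declared_type == "int":
        return ("int64", int(value))
    raise TypeError(f"unsupported declared type: {declared_type}")


print(convert_by_value(MemoryFormat.CHANNELS_LAST))
print(convert_by_schema(2, "MemoryFormat"))
```

With the declared type available, the same raw integer can be routed to the right conversion without guessing from the value.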

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149052
Approved by: https://github.com/albanD
2025-03-18 02:40:54 +00:00
ebabd0efdd [ONNX] Expose verification utilities (#148603)
Expose verification utilities to public documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148603
Approved by: https://github.com/titaiwangms
2025-03-18 02:10:34 +00:00
c36ac16da1 [Inductor] optimize welford reduction (#145061)
Fix https://github.com/pytorch/pytorch/issues/141541.
Fix https://github.com/pytorch/pytorch/issues/142839.
Fix https://github.com/pytorch/pytorch/issues/143182.

**Summary:**
To fix the insufficient accuracy of the Welford reduction, we follow the eager implementation and combine the Welford algorithm with cascade summation to improve numerical stability. Specifically:
1. Use the Welford algorithm to compute mean and variance.
2. Use cascade summation when computing the sum over the input for both mean and variance.

I tested the Inductor benchmark with this PR on CPU; no performance gains or regressions were seen.
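
For readers unfamiliar with the building blocks, a small pure-Python sketch of the Welford update and of combining partial results chunk by chunk. This is not the Inductor kernel: the chunked loop is a single-level simplification standing in for the cascade idea, and all names are hypothetical.

```python
def welford_update(count, mean, m2, x):
    """Fold one value into a running (count, mean, m2) state."""
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)
    return count, mean, m2


def welford_combine(a, b):
    """Merge two (count, mean, m2) states computed on disjoint data."""
    (na, mean_a, m2_a), (nb, mean_b, m2_b) = a, b
    n = na + nb
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2_a + m2_b + delta * delta * na * nb / n
    return n, mean, m2


def chunked_welford(values, chunk=4):
    """Accumulate each chunk separately, then merge the partial states."""
    total = (0, 0.0, 0.0)
    for start in range(0, len(values), chunk):
        part = (0, 0.0, 0.0)
        for x in values[start:start + chunk]:
            part = welford_update(*part, x)
        total = welford_combine(total, part)
    return total


n, mean, m2 = chunked_welford([float(v) for v in range(1, 9)], chunk=3)
print(n, mean, m2 / n)  # count, mean, population variance
```

Keeping partial accumulators short limits how much rounding error any single running sum can collect before it is merged.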

**Example:**
Take https://github.com/pytorch/pytorch/issues/141541 as an example:
```
import torch
import torch.nn as nn
torch.manual_seed(0)

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.gn = nn.GroupNorm(num_groups=32, num_channels=32)

    def forward(self, x):
        return self.gn(x)

model = Model().eval()
c_model = torch.compile(model)
x = torch.randn(1, 32, 128, 128, 128)

with torch.no_grad():
    output = model(x)
    c_output = c_model(x)

print(torch.max(torch.abs(output - c_output)))
print(torch.allclose(output, c_output, 1.3e-6, 1e-5))
```
**logs**

- before
```
tensor(7.0095e-05)
False
```
- After
```
tensor(9.5367e-07)
True
```

- on CUDA
```
tensor(1.4305e-06, device='cuda:0', grad_fn=<MaxBackward1>)
True
```

**Generated code:**
- before
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                Welford<float> tmp_acc0 = Welford<float>();
                Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(131072L));
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L))
                {
                    {
                        if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L)))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16));
                            tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                        }
                    }
                }
                tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                out_ptr0[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.mean);
                out_ptr1[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.m2);
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L))
            {
                {
                    if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16));
                        auto tmp1 = out_ptr0[static_cast<int64_t>(x0)];
                        auto tmp4 = out_ptr1[static_cast<int64_t>(x0)];
                        auto tmp12 = in_ptr1[static_cast<int64_t>(x0)];
                        auto tmp15 = in_ptr2[static_cast<int64_t>(x0)];
                        auto tmp2 = at::vec::Vectorized<float>(tmp1);
                        auto tmp3 = tmp0 - tmp2;
                        auto tmp5 = static_cast<float>(2097152.0);
                        auto tmp6 = tmp4 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                        auto tmp9 = 1 / std::sqrt(tmp8);
                        auto tmp10 = at::vec::Vectorized<float>(tmp9);
                        auto tmp11 = tmp3 * tmp10;
                        auto tmp13 = at::vec::Vectorized<float>(tmp12);
                        auto tmp14 = tmp11 * tmp13;
                        auto tmp16 = at::vec::Vectorized<float>(tmp15);
                        auto tmp17 = tmp14 + tmp16;
                        tmp17.store(out_ptr2 + static_cast<int64_t>(x1 + 2097152L*x0));
                    }
                }
            }
        }
    }
}
''')
```
- After
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/ln/clnlak27xpvmq3klpqyj6xzyq2thf4ecrezve5ddy4f4xaz4sb7w.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                Welford<float> tmp_acc0 = Welford<float>();
                Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                WelfordHelper<at::vec::Vectorized<float>> welford_helper0(static_cast<int64_t>(131072L));
                static WelfordHelper<at::vec::Vectorized<float>> masked_welford_helper0(static_cast<int64_t>(0L));
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L))
                {
                    {
                        if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L)))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16));
                            tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &welford_helper0);
                        }
                    }
                }
                tmp_acc0_vec = welford_combine(tmp_acc0_vec, &welford_helper0);
                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, &masked_welford_helper0);
                tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                out_ptr0[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.mean);
                out_ptr1[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.m2);
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L))
            {
                {
                    if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16));
                        auto tmp1 = out_ptr0[static_cast<int64_t>(x0)];
                        auto tmp4 = out_ptr1[static_cast<int64_t>(x0)];
                        auto tmp12 = in_ptr1[static_cast<int64_t>(x0)];
                        auto tmp15 = in_ptr2[static_cast<int64_t>(x0)];
                        auto tmp2 = at::vec::Vectorized<float>(tmp1);
                        auto tmp3 = tmp0 - tmp2;
                        auto tmp5 = static_cast<float>(2097152.0);
                        auto tmp6 = tmp4 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                        auto tmp9 = 1 / std::sqrt(tmp8);
                        auto tmp10 = at::vec::Vectorized<float>(tmp9);
                        auto tmp11 = tmp3 * tmp10;
                        auto tmp13 = at::vec::Vectorized<float>(tmp12);
                        auto tmp14 = tmp11 * tmp13;
                        auto tmp16 = at::vec::Vectorized<float>(tmp15);
                        auto tmp17 = tmp14 + tmp16;
                        tmp17.store(out_ptr2 + static_cast<int64_t>(x1 + 2097152L*x0));
                    }
                }
            }
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145061
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2025-03-18 02:05:35 +00:00
cyy
1096443467 Use torch_compile_options for c10 libraries (#147821)
c10, c10_cuda, c10_hip and c10_xpu are given additional compile options by torch_compile_options, which are more restrictive and can help reveal potential bugs inside the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147821
Approved by: https://github.com/guangyey, https://github.com/malfet
2025-03-18 01:54:23 +00:00
60523540f1 Force build to conform C++ standard on windows by adding /permissive- flag (#149035)
Fixes #147366

1. Add `/permissive-` to the `torch_compile_options` for the build to conform to the C++ standard.
2. Fix the error when trying to assign a string literal to a non-const ptr.

The `/permissive-` flag can be found at https://learn.microsoft.com/en-us/cpp/build/reference/permissive-standards-conformance?view=msvc-170

From the above [doc](https://learn.microsoft.com/en-us/cpp/build/reference/permissive-standards-conformance?view=msvc-170#remarks),
>  By default, the /permissive- option is set in new projects created by Visual Studio 2017 version 15.5 and later versions.
> The /permissive- option is implicitly set by the /std:c++latest option starting in Visual Studio 2019 version 16.8, and in version 16.11 by the /std:c++20 option.

Thus, it is reasonable to add this flag to the existing project.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149035
Approved by: https://github.com/guangyey, https://github.com/malfet
2025-03-18 01:51:46 +00:00
c1dd75e4dc Add AOTI shim for _weight_int4pack_mm_cpu_tensor (#149031)
**Summary**
The previous implementation of the shim did not align with the design and was removed by https://github.com/pytorch/pytorch/pull/148907
This PR adds it back in the files of the MKLDNN backend and re-enables the CPP wrapper UT.

**Test plan**
```
pytest -s test/inductor/test_cpu_cpp_wrapper.py -k test_woq_int4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149031
Approved by: https://github.com/leslie-fang-intel, https://github.com/EikanWang, https://github.com/desertfire
2025-03-18 01:33:13 +00:00
cyy
425c6d8eba Replace c10::is_pod with std::is_trivial (#149286)
These remaining c10::is_pod calls can be replaced without compromising the semantics.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149286
Approved by: https://github.com/zou3519
2025-03-18 01:33:01 +00:00
f9a787224c [dynamo][guards][serialization] Dont use ID_MATCH guard for bool and None (#149228)
Doing this removes the need to collect the `id` and therefore facilitates serialization. It also improves the readability of recompilation messages: previously, the recompile message would just show the `id`.
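
A toy sketch of the difference, assuming hypothetical helper names and message formats (these are not dynamo's actual guard classes or recompile messages): an id-based guard has to record a process-specific integer, while a guard on `None` or a bool can be written as a constant identity check.

```python
def id_guard(expected):
    """Guard that records id(expected); the description exposes only an opaque integer."""
    expected_id = id(expected)
    description = f"id(x) == {expected_id}"
    return (lambda value: id(value) == expected_id), description


def constant_guard(expected):
    """Guard for None/True/False that needs no recorded id."""
    if expected is not None and not isinstance(expected, bool):
        raise TypeError("constant_guard only handles None and bool")
    description = f"x is {expected!r}"
    return (lambda value: value is expected), description


check, description = constant_guard(None)
print(check(None), check(False), description)
```

The second form serializes as plain text and reads naturally when it appears in a failure message.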

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149228
Approved by: https://github.com/jansel
2025-03-18 01:25:37 +00:00
186cc7327c [MPS/BE] Remove decorator that skipped test on macOS 12. (#149365)
macOS 12 is not really supported anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149365
Approved by: https://github.com/malfet
2025-03-18 00:58:08 +00:00
a0ac63cbd9 [BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257
Approved by: https://github.com/jansel
2025-03-18 00:46:07 +00:00
811f587d86 [MPS/BE] @parametrize generation of pointwise_ops. (#149363)
Makes this less error-prone and reduces duplication.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149363
Approved by: https://github.com/malfet
2025-03-18 00:37:43 +00:00
9a78513c3c [AOTI] Update test runner to use the new APIs (#147105)
Summary: Switch to the newer aoti_compile_and_package APIs. Some tests still use the legacy APIs; internal test refactoring will follow up on those.

Differential Revision: [D69609685](https://our.internmc.facebook.com/intern/diff/D69609685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147105
Approved by: https://github.com/jingsh
2025-03-18 00:27:09 +00:00
b52a8bef01 Revert "[dynamo][guards][serialization] Dont use ID_MATCH guard for bool and None (#149228)"
This reverts commit 5905bbe745b0acb4909243c93014c0e6f3512c2d.

Reverted https://github.com/pytorch/pytorch/pull/149228 on behalf of https://github.com/malfet due to I wonder if this will fix the pr-time-benchmark regressions ([comment](https://github.com/pytorch/pytorch/pull/149228#issuecomment-2731237949))
2025-03-18 00:10:50 +00:00
46226a90c8 [EZ][BE] Remove cross-compilation options from mac-build.yml (#149237)
It has long been gone
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149237
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-03-17 23:50:31 +00:00
523bffd388 cd: Add no-cache for test binaries (#149218)
This is to make it so that we don't experience issues like https://github.com/pytorch/vision/actions/runs/13861462856/job/38795684317#step:13:212

```
ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
    unknown package:
        Expected sha256 8e34a6f02ac5a63763251953063a19ba9df855ac2c8a13ef409dfef708e2ba26
             Got        341156cc5067488565c1e103be6e95105b0fc0d87d8ac24ff8891f63fd33216f
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149218
Approved by: https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/malfet
2025-03-17 23:26:20 +00:00
37c914ca0c fix simple-spec crash (#147723)
Found an issue while running `python torchgen/fuse/gen_patterns.py`

exact error:
```shell
Traceback (most recent call last):
  File "/Users/mayankmishra/Desktop/non-IBM/pytorch/torchgen/fuse/gen_patterns.py", line 19, in <module>
    joint_graph.lazy_init()
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 2096, in lazy_init
    result = fn()
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/fx_passes/joint_graph.py", line 53, in lazy_init
    _pad_mm_init()
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/fx_passes/pad_mm.py", line 905, in _pad_mm_init
    gen_register_replacement(
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1584, in gen_register_replacement
    pat = _serialize_pattern(
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1539, in _serialize_pattern
    file_template = get_file_template()
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1513, in get_file_template
    if isinstance(attr, type) and issubclass(attr, (PatternExpr, _TargetExpr)):
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/abc.py", line 123, in __subclasscheck__
    return _abc_subclasscheck(cls, subclass)
TypeError: issubclass() arg 1 must be a class
```

This PR fixes this issue.
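
For context, one way this kind of `TypeError` can arise (an assumption here, not confirmed by the PR) is that on some Python versions a parameterized generic such as `list[int]` passes an `isinstance(attr, type)` check but is still rejected by `issubclass`. A defensive check can be sketched as follows; `PatternBase`, `ConcretePattern` and `is_subclass_safe` are hypothetical names, and this is not necessarily the fix applied in the PR.

```python
from abc import ABC


class PatternBase(ABC):
    """Hypothetical stand-in for an ABC-based pattern class."""


class ConcretePattern(PatternBase):
    pass


def is_subclass_safe(candidate, bases) -> bool:
    """Return True only for real classes derived from `bases`, never raising."""
    if not isinstance(candidate, type):
        return False
    try:
        return issubclass(candidate, bases)
    except TypeError:
        # Reached when `candidate` looks like a type but is not a class.
        return False


print(is_subclass_safe(ConcretePattern, (PatternBase,)))
print(is_subclass_safe(list[int], (PatternBase,)))
```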

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147723
Approved by: https://github.com/aorenste

Co-authored-by: Aaron Orenstein <aorenste@meta.com>
2025-03-17 23:25:48 +00:00
78715a181f Convert Tensor lr to 0-dim as needed for the optimizer to normally work (#145674)
Fixes #145461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145674
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-17 23:07:05 +00:00
1157367c78 [AOTInductor] [BE] Add macro for loading symbols in aoti runner (#149249)
Summary:
Add macro for loading symbols in aoti runner

Test Plan:
Existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149249
Approved by: https://github.com/chenyang78
2025-03-17 23:02:01 +00:00
24cfeec2c7 Revert "[BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257)"
This reverts commit bfee141666319c80b6c5284394905beef8682515.

Reverted https://github.com/pytorch/pytorch/pull/149257 on behalf of https://github.com/malfet due to Let's see if it helps restore compiler benchmark sanity, see 8bc7bd94a5/1 ([comment](https://github.com/pytorch/pytorch/pull/149257#issuecomment-2731133812))
2025-03-17 22:57:00 +00:00
afa1eda901 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit ef6296e7f20d744a0cfed81cab573d60204e7626.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/izaitsevfb due to reverted internally, see D71292427 ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2731114626))
2025-03-17 22:43:15 +00:00
a16ada41b9 Fix outdated docstring of torch.export.export regarding strict flag (#149077)
Summary: Fix outdated docstring of torch.export.export regarding strict flag

Test Plan: None, doc only change

Differential Revision: D71068215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149077
Approved by: https://github.com/zhxchen17
2025-03-17 22:29:20 +00:00
d25617255c Fix AOTI update_constant_buffer issue. (#149243)
Summary:
In D69553929 we changed the logic of constant & buffer update in AOTI. However, this is incompatible with the current Sigmoid runtime, since it uses different logic to pass in buffers, resulting in errors like
```
I0310 17:29:24.456960 3679102 AOTIDelegateExecutor.cpp:89] AOTIDelegateExecutor processing weights
*** Aborted at 1741652964 (Unix time, try 'date -d 1741652964') ***
*** Signal 11 (SIGSEGV) (0x30) received by PID 3679102 (pthread TID 0x7f9933e49000) (linux TID 3679102) (code: address not mapped to object), stack trace: ***
    @ 00000000000040b9 folly::symbolizer::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)
                       ./fbcode/folly/debugging/symbolizer/SignalHandler.cpp:453
    @ 0000000000006c45 folly::fibers::(anonymous namespace)::sigsegvSignalHandler(int, siginfo_t*, void*)
                       ./fbcode/folly/fibers/GuardPageAllocator.cpp:237
    @ 000000000004455f (unknown)
                       /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/signal/../sysdeps/unix/sysv/linux/libc_sigaction.c:8
                       -> /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c
    @ 00000000001e8164 torch::aot_inductor::AOTInductorModelContainer::update_constant_buffer(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, AtenTensorOpaque*, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, AtenTensorOpaque*> > > const&, bool, bool)
```

Test Plan:
1) Generate lowered merge net
```
CUDA_VISIBLE_DEVICES=0 ../buck-out/v2/gen/fbcode/b5b13003c82cbdec/caffe2/torch/fb/model_transform/fx2trt/packaging/__generate_merge_net_file__/generate_merge_net_file.par  --action=generate --input-file=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_input --output-file=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_sigmoid --lower-backend=aot_inductor  --use_sigmoid=true --aot_inductor_config="{'max_autotune': True, 'comprehensive_padding': False}" --add_passes=use_matmul_lce_replace_normal_LCE,use_triton_dot_compress,use_matmul_fuse_lce_replace_first_LCE,use_contiguous_linear_reduction_replace_linear_reduction --disable_acc_tracer=false
```

2) Load net predictor
```
CUDA_VISIBLE_DEVICES=1 ../buck-out/v2/gen/fbcode/103717df3cc2b97a/caffe2/torch/fb/model_transform/fx2trt/packaging/__load_net_predictor__/load_net_predictor --loadMode=AccuracyAB --inputNetFile=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_ts --otherNetFile=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_sigmoid --moduleName=merge --benchmarkEnableProfiling=false --predictor_hardware_type=1 --disableStaticRuntime=true
```

Reviewed By: hl475

Differential Revision: D71236710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149243
Approved by: https://github.com/hl475, https://github.com/jingsh
2025-03-17 22:10:57 +00:00
a3c6e3139a allow extra args for parameterization of tests in inductor (#149154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149154
Approved by: https://github.com/amjames, https://github.com/eellison
2025-03-17 22:05:06 +00:00
e4f6e4ac84 [MPS] Add inductor support for modified_bessel_i0. (#149342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149342
Approved by: https://github.com/malfet
2025-03-17 21:45:51 +00:00
8bc7bd94a5 [ROCm] Input vectorization in elementwise kernels for tensors with heterogeneous types (#147527)
This patch exemplifies its use for input tensors with types (float,bfloat16) when functor type is float(float,float).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147527
Approved by: https://github.com/jeffdaily

Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
2025-03-17 20:51:36 +00:00
e8dd58b8cf cpp_wrapper: Precompile device-specific header files (#146928)
This saves us about a second per compilation, which is _massive_ for the OpInfo tests.  Total OpInfo test runtime is down about 2x from this change alone.

Relands #144002, with changes needed by fbcode internals.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146928
Approved by: https://github.com/desertfire
2025-03-17 20:40:15 +00:00
5e9f792479 [ROCm] Unskip flex attention UTs after triton 3.3 bump (#148327)
Enable `test_flex_attention.py::TestLearnableBiases` unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148327
Approved by: https://github.com/jeffdaily
2025-03-17 20:15:14 +00:00
6c7d8419e3 fix two accuracy regression (#149172)
There are two accuracy regressions in the 3/12 nightly perf run. I cannot reproduce them locally, so there is no effective way to bisect. Raise the tolerance so that they pass the accuracy check.

- error log for HF MegatronBertForQuestionAnswering https://gist.github.com/shunting314/25322b66e15e98feed32e0d9a1e43316
- error log for TIMM gluon_inception_v3 https://gist.github.com/shunting314/df64ce22327df27a7057bbbd19ef5164

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149172
Approved by: https://github.com/jansel, https://github.com/eellison
2025-03-17 19:34:00 +00:00
769f19bf95 [MTIA] Add _mtia_exchangeDevice to MTIA module (#149322)
Summary: The FlexAttention path uses `_exchange_device`, so it will be needed eventually for MTIA as well.

Test Plan: `buck2 test fbcode//mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- test_exchange_device`

Reviewed By: chaos5958

Differential Revision: D70072059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149322
Approved by: https://github.com/chaos5958
2025-03-17 19:31:10 +00:00
8d7c430e84 Symintify transpose_ (#149057)
Fixes https://github.com/pytorch/pytorch/issues/148702
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149057
Approved by: https://github.com/yushangdi
2025-03-17 19:11:54 +00:00
08a644a4c4 Enable fast qlinear static/dynamic path for AArch64 through ACL directly (#148585)
This enables a fast path for eager mode static/dynamic quantization for AArch64 through Arm Compute Library (ACL) directly.

Context: PRs #126687, #139887 enabled an optimized implementation for `qlinear` and `qlinear_dynamic` for aarch64 through `ideep → oneDNN → ACL` which improved performance by ~10x compared to the previous implementation.
However, the current `qlinear` and `qlinear_dynamic` path (`ideep → oneDNN → ACL`) suffers from high overhead due to the API friction between the stateless oneDNN API and the stateful ACL low-precision GEMM (`lowp_gemm`) API - for example, ACL's `lowp_gemm` objects cache information like weights reduction or weights in optimized memory format which oneDNN does not allow due to its stateless nature.
Hence, ACL currently runs a (redundant) sum of columns and pre-transposition (to the GEMM kernel's optimal format) for each GEMM operation.
This PR addresses the sub-optimalities above by integrating ACL directly with `qlinear` and `qlinear_dynamic`.
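
The overhead described above can be illustrated with a toy integer GEMM. This is only a sketch under stated assumptions, not ACL's or oneDNN's API: the names `column_sums`, `lowp_matmul`, `stateless_gemm` and `StatefulGemm` are invented here, and the example assumes unsigned activations with a zero point and symmetric weights, so the only cached quantity is the per-column weight sum.

```python
def column_sums(weights):
    """Sum each column of a K x N integer weight matrix."""
    return [sum(col) for col in zip(*weights)]


def lowp_matmul(a_q, a_zero_point, weights, col_sums):
    """Integer GEMM computing sum_k (a - zero_point) * w via the column sums."""
    return [
        [
            sum(a * w for a, w in zip(row, col)) - a_zero_point * col_sums[n]
            for n, col in enumerate(zip(*weights))
        ]
        for row in a_q
    ]


def stateless_gemm(a_q, a_zero_point, weights):
    # Nothing survives between calls, so the weight reduction is redone each time.
    return lowp_matmul(a_q, a_zero_point, weights, column_sums(weights))


class StatefulGemm:
    """Caches the weight reduction once, then reuses it for every call."""

    def __init__(self, weights):
        self.weights = weights
        self.col_sums = column_sums(weights)  # computed once at construction

    def __call__(self, a_q, a_zero_point):
        return lowp_matmul(a_q, a_zero_point, self.weights, self.col_sums)


if __name__ == "__main__":
    weights = [[1, -1], [2, 4]]  # K = 2, N = 2
    activations = [[3, 5]]       # M = 1, K = 2
    gemm = StatefulGemm(weights)
    print(stateless_gemm(activations, 2, weights))
    print(gemm(activations, 2))
```

Both paths give the same result; the stateful object simply skips the repeated reduction, which is the kind of saving a direct integration can exploit.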

- **For `qlinear_dynamic` (dynamically quantized matmuls):**

This PR yields an **average speedup** (averaged over context lengths of 2^3 up to 2^9) of ~**50%** for `bert-base-uncased`, `bert-large-uncased`, `roberta-base`, and `distilbert-base-uncased` with 16 threads on a Neoverse-V1 (with transformers==4.48) for the benchmarking script below:
```
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
from transformers import AutoModel, AutoConfig
import time
import numpy as np
from argparse import ArgumentParser

class ModelArgumentParser(ArgumentParser):
    def __init__(self) -> None:
        super().__init__(description="huggingface model")
        self.add_argument("--context_length",
                            help="context length - number of input tokens",
                            type=int,
                            default=64
        )
        self.add_argument("--model",
                            help="model checkpoint - i.e. 'bert-base-uncased'",
                            type=str,
                            default=None)
        self.add_argument("--iters",
                          help="benchmark iterations",
                          type=int,  # without this, a value passed on the CLI stays a str and range() fails
                          default=500)

if __name__ == "__main__":
    parser = ModelArgumentParser()
    args = parser.parse_args()
    model_name = args.model
    config = AutoConfig.from_pretrained(model_name)
    batch_size = 1
    model = AutoModel.from_pretrained(model_name)
    model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    model.eval()
    inputs = torch.randint(config.vocab_size, (batch_size, args.context_length), dtype=torch.long, device="cpu")
    times = []
    with torch.no_grad():
        # warmup
        for _ in range(10):
            model(inputs)
        # benchmark
        for _ in range(args.iters):
            s = time.time_ns()
            model(inputs)
            times.append((time.time_ns() - s) / 1e6)

    print("Model = ", model_name)
    print("Context Length = ", args.context_length)
    print("Min (ms) = ", min(times))
    print("Mean (ms) = ", np.mean(times))
```

- **For `qlinear` (statically quantized matmuls):**

This PR yields an **average speedup of 2x for signed activations (`s8s8s8`) and 95x for unsigned activations (`u8s8u8`)** on a Neoverse-V1 with 16 threads for the benchmarking script below.
The averages are taken over all combinations of `M = [8, 16, ..., 512]`, `K = [768, 1024, 2048, 4096]`, `N = [768, 1024, 2048, 4096]`.
The astronomical speedup for unsigned activation is because oneDNN v3.7 does not have an optimized implementation for `u8s8u8` on AArch64.

```
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
import torch.nn as nn
from torch.quantization import QConfig
from torch.ao.quantization.observer import HistogramObserver, default_weight_observer
import numpy as np
import random
from argparse import ArgumentParser
import time

class ModelArgumentParser(ArgumentParser):
    def __init__(self) -> None:
        super().__init__()
        self.add_argument("--M",
                            help="M dimension",
                            type=int,
                            default=64
        )
        self.add_argument("--K",
                            help="K dimension",
                            type=int,
                            default=64
        )
        self.add_argument("--N",
                            help="N dimension",
                            type=int,
                            default=64
        )
        self.add_argument("--signed_input",
                            help="Use (signed) torch.qint8 for inputs instead of (unsigned) torch.quint8",
                            action="store_true"
        )
        self.add_argument("--seed",
                          help="Random seed",
                          type=int,
                          default=42
        )
        self.add_argument("--iters",
                          help="benchmark iterations",
                          type=int,  # without this, a value passed on the CLI stays a str and range() fails
                          default=500)

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

class LinearModel(nn.Module):
    def __init__(self, K, N):
        super(LinearModel, self).__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(K, N)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.fc(x)
        x = self.dequant(x)
        return x

def quantize_model(model, args):
    qconfig = QConfig(
            activation=HistogramObserver.with_args(reduce_range=False,
            dtype=torch.qint8 if args.signed_input else torch.quint8),
            weight=default_weight_observer,
    )
    # Prepare the model for static quantization
    # Specify quantization configurations
    model.qconfig = qconfig
    model_prepared = torch.quantization.prepare(model)  # use the argument, not the global model_fp32

    # Calibrate the model with sample inputs
    # Example input data for calibration
    with torch.no_grad():
        sample_data = torch.randn(args.M, args.K)
        model_prepared(sample_data)
    # Convert the prepared model to a quantized model
    model_quantized = torch.quantization.convert(model_prepared)
    return model_quantized

if __name__ == "__main__":
    parser = ModelArgumentParser()
    args = parser.parse_args()

    set_seed(args.seed)
    model_fp32 = LinearModel(args.K, args.N)
    model_quantized = quantize_model(model_fp32, args)

    inputs = torch.randn(args.M, args.K)
    times = []
    with torch.no_grad():
        # warmup
        for _ in range(10):
            model_quantized(inputs)
        # benchmark
        for _ in range(args.iters):
            s = time.time_ns()
            model_quantized(inputs)
            times.append((time.time_ns() - s) / 1e6)

    print("M,K,N,signed = ", args.M, args.K, args.N, args.signed_input)
    print("Min Times (ms) = ", min(times))
    print("Mean Times (ms) = ", np.mean(times))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148585
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-17 18:21:10 +00:00
c41c2130be Fix printing INT64_MIN (#149148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149148
Approved by: https://github.com/anijain2305
2025-03-17 17:57:18 +00:00
8cdb9adc05 do not run test_ck_blas_library on cpu (#148316)
Fix on non-rocm:

```
root@e01-tw-ue5g2g3sap6:~/pytorch/test# python test_linalg.py TestLinalgCPU.test_ck_blas_library_cpu
E
======================================================================
ERROR: test_ck_blas_library_cpu (__main__.TestLinalgCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper
    method(*args, **kwargs)
  File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 480, in instantiated_test
    raise rte
  File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 460, in instantiated_test
    result = test(self, **param_kwargs)
  File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 1242, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/root/pytorch/torch/testing/_internal/common_utils.py", line 1981, in _fn
    fn(*args, **kwargs)
  File "/root/pytorch/test/test_linalg.py", line 8621, in test_ck_blas_library
    torch.backends.cuda.preferred_blas_library('ck')
  File "/root/pytorch/torch/backends/cuda/__init__.py", line 258, in preferred_blas_library
    torch._C._set_blas_preferred_backend(_BlasBackends[backend])
RuntimeError: Cannot set preferred backend to Ck if PyTorch has not been compiled for ROCm.

To execute this test, run the following from the base repo dir:
    python test/test_linalg.py TestLinalgCPU.test_ck_blas_library_cpu

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.346s

FAILED (errors=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148316
Approved by: https://github.com/jeffdaily
2025-03-17 17:45:45 +00:00
224cd9f055 [ez] Flush trymerge print statements (#149012)
Logs of trymerge don't match up with timestamps, ex
https://github.com/pytorch/pytorch/actions/runs/13766246347/job/38493307591
Ex:
```
2025-03-10T14:20:41.4899509Z Attempting merge of https://github.com/pytorch/pytorch/pull/148648 (0.003460856278737386 minutes elapsed)
...
2025-03-10T14:20:41.4907867Z Merge of https://github.com/pytorch/pytorch/pull/148648 failed due to: Still waiting for 16 jobs to finish, first few of them are: Check Labels / Check labels, trunk / macos-py3-arm64 / build, trunk / win-vs2022-cpu-py3 / build, trunk / cuda12.4-py3.10-gcc9-sm80 / build, trunk / win-vs2022-cuda12.6-py3 / build. Retrying in 5 min
2025-03-10T14:20:41.4909772Z Attempting merge of https://github.com/pytorch/pytorch/pull/148648 (5.280085611343384 minutes elapsed)
...
2025-03-10T14:20:41.4916812Z Merge of https://github.com/pytorch/pytorch/pull/148648 failed due to: Still waiting for 15 jobs to finish, first few of them are: trunk / macos-py3-arm64 / build, trunk / win-vs2022-cpu-py3 / build, trunk / cuda12.4-py3.10-gcc9-sm80 / build, trunk / win-vs2022-cuda12.6-py3 / build, trunk / linux-focal-cuda12.6-py3.10-gcc11-no-ops / build. Retrying in 5 min
2025-03-10T14:20:41.4918183Z Attempting merge of https://github.com/pytorch/pytorch/pull/148648 (10.590279157956441 minutes elapsed)
```

Either buffering prints or github actions logs are being weird?

Print with flush to see if it helps
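
A minimal sketch of the idea, assuming the log lines come from plain `print` calls: when stdout is a pipe rather than a terminal, Python block-buffers its output, so lines written minutes apart can reach the log collector in one burst and receive nearly identical timestamps. The helper name `log` is hypothetical and not taken from trymerge.

```python
import time


def log(message: str) -> None:
    """Print a progress line and push it to the underlying stream at once."""
    # flush=True writes the buffered text out immediately, so an external
    # log collector can timestamp each line when it is actually produced.
    print(message, flush=True)


if __name__ == "__main__":
    start = time.monotonic()
    for _ in range(3):
        elapsed_minutes = (time.monotonic() - start) / 60
        log(f"Attempting merge ({elapsed_minutes:.3f} minutes elapsed)")
        time.sleep(0.1)  # stand-in for the real retry delay
```

Running the interpreter with `-u` or setting `PYTHONUNBUFFERED=1` would have a similar effect without touching the call sites.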
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149012
Approved by: https://github.com/malfet
2025-03-17 17:04:48 +00:00
aaa4c3d60b [mm_logs] make aten mm info readable (#148800)
Summary:
As the title says: print the aten mm info as a table.

For example (see also the picture in the test plan):

| Name     | M   | N   | K   | Count |
| aten.mm | 16  | 6   |  16 |     1     |
...
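
One way such a table could be produced is sketched below. This is not the code from the PR: the function `format_mm_table` and the `{(name, M, N, K): count}` layout are assumptions made for illustration.

```python
from collections import Counter


def format_mm_table(counts):
    """Render {(name, M, N, K): count} as an aligned, pipe-delimited table."""
    header = ("Name", "M", "N", "K", "Count")
    rows = [header] + [
        (name, str(m), str(n), str(k), str(count))
        for (name, m, n, k), count in sorted(counts.items())
    ]
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(row[i]) for row in rows) for i in range(len(header))]
    return "\n".join(
        "| " + " | ".join(cell.ljust(w) for cell, w in zip(row, widths)) + " |"
        for row in rows
    )


if __name__ == "__main__":
    counts = Counter()
    counts[("aten.mm", 16, 6, 16)] += 1
    print(format_mm_table(counts))
```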

Test Plan: {F1975907876}
<img width="1090" alt="Screenshot 2025-03-11 at 3 13 00 PM" src="https://github.com/user-attachments/assets/ffae8c56-e32c-49cc-bbfb-5b8d216b8657" />

Differential Revision: D70825664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148800
Approved by: https://github.com/henrylhtsang
2025-03-17 17:00:58 +00:00
2a011ca904 [ROCm] testing: enable MEFF/FA unittests for gfx1100 (#148911)
Include gfx1100, and optionally enable gfx1201/gfx950 according to env var TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148911
Approved by: https://github.com/jeffdaily
2025-03-17 16:41:15 +00:00
9d37b501db Revert "[ROCm] enable HIPMallocAsyncAllocator (#149145)"
This reverts commit 2e02c07a5d1c432547542f90de2885be9ffd13cf.

Reverted https://github.com/pytorch/pytorch/pull/149145 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally.  @albanD, might you be able to help get this PR landed? See D71214814 for more details on the failure. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/149145#issuecomment-2730104736))
2025-03-17 16:17:02 +00:00
c7c3e77324 Refine XPU oneDNN context manager API (#147349)
# Motivation
This PR introduces improvements to the XPU oneDNN context manager API:

- `GpuEngineManager::get_engine`: Added a new API that accepts a `DeviceIndex` to simplify code and improve usability - by default, using the current device index.
- `GpuStreamManager::get_stream`: Now explicitly requires a `DeviceIndex` as input to ensure correctness and consistency - by default, using the current device index.

Additionally, it enhances integration with `c10::DeviceGuard`, ensuring correct device management.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147349
Approved by: https://github.com/EikanWang
2025-03-17 14:45:56 +00:00
790f93db3a Update slow tests (#149300)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149300
Approved by: https://github.com/pytorchbot
2025-03-17 11:39:29 +00:00
b2862f1435 optimize the decomposition of aten.native_group_norm (#144733)
Summary:
Optimize the decomposition of aten.native_group_norm. Reduce unnecessary repeated operations by changing the order of operations for `mean`, `rstd`, `weight`, `bias` and `input`, which can improve performance when `flattened_inner_size` is large.

The original decomposition:
1. compute `mean` and `rstd`,
2. out = (x - mean) * rstd, compute in the range [N, C, *],
3. out = out * weight + bias, compute in the range [N, C, *].

The new decomposition:
1. compute `mean` and `rstd`,
2. new_weight = rstd * weight, new_bias = - mean * rstd * weight + bias, compute in the range [N, C],
3. out = x * new_weight + new_bias, compute in the range [N, C, *].

I tested the Inductor performance benchmark with this PR on both CPU and A100. On CPU, two torchbench models (functorch_dp_cifar10 and opacus_cifar10) show about a 25% performance improvement, and two diffusion models (Stable Diffusion and Latent Consistency Model (LCM)) show about a 2% improvement. On A100, no performance gains or regressions were seen.
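
The algebra behind the reordering is `(x - mean) * rstd * weight + bias = x * (rstd * weight) + (bias - mean * rstd * weight)`. The pure-Python sketch below checks it on one sample of shape `[C][L]`. It is not the decomposition code from the PR; the helper names and the `eps` value are assumptions for illustration.

```python
import math
import random


def group_stats(x, groups, eps=1e-5):
    """Per-group (mean, rstd) for one sample x given as C lists of L values."""
    per_group = len(x) // groups
    stats = []
    for g in range(groups):
        values = [v for ch in x[g * per_group:(g + 1) * per_group] for v in ch]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats.append((mean, 1.0 / math.sqrt(var + eps)))
    return stats


def group_norm_original(x, weight, bias, groups):
    per_group = len(x) // groups
    stats = group_stats(x, groups)
    out = []
    for c, ch in enumerate(x):
        mean, rstd = stats[c // per_group]
        # Per element: subtract, scale by rstd, scale by weight, add bias.
        out.append([(v - mean) * rstd * weight[c] + bias[c] for v in ch])
    return out


def group_norm_folded(x, weight, bias, groups):
    per_group = len(x) // groups
    stats = group_stats(x, groups)
    out = []
    for c, ch in enumerate(x):
        mean, rstd = stats[c // per_group]
        new_weight = rstd * weight[c]           # once per channel
        new_bias = bias[c] - mean * new_weight  # once per channel
        # Per element: a single multiply-add.
        out.append([v * new_weight + new_bias for v in ch])
    return out


if __name__ == "__main__":
    random.seed(0)
    x = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
    weight = [random.gauss(1, 0.1) for _ in range(4)]
    bias = [random.gauss(0, 0.1) for _ in range(4)]
    a = group_norm_original(x, weight, bias, groups=2)
    b = group_norm_folded(x, weight, bias, groups=2)
    print(max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)))
```

The per-element work drops from four arithmetic operations to two, which is why the gain grows with the inner size.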

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144733
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-17 09:27:01 +00:00
1cc5f6b623 Optimize MaxPool1d param ceil_mode description (#148869)
Fixes #148123

Add output shape formula based on `ceil_mode` value, according to

00199acdb8/aten/src/ATen/native/Pool.h (L61-L75)

## Test Result

### Before

![image](https://github.com/user-attachments/assets/0a175178-a104-4348-a14b-516e866d533a)

### After

![image](https://github.com/user-attachments/assets/ce621d4b-1986-41fb-bd71-2b03c0aa996e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148869
Approved by: https://github.com/mikaylagawarecki
2025-03-17 08:50:40 +00:00
916e8979d3 Skip some tests not using gradcheck on slowgradcheck (#149220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149220
Approved by: https://github.com/seemethere
2025-03-17 00:34:52 +00:00
6048d88afe [ARM64][CUDA] skip string pattern matching in test_workspace_allocation_error (#149236)
`unwind()` on ARM64 seems to elide the strings of interest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149236
Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/BoyuanFeng
2025-03-17 00:30:43 +00:00
bfee141666 [BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257
Approved by: https://github.com/jansel
2025-03-16 23:52:58 +00:00
6b1b95ad2a Support subclass constructor capturing in export (#147014)
Notable TODOs:
1. Need to implement AutogradHOP to get rid of subclasses before serializing
2. Need to implement mechanism to figure out what subclasses will be used in export when they are not expressed in the inputs

Differential Revision: [D69640673](https://our.internmc.facebook.com/intern/diff/D69640673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147014
Approved by: https://github.com/bdhirsh
2025-03-16 18:19:19 +00:00
5905bbe745 [dynamo][guards][serialization] Dont use ID_MATCH guard for bool and None (#149228)
Doing this removes the need to collect `id` and therefore facilitates serialization. It also improves the readability of recompilation messages: previously, the recompile message would just show the `id`.
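
The difference can be sketched with guard expressions written as strings. This is an illustration only, not Dynamo's guard implementation: `id_match_guard` and `constant_match_guard` are hypothetical helpers.

```python
def id_match_guard(source: str, value: object) -> str:
    """Hypothetical guard text that pins an object by its id()."""
    # The number is only meaningful inside the current process, so it cannot
    # be saved and reloaded, and it tells a reader nothing about the value.
    return f"id({source}) == {id(value)}"


def constant_match_guard(source: str, value: object) -> str:
    """Hypothetical guard text that names the None/True/False singleton."""
    if not (value is None or isinstance(value, bool)):
        raise TypeError("only None and bool are handled in this sketch")
    return f"{source} is {value!r}"


if __name__ == "__main__":
    print(id_match_guard("x", None))        # process-specific number
    print(constant_match_guard("x", None))
    print(constant_match_guard("flag", True))
```

The second form is a plain expression over singletons, so it reads the same in every process and in every recompilation message.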

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149228
Approved by: https://github.com/jansel
2025-03-16 15:56:17 +00:00
9f33c6f0a0 [MPS] Add support for modified_bessel_i0 in eager. (#149264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149264
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-16 04:45:49 +00:00
f80bee4934 [MPS][BE] Move common binary ops macros to indexing.h (#149263)
And binary op invocation logic to OperationUtils.mm

This is a no-op change, additional sanity checks/logic improvements will be added as followups
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149263
Approved by: https://github.com/dcci
ghstack dependencies: #149262
2025-03-16 02:06:40 +00:00
21c2edfec8 [MPS/metal] Add missing inline to function definitions. (#149265)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149265
Approved by: https://github.com/malfet
2025-03-16 00:33:27 +00:00
3e2c4086ad [EZ][BE] Reuse result_of from c10/metal/utils.h (#149262)
No need for one more implementation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149262
Approved by: https://github.com/dcci
2025-03-16 00:21:28 +00:00
acf42b0048 Fix memory leak in subproc_pool future (#149259)
Summary: The future holds a reference to the callback, and the callback captures the outer future. Seems to create a cycle that the garbage collector doesn't clean up. Verified by compiling 15k synthetic Triton kernels and observing that subprocess memory overhead improves.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149259
Approved by: https://github.com/Skylion007
2025-03-15 20:26:30 +00:00
a9c55277d7 [Reland] First version of statically compiled launcher for triton compiled CUDA kernels (#149238)
This is a new version of https://github.com/pytorch/pytorch/pull/148561 fixing the ROCM test failure

Putting this up for a first pass review, though I will likely make a bunch of changes before landing to add more features, etc.

This diff implements a first version of a static CUDA kernel launcher in `torch._C`. The goal here is to take a cubin file and some metadata from a CompiledKernel from `triton`, and launch the cubin file directly.

Background doc: https://docs.google.com/document/d/1rjRcHl6MfauHG30nCoQX-9UKvKyIs4WWMy_GsGyqb9g/edit?tab=t.0#heading=h.ut5lf39lzq66

Normally, using triton's CompiledKernel.make_launcher(), we would pay the cost of codegenning C++ and running it at compile time. With this new approach, we can use one statically compiled library to launch the kernel.

The tradeoff here is that this new kernel launcher will not be able to use codegen to deal with different lengths/types of arguments. So we use templating to handle up to 10 arguments for now. We also allocate 8 bytes on the stack per argument no matter the argument type, which can take more memory than codegenning. On the other hand, we improve compile time on cold and warm start by not having to call the C++ compiler at all.
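
The fixed-slot layout can be pictured with a small sketch. This is not the launcher's C++ code: the type codes, `MAX_ARGS` constant and `pack_kernel_args` function are assumptions made for illustration, modelling only the "8 bytes per argument, at most 10 arguments" rule stated above.

```python
import struct

MAX_ARGS = 10  # assumed cap, mirroring the limit mentioned in the text

# Hypothetical type codes: "i" = int32, "f" = float32, "q" = int64 or pointer.
# Narrow types are padded with 4 bytes so that every slot is 8 bytes wide.
_SLOT_FORMATS = {"i": "<i4x", "f": "<f4x", "q": "<q"}


def pack_kernel_args(signature: str, values) -> bytes:
    """Pack each argument into its own 8-byte slot, whatever its type."""
    if len(signature) != len(values):
        raise ValueError("signature and values differ in length")
    if len(values) > MAX_ARGS:
        raise ValueError(f"at most {MAX_ARGS} arguments are supported")
    return b"".join(
        struct.pack(_SLOT_FORMATS[code], value)
        for code, value in zip(signature, values)
    )


if __name__ == "__main__":
    buffer = pack_kernel_args("iqf", [1, 2, 3.0])
    print(len(buffer))  # three slots of 8 bytes each
```

A generated launcher could size each slot exactly, which is the memory cost the fixed layout trades for not invoking a compiler.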

This diff does not add the launcher to torch, but introduces a basic test suite.

A list of TODOs that are not yet complete:
- Handle `nvTmaDesc` and `cuTensorMap`, which triton handles
- Embed the grid logic instead of passing in gridX,Y,Z
- Handle launch_enter and exit hooks? (Not sure if inductor has these)
- Benchmarking to see if there's runtime performance loss
- Probably lots of features of the triton C++ generated code that I haven't handled yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149238
Approved by: https://github.com/oulgen
2025-03-15 15:06:46 +00:00
c83c711da8 Remove some memory overhead in parallel compile workers (#149168)
Summary: The parallel compile workers are holding on to more memory than they need to because they're loading the compiled modules into memory. Update the post-fork initializer to record when in a subprocess and skip some of the unnecessary overhead.

Test Plan: Ran a test script to compile 15k Triton kernels and used tracemalloc in the subprocs to investigate the overhead. On my devgpu:
* After importing torch in a subproc: 371M
* Without this PR, after compiling 15k kernels: 825M
* With this PR, after compiling 15k kernels: 531M

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149168
Approved by: https://github.com/jansel
2025-03-15 14:20:40 +00:00
e7e477c1f9 Not generate custom obj json when it's empty (#149246)
Summary: as title.

See internal Diff summary for more context.

Test Plan: buck run @fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r config_not_generated

Differential Revision: D71241676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149246
Approved by: https://github.com/houseroad

Co-authored-by: Huamin Li <huaminli@meta.com>
2025-03-15 13:00:48 +00:00
4482a65fef Add side_effect to avoid dce custom op in CA graph (#149181)
We found that in compiled_autograd, a user-defined custom op can be removed by dead code elimination (DCE) in the backward graph. We added a side-effect condition to the DCE function to prevent it from eliminating custom ops with side effects in the CA graph.
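
The rule can be sketched on a toy graph. This is not the FX or compiled autograd implementation: `Node`, `has_side_effect` and `eliminate_dead_code` are names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    inputs: list = field(default_factory=list)
    has_side_effect: bool = False


def eliminate_dead_code(nodes, outputs):
    """Keep nodes reachable from the outputs, plus side-effecting nodes."""
    by_name = {node.name: node for node in nodes}
    # Side-effecting nodes act as extra roots, so they and their inputs
    # survive even when no other node consumes their result.
    worklist = list(outputs) + [n.name for n in nodes if n.has_side_effect]
    live = set()
    while worklist:
        name = worklist.pop()
        if name in live:
            continue
        live.add(name)
        worklist.extend(by_name[name].inputs)
    return [node for node in nodes if node.name in live]


if __name__ == "__main__":
    graph = [
        Node("grad"),
        Node("custom_op", ["grad"], has_side_effect=True),  # result unused
        Node("unused", ["grad"]),
        Node("out", ["grad"]),
    ]
    print([n.name for n in eliminate_dead_code(graph, ["out"])])
```

Without the side-effect roots, `custom_op` would be dropped along with `unused`, since neither feeds the output.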

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149181
Approved by: https://github.com/xmfan
2025-03-15 04:15:49 +00:00
115fc98cc0 Migrate aten.split.Tensor from using Sharding Rule to Sharding Strategy (#149106)
Summary:
Use Sharding Strategy for aten.split.Tensor instead of sharding rule

Test Plan:
pytest test/distributed/tensor/test_dtensor_ops.py -s -k split

Reviewers:
xilunwu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149106
Approved by: https://github.com/XilunWu, https://github.com/tianyu-l
2025-03-15 04:03:40 +00:00
740ce0fa5f op should NOT be static in aoti_torch_call_dispatcher (#149208)
aoti_torch_call_dispatcher is meant to call different ops, so the op must not be static. Otherwise, every call to this API will call the first op that was ever called, which is not the intended behavior of any human being.
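
A Python analogy of the bug (the real code is C++; `OPS`, `call_dispatcher_buggy` and `call_dispatcher_fixed` are names invented for this sketch). A function attribute stands in for the C++ `static` local: it is initialized on the first call and silently reused afterwards.

```python
OPS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}


def call_dispatcher_buggy(op_name, *args):
    # Mimics a `static` local: resolved on the first call only, then reused
    # for every later call regardless of op_name.
    if not hasattr(call_dispatcher_buggy, "_op"):
        call_dispatcher_buggy._op = OPS[op_name]
    return call_dispatcher_buggy._op(*args)


def call_dispatcher_fixed(op_name, *args):
    op = OPS[op_name]  # looked up on every call
    return op(*args)


if __name__ == "__main__":
    print(call_dispatcher_buggy("add", 2, 3))
    print(call_dispatcher_buggy("mul", 2, 3))  # still runs "add"
    print(call_dispatcher_fixed("mul", 2, 3))
```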

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149208
Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/malfet
2025-03-15 01:47:11 +00:00
578160c875 [ca] don't inline accumulate grad op (#149014)
we use dummy tensors in our initial trace, so we should never inline. the subclass dispatch might not support the dummy tensor, e.g. DTensor accumulate grad will check that both param and grad are DTensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149014
Approved by: https://github.com/jansel
ghstack dependencies: #149064
2025-03-15 01:10:54 +00:00
f4368d8872 [ca] clean up aot node deduping (#149064)
rename the AOT nodes as we copy paste them into the CA graph

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149064
Approved by: https://github.com/jansel
2025-03-15 01:10:54 +00:00
96795e9533 [BE] Parametrize TestMPS.test_binops_dtype_precedence (#149234)
No-op change, just splits a longer test into a series of smaller ones
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149234
Approved by: https://github.com/atalman, https://github.com/dcci
ghstack dependencies: #149216, #149233
2025-03-15 00:37:11 +00:00
1c7196f04b Add new GHA workflow to cache ROCm CI docker images on MI300 CI runners periodically (#148394)
Refiling https://github.com/pytorch/pytorch/pull/148387 from pytorch repo branch to get AWS login via OIDC working

Successful docker caching run: https://github.com/pytorch/pytorch/actions/runs/13843689908/job/38737095535
Run without cached docker image: https://github.com/pytorch/pytorch/actions/runs/13843692637/job/38746033460
![image](https://github.com/user-attachments/assets/c410ff35-a150-4885-b904-3a5e1888c032)
Run with cached docker image:
![image](https://github.com/user-attachments/assets/41e417b5-a795-4ed2-a9cd-00151db8f813)
~6 min vs 3 s :)

Thanks @saienduri for the help on the MI300 infra side

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148394
Approved by: https://github.com/jeffdaily
2025-03-15 00:34:04 +00:00
9ad6265d04 [AOTI][XPU] Fix: model_container_runner_xpu.cpp is not built into libtorch_xpu.so (#149175)
If model_container_runner_xpu.cpp is not built into libtorch_xpu.so, compilation fails when a user builds a C++ inference application on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149175
Approved by: https://github.com/jansel
2025-03-15 00:30:04 +00:00
7537b19c73 [FSDP2] Update ignored_params docstring and add unit test (#149074)
Fixes https://github.com/pytorch/pytorch/issues/148242

ignored_params won't be moved to devices in full_shard(), update docstring.
Add unit test `test_move_states_to_device_ignored_param_device` to show that ignored_params won't be moved during full_shard(), but would be after `model.cuda()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149074
Approved by: https://github.com/awgu
2025-03-15 00:23:09 +00:00
09f7f62cfe Fix atomic operation compatibility for ARMv8-A (Raspberry Pi 4) by adjusting compilation flags (#148070)
**Issue:**
* The ldaddal instruction is an AArch64 atomic operation available from ARMv8.1-A onwards.
* Raspberry Pi 4 (Cortex-A72) is ARMv8-A, which does not support ldaddal, leading to failures when running PyTorch built with `-march=armv8.2-a+sve`.
* This led to an issue when running PyTorch on ARMv8-A (Raspberry Pi 4), as unsupported atomic operations were generated.

**Fix:**
* Updated the build flags to explicitly use **-march=armv8-a+sve**, so that GCC and clang promote it correctly. This resolves the compatibility issue with ARMv8-A, and SVE still works as before.
* This ensures that PyTorch builds correctly for ARMv8-A platforms (e.g., Raspberry Pi 4) while still enabling SVE for supported hardware.

Test plan:
 - Allocate `a1.4xlarge` on AWS
 - Run following script using wheel produced by this PR
 ```python
import torch
def f(x):
    return x.sin() + x.cos()

print(torch.__version__)
f_c = torch.jit.script(f)
```
- Observe no crash
```
$ python3 foo.py
2.7.0.dev20250313+cpu
```
- Observe crash with 2.6.0
```
$ python3 foo.py
2.6.0+cpu
Illegal instruction (core dumped)
```

Fixes #146792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148070
Approved by: https://github.com/malfet
2025-03-15 00:02:38 +00:00
08af311fc2 [MPS] Fix type promotion for torch.floor_divide (#149233)
And delete some duplicate glue code by relying on the stub
After this change `torch.arange(10, device = 'mps') // torch.arange(10., device='mps')` will return a tensor of floats, which is the common dtype for a float + integral operation, rather than a tensor of ints
Checked by `test_div2` inductor testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149233
Approved by: https://github.com/atalman
ghstack dependencies: #149216
2025-03-15 00:00:42 +00:00
eb7bf4202d Make dynamism code robust to NotImplementedException (#148823)
In prod many models have `@property` methods that raise
NotImplementedError. This PR updates our dynamism code to be more robust
to these types of models.
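
The defensive pattern can be sketched as follows. This is not the code from the PR: `safe_public_attrs` and the `Model` class are hypothetical and only show how attribute inspection can tolerate properties that raise.

```python
def safe_public_attrs(obj):
    """Collect readable public attributes, skipping properties that raise."""
    values = {}
    for name in dir(obj):
        if name.startswith("_"):
            continue
        try:
            values[name] = getattr(obj, name)
        except NotImplementedError:
            # Abstract-style properties are skipped instead of aborting
            # the whole inspection.
            continue
    return values


class Model:
    def __init__(self):
        self.hidden_size = 64

    @property
    def device_type(self):
        raise NotImplementedError("subclasses must override")


if __name__ == "__main__":
    print(safe_public_attrs(Model()))
```

Catching only `NotImplementedError` keeps other failures visible; widening the `except` clause would be a separate design decision.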

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148823
Approved by: https://github.com/laithsakka
2025-03-14 23:38:19 +00:00
ff58ccec6c [ATen-CPU] Add math.h for Gelu (#149164)
Summary:
## Context

This PR is mostly to enable ExecuTorch build for Windows: https://github.com/pytorch/executorch/pull/9198

In ExecuTorch, the optimized GeLU kernel calls the ATen implementation. However, on Windows `math.h` needs to be included with `#define _USE_MATH_DEFINES` in order for math constants to be defined.

Test Plan:
Rely on CI to make sure existing tests do not break. Tested separately with ExecuTorch to make sure Windows build is successful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149164
Approved by: https://github.com/swolchok
2025-03-14 23:37:25 +00:00
f9b4856989 Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)"
This reverts commit c95a6b416b4d1b830535f82e2719c055d077cbad.

Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @zou3519 can you please help land this internally? See the sigmoid tests in D71198793 for details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2725982539))
2025-03-14 23:13:34 +00:00
643aaea133 Revert "[RFC] First version of statically compiled launcher for triton compiled CUDA kernels (#148561)"
This reverts commit 5a843f8973d7fc6a601f089fc969d2a5ac7e5338.

Reverted https://github.com/pytorch/pytorch/pull/148561 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/148561#issuecomment-2725969268))
2025-03-14 23:01:26 +00:00
05f2cbfe19 Add meta function for out variants of ones,zeros,empty (#149098)
Open another PR to fix merge conflicts. Fixes https://github.com/pytorch/pytorch/issues/135832

For aten.ones, aten.zeros, followed this [link](https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit?tab=t.0#heading=h.64r4npvq0w0) to register meta functions.

For aten.empty.out, followed this [part](https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit?tab=t.0#heading=h.iy9lxhxhtl5v) to register a decomp for empty that handles the FakeTensor input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149098
Approved by: https://github.com/williamwen42
2025-03-14 22:17:30 +00:00
d7d9a71e19 [MPSInductor] Add support for atan2 (#149216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149216
Approved by: https://github.com/dcci
2025-03-14 21:53:03 +00:00
dd6e9df3d0 [MPS] fix attention enable_gqa crash on mps (#149147)
Fixes #149132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149147
Approved by: https://github.com/malfet
2025-03-14 21:25:54 +00:00
0bd863a62f [MPS] Add inductor support for i1e. (#149221)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149221
Approved by: https://github.com/malfet
2025-03-14 21:18:38 +00:00
a0893475ba Enable oneDNN dispatch for gemm bf16bf16->bf16 (#148197)
Currently, `linear` layers using BF16 are dispatched to OpenBLAS, provided that sbgemm_ is available.
However, profiling on AArch64 shows that dispatching to oneDNN results in a significant speedup. This PR updates the dispatch logic to leverage oneDNN for improved performance.

Attaching some benchmark results. Instance: Neoverse V1, 16 threads.

<img width="482" alt="Screenshot 2025-02-28 at 17 18 38" src="https://github.com/user-attachments/assets/b84e7455-af6e-417f-920d-bdd2bec2e8f9" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148197
Approved by: https://github.com/malfet
2025-03-14 20:58:24 +00:00
1bdbf12672 Update as strided doc (#149146)
Make it clearer why it is not recommended to use it and when the resulting Tensor will have undefined behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149146
Approved by: https://github.com/gchanan, https://github.com/jbschlosser
2025-03-14 19:49:57 +00:00
69aeb87eca update error message in get_backend() more detail_ (#141796)
When attempting to reconfigure the environment without properly handling the PyTorch-related settings, you may encounter the following message.
```
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/torch/distributed/distribut │
                             │ ed_c10d.py:1215 in get_backend                                                                                            │
                             │                                                                                                                           │
                             │   1212 │   if _rank_not_in_group(pg):                                                                                     │
                             │   1213 │   │   raise ValueError("Invalid process group specified")                                                        │
                             │   1214 │   pg_store = _world.pg_map[pg] if pg in _world.pg_map else None                                                  │
                             │ ❱ 1215 │   return Backend(not_none(pg_store)[0])                                                                          │
                             │   1216                                                                                                                    │
                             │   1217                                                                                                                    │
                             │   1218 def _get_process_group_uid(pg: ProcessGroup) -> int:                                                               │
                             │                                                                                                                           │
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/torch/utils/_typing_utils.p │
                             │ y:13 in not_none                                                                                                          │
                             │                                                                                                                           │
                             │   10                                                                                                                      │
                             │   11 def not_none(obj: Optional[T]) -> T:                                                                                 │
                             │   12 │   if obj is None:                                                                                                  │
                             │ ❱ 13 │   │   raise TypeError("Invariant encountered: value was None when it should not be")                               │
                             │   14 │   return obj                                                                                                       │
                             │   15                                                                                                                      │
                             ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
                             TypeError: Invariant encountered: value was None when it should not be
Exception ignored in: <function Vllm.__del__ at 0x7f35f96b6dd0>
```
Because this message can confuse developers, this PR adds more detail to the error to help clarify the situation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141796
Approved by: https://github.com/kwen2501
2025-03-14 19:42:42 +00:00
5e79b61e8a add PrivateUse1 backend in fsdp collectives (#147260)
Add the PrivateUse1 backend in FSDP collectives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147260
Approved by: https://github.com/weifengpy
2025-03-14 19:41:41 +00:00
fe01af2242 [AOTI][debug logger] small fix for intermediate value debugger for jit when arg is not tensor (#149007)
repro:
```
import torch
import torch._inductor.config as config

config.aot_inductor.debug_intermediate_value_printer = "2"
config.aot_inductor.filtered_kernel_names = "triton_poi_fused__to_copy_add_0"

class Model(torch.nn.Module):
    def forward(self, x):
        x = x.to(torch.float)
        return x + 1

model = Model().cuda()
x = torch.randn(10).cuda().to(torch.float8_e4m3fn)
_ = torch.compile(model, fullgraph=True)(x)

print("done")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149007
Approved by: https://github.com/jingsh
2025-03-14 19:40:41 +00:00
c96ed7e6f5 [BE]: No include left behind - recursive glob setuptools support (#148258)
Fixes #148256
Test Plan: check the printout from the `setup.py` build and verify that the files are still included.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148258
Approved by: https://github.com/malfet, https://github.com/benjaminglass1
2025-03-14 19:39:21 +00:00
9d7945e382 [EZ] Fix typo in UnaryOps.mm (#149217)
s/imput/input/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149217
Approved by: https://github.com/ZainRizvi, https://github.com/dcci
2025-03-14 19:31:20 +00:00
a7f8de2198 Add nn.Bilinear param validation (#149018)
Fixes #103425

## Changes

- Document that size values `must be > 0`
- Add validation for `in1_features` param

Currently, only `in1_features` causes a runtime error. Adding checks for `in2_features` and `out_features` as well might be BC-breaking.

```python
import torch
from torch import nn

class lenet(nn.Module):
    def __init__(self):
        super(lenet, self).__init__()
        self.conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, stride=1)

        # Raises an error; `in1_features=1, in2_features=0, out_features=0` does not
        self.linear = nn.Bilinear(in1_features=0, in2_features=0, out_features=0)

    def forward(self, x):
        # 1st block
        x = self.conv(x)
        x = self.linear(x)

        return x

if __name__ == '__main__':
    net = lenet()

```

## Test Result

```bash
pytest test/test_nn.py -k test_bilinear -vv
```

![image](https://github.com/user-attachments/assets/20617ba9-bac5-4db2-aecc-1831dbc8eb43)

![image](https://github.com/user-attachments/assets/401e4e1f-051a-4e1c-952b-48e85de64b0b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149018
Approved by: https://github.com/mikaylagawarecki
2025-03-14 19:26:12 +00:00
5a843f8973 [RFC] First version of statically compiled launcher for triton compiled CUDA kernels (#148561)
Putting this up for a first pass review, though I will likely make a bunch of changes before landing to add more features, etc.

This diff implements a first version of a static CUDA kernel launcher in `torch._C`. The goal here is to take a cubin file and some metadata from a CompiledKernel from `triton`, and launch the cubin file directly.

Background doc: https://docs.google.com/document/d/1rjRcHl6MfauHG30nCoQX-9UKvKyIs4WWMy_GsGyqb9g/edit?tab=t.0#heading=h.ut5lf39lzq66

Normally, using triton's CompiledKernel.make_launcher(), we would pay the cost of codegenning C++ and running it at compile time. With this new approach, we can use one statically compiled library to launch the kernel.

The tradeoff here is that this new kernel launcher will not be able to use codegen to deal with different lengths/types of arguments. So we use templating to handle up to 10 arguments for now. We also allocate 8 bytes on the stack per argument no matter the argument type, which can take more memory than codegenning. On the other hand, we improve compile time on cold and warm start by not having to call the C++ compiler at all.
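
To make the memory tradeoff concrete, here is a small arithmetic sketch. It is not the launcher's code: it only compares a tightly packed argument buffer with one that reserves a fixed 8-byte slot per argument. The argument list `ARG_FORMATS` is a made-up example, and the `struct` format codes stand in for the C argument types.

```python
import struct

# Hypothetical kernel signature: (int32, float32, 64-bit pointer).
# struct format codes stand in for the C argument types.
ARG_FORMATS = ["i", "f", "Q"]

SLOT_BYTES = 8  # fixed slot size per argument, as described above


def packed_size(formats):
    """Bytes needed if each argument takes only its own size ('=' disables padding)."""
    return sum(struct.calcsize("=" + fmt) for fmt in formats)


def fixed_slot_size(formats):
    """Bytes needed if every argument gets one fixed-size slot."""
    return SLOT_BYTES * len(formats)


print(packed_size(ARG_FORMATS), fixed_slot_size(ARG_FORMATS))
```

For this hypothetical signature the packed layout needs 16 bytes and the fixed-slot layout needs 24, which is the extra memory the paragraph above refers to.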

This diff does not add the launcher to torch, but introduces a basic test suite.

TODOs that are not yet complete and will be done in separate diffs:
- Handle `nvTmaDesc` and `cuTensorMap`, which triton handles
- Embed the grid logic instead of passing in gridX,Y,Z. With https://github.com/pytorch/pytorch/pull/147583, we should be able to handle all of the grid logic directly in _StaticCudaLauncher.launch_kernel, and get rid of the python evaluation.
- Handle launch_enter and exit hooks? (Not sure if inductor has these)
- Benchmarking to see if there's runtime performance loss
- Hooking it up with a config to inductor
- Testing harness to test against torch generated triton kernels

Differential Revision: [D69926783](https://our.internmc.facebook.com/intern/diff/D69926783/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148561
Approved by: https://github.com/aorenste, https://github.com/syed-ahmed
2025-03-14 19:12:13 +00:00
97272e4b49 Fix torch.nn.functional.hardswish gradients corner case (#148049)
Fixes #147801

## Changes

- Change the hardswish gradient condition to match the definition in [torch.nn.functional.hardswish](https://pytorch.org/docs/stable/generated/torch.nn.functional.hardswish.html)
- Enable cuda for test `test_hardswish_grad_corner`
- Add test case for value=-3
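
The corner cases can be illustrated with a scalar sketch in plain Python. It assumes the boundary points x = -3 and x = 3 belong to the outer branches, as in the documented piecewise definition; it is not the kernel code changed by this PR, and the function names are illustrative.

```python
def hardswish(x: float) -> float:
    """hardswish as documented: 0 for x <= -3, x for x >= 3, x*(x+3)/6 otherwise."""
    if x <= -3.0:
        return 0.0
    if x >= 3.0:
        return x
    return x * (x + 3.0) / 6.0


def hardswish_grad(x: float) -> float:
    """Derivative of each branch, using the same boundary conditions (assumed)."""
    if x <= -3.0:
        return 0.0
    if x >= 3.0:
        return 1.0
    return (2.0 * x + 3.0) / 6.0


for value in (-3.0, 0.0, 3.0):
    print(value, hardswish(value), hardswish_grad(value))
```

Under this assumption the gradient is 0 at x = -3 and 1 at x = 3, rather than the values the middle branch would give there.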

## Test Result

```bash
pytest test/test_nn.py -k test_hardswish
pytest test/test_unary_ufuncs.py -k test_hardswish
pytest test/inductor/test_torchinductor.py -k test_hardswish
```

![image](https://github.com/user-attachments/assets/000cb5c4-15f5-4bfd-ab45-f52bf810ff3d)
![image](https://github.com/user-attachments/assets/38b08cf8-ea84-47a2-8e37-0a213da3e0c8)
![image](https://github.com/user-attachments/assets/54bc57be-2c57-46cc-ab90-94ea6cbe1c34)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148049
Approved by: https://github.com/soulitzer
2025-03-14 18:53:10 +00:00
2e02c07a5d [ROCm] enable HIPMallocAsyncAllocator (#149145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145
Approved by: https://github.com/jeffdaily
2025-03-14 18:21:27 +00:00
f2221b2fce [MPS] Add support for i1e (#149203)
Followup after https://github.com/pytorch/pytorch/pull/149174
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149203
Approved by: https://github.com/dcci
2025-03-14 17:33:52 +00:00
f067eafabb [MPS] Modify a test to test the correct function. (#149204)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149204
Approved by: https://github.com/malfet
2025-03-14 17:27:47 +00:00
42e468d9b0 [MPSInductor] Adjust check_bounds (#147205)
To make upper bound inclusive, which fixes `test_vectorized_ops_masked` and results in the following code
```python
mps_lib_0 = compile_mps_shader("""
    #include <c10/metal/random.h>
    #include <c10/metal/special_math.h>
    #include <c10/metal/utils.h>
    kernel void generated_kernel(
        device float* out_ptr0,
        constant float* in_ptr0,
        uint xindex [[thread_position_in_grid]]
    ) {
        int x0 = (xindex) % (64);
        int x1 = (xindex) / (64);
        auto tmp5 = in_ptr0[x0 + 63*x1];
        int x2 = xindex;
        auto tmp0 = x0;
        auto tmp1 = static_cast<long>(tmp0);
        auto tmp2 = 63;
        auto tmp3 = tmp1 < tmp2;
        if (x0 > 63) return;
        auto tmp6 = tmp3 ? tmp5 : 7;
        out_ptr0[x2] = static_cast<float>(tmp6);
    }
""")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147205
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #147211
2025-03-14 17:26:00 +00:00
cyy
a9aae05a6b Remove test decorations on MacOS 12 (#148942)
macOS 12 may reach EOL, according to https://endoflife.date/macos
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148942
Approved by: https://github.com/malfet
2025-03-14 17:22:37 +00:00
f2ea77c099 [MPS] Add inductor support for i0e. (#149180)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149180
Approved by: https://github.com/malfet
2025-03-14 16:15:52 +00:00
71795f159e Revert "[AOTInductor] [BE] Add swap_constant_buffer into pybind for tests. (#149167)"
This reverts commit bea181ff7eeead9fcdd806e286846296c4ab2d67.

Reverted https://github.com/pytorch/pytorch/pull/149167 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D71177501 for the failure. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/149167#issuecomment-2725001232))
2025-03-14 15:16:21 +00:00
706c22549c [MPS] Add support for i0e in eager. (#149174)
Add `special.i0e` to XFAIL_GRADLIST for now, as its backward op is not yet implemented
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149174
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-14 14:43:46 +00:00
68bbe20db7 Add test coverage (#149182)
Summary: Follow up from D71160718

Differential Revision: D71177037

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149182
Approved by: https://github.com/houseroad
2025-03-14 09:38:29 +00:00
c95a6b416b [pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)
Changes in this PR:

1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence.
2. Add a generic class `structseq` that can be used as the registration key for PyStructSequence types, just as `namedtuple` is used for named tuple types.
3. Change `is_namedtuple` to treat subclasses of a namedtuple class as namedtuples. Before this PR, only classes created directly by `collections.namedtuple` or `typing.NamedTuple` counted as namedtuple classes; their subclasses did not. This PR makes `is_namedtuple` return true for those subclasses.
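
As a rough illustration of the distinction, the following standard-library sketch shows one way such checks could be written. The helper names `is_namedtuple_class` and `is_structseq_class` here are local to the example and are heuristics, not the implementations added to pytree by this PR.

```python
import time
from collections import namedtuple


def is_namedtuple_class(cls) -> bool:
    """Heuristic: a tuple subclass whose `_fields` is a tuple of strings."""
    if not (isinstance(cls, type) and issubclass(cls, tuple)):
        return False
    fields = getattr(cls, "_fields", None)
    return isinstance(fields, tuple) and all(isinstance(f, str) for f in fields)


def is_structseq_class(cls) -> bool:
    """Heuristic: a tuple subclass exposing the PyStructSequence counters."""
    if not (isinstance(cls, type) and issubclass(cls, tuple)):
        return False
    return all(
        isinstance(getattr(cls, name, None), int)
        for name in ("n_fields", "n_sequence_fields", "n_unnamed_fields")
    )


Point = namedtuple("Point", ["x", "y"])


class Point3D(Point):
    """A subclass of a namedtuple class; it inherits `_fields`."""


print(is_namedtuple_class(Point), is_namedtuple_class(Point3D))
print(is_structseq_class(time.struct_time), is_structseq_class(Point))
```

Because the check looks at inherited attributes rather than at how the class was created, the subclass `Point3D` is accepted alongside `Point`, which is the behavior item 3 describes.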

Resolves #75982. New tests are included in this PR.

- #75982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
2025-03-14 08:50:30 +00:00
05ac99042f Clean up grid in execution trace (#149159)
Summary: The diff https://www.internalfb.com/diff/D70471332 removed the "grid" input when calling a Triton kernel. The PyTorch execution trace needs the corresponding change, both when capturing an ET and when replaying it.

Test Plan:
buck2 run mode/opt caffe2/test:test_profiler_cuda  -- profiler.test_execution_trace.TestExecutionTraceCUDA.test_execution_trace_with_pt2_cuda

buck2 run mode/opt param_bench/fb/integration_tests:test_et_replay

Differential Revision: D71152464

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149159
Approved by: https://github.com/sraikund16, https://github.com/jansel
2025-03-14 07:12:16 +00:00
be4e6c1c8e Revert "[MPS] Add support for i0e in eager. (#149174)"
This reverts commit b4745db90482ff139ea62d06ec0a18468e1131b7.

Reverted https://github.com/pytorch/pytorch/pull/149174 on behalf of https://github.com/malfet due to MPS are red on trunk ([comment](https://github.com/pytorch/pytorch/pull/149174#issuecomment-2723774600))
2025-03-14 06:35:01 +00:00
e162758051 [MPSInductor] Add bessel_[jy][01] ops (#149179)
By simply calling corresponding special functions

Followup TODO: tweak bessel_y0 to match CPU implementation for `torch.half` dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149179
Approved by: https://github.com/dcci
ghstack dependencies: #149123
2025-03-14 06:33:30 +00:00
d4496346b9 Update logic when producing key name for keep_original_weights (#149171)
Differential Revision: D71160718

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149171
Approved by: https://github.com/houseroad
2025-03-14 05:29:54 +00:00
db6d72213b [MPS] Add torch.special.bessel_[jy][01] implementations (#149123)
By copy-n-pasting functions from
f59064f2b7/aten/src/ATen/native/cuda/Math.cuh (L1463)

With an ugly workaround for `bessel_y[01]` to avoid an internal compiler exception on M1/M2 machines (see FB16863363 / https://gist.github.com/malfet/e7785e4b572e7740887a83a2386ef769 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149123
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-03-14 05:13:55 +00:00
e6839819c8 Revert "[ROCm] Input vectorization in elementwise kernels for tensors with heterogeneous types (#147527)"
This reverts commit 4f8391db55c8c3a574d61d99d6d6a4a0b6723acb.

Reverted https://github.com/pytorch/pytorch/pull/147527 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally.  @albanD, would you be able to help them land the fixes internally? The error looks really simple. See D71152448 for details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/147527#issuecomment-2723531085))
2025-03-14 05:11:01 +00:00
9e6b2ca58d Fix sympy float printing (#147552)
Fixes https://github.com/pytorch/pytorch/pull/147261
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147552
Approved by: https://github.com/bobrenjc93, https://github.com/cyyever
2025-03-14 05:07:06 +00:00
bea181ff7e [AOTInductor] [BE] Add swap_constant_buffer into pybind for tests. (#149167)
Summary:
We add swap_constant_buffer in pybind to add tests.

Test Plan:
python test/inductor/test_aot_inductor.py -k test_update_inactive_constant_buffer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149167
Approved by: https://github.com/chenyang78, https://github.com/jingsh
2025-03-14 04:12:48 +00:00
e567900998 [AOTInductor] Activate CPU test for update_constant_buffer (#149162)
Summary:
Fixed by #145459

Test Plan:
Re-activating tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149162
Approved by: https://github.com/chenyang78, https://github.com/jingsh
2025-03-14 04:09:57 +00:00
aed0b7a742 [c10d] Add param recording for uniqueID broadcasting and allgather (#149166)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149166
Approved by: https://github.com/kwen2501
2025-03-14 03:51:30 +00:00
b4745db904 [MPS] Add support for i0e in eager. (#149174)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149174
Approved by: https://github.com/malfet
2025-03-14 02:51:28 +00:00
c179971bfc xpu: update filter out of dg2 AOT target (#148677)
torch-xpu-ops has updated its list of AOT targets and now uses `dg2` instead of `dg2-g10`. This requires an update in `cpp_extension.py`, which currently filters out `dg2-` prefixed AOT targets.
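
The mismatch can be shown with a hypothetical filter over a comma-separated target list. The function `filter_aot_targets` and the example target strings are assumptions made for illustration; this is not the code in `cpp_extension.py`.

```python
def filter_aot_targets(targets: str) -> str:
    """Drop `dg2` and any `dg2-*` variant from a comma-separated target list."""
    kept = [
        t for t in targets.split(",")
        if t != "dg2" and not t.startswith("dg2-")
    ]
    return ",".join(kept)


# Both spellings of the target are removed by the updated check.
print(filter_aot_targets("pvc,dg2,bmg"))
print(filter_aot_targets("pvc,dg2-g10,bmg"))
```

A check that only tests for the `dg2-` prefix would let the bare `dg2` entry through, which is why the filter has to change when the target name does.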

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148677
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/albanD
2025-03-14 02:24:06 +00:00
56b2e4b8f0 ci: Update linux.20_04 --> linux.24_04 (#149142)
Ubuntu 20.04 is getting deprecated soon so we might as well proactively
move to the latest LTS which is 24.04

> [!NOTE]
> The oldest supported version of python on 24.04 is Python 3.8. Since we test for Python 3.6 compat in our collect_env test we need to have this particular job stick with 20.04 for now until we decide to upgrade it to a newer python version.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149142
Approved by: https://github.com/atalman, https://github.com/wdvr
2025-03-14 02:20:10 +00:00
cyy
e66ad221e9 Use std::string_view in get_fully_qualified_type_name (#145197)
The same as #139164 but open a new PR due to messy history there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145197
Approved by: https://github.com/r-barnes
2025-03-14 01:58:35 +00:00
e8d36019d4 [c10d] Make getDefaultBackend more fault tolerant without relying on exceptions (#149152)
Summary: no-except builds are terminating when this exception is thrown. We should proactively check if a backend is available before calling has_hooks, instead of trying and failing.

Test Plan: CI

Differential Revision: D71144456

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149152
Approved by: https://github.com/kwen2501
2025-03-14 01:27:52 +00:00
15cd6921a5 [export] Fix tensor_constant and buffer naming conflicts in TS converter (#148803)
Summary: In the TS converter, tensor constants are traced as BUFFER and later converted back to CONSTANT_TENSOR, so we need to prevent naming conflicts during the lift-constant pass.

Test Plan: CI

Differential Revision: D70826426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148803
Approved by: https://github.com/angelayi
2025-03-14 00:38:12 +00:00
49570cb402 Revert "Split up cub-RadixSortPairs.cu to parallelize compilation (#148936)"
This reverts commit 9a3d26cfcdb1c1be84a04baa3ee554dbe67cb049.

Reverted https://github.com/pytorch/pytorch/pull/148936 on behalf of https://github.com/ZainRizvi due to Breaks lint in trunk [GH job link](https://github.com/pytorch/pytorch/actions/runs/13845459825/job/38742803351) [HUD commit link](9a3d26cfcd) ([comment](https://github.com/pytorch/pytorch/pull/148936#issuecomment-2722853628))
2025-03-13 22:54:33 +00:00
4cae8f48cc [ROCm] Improve softmax performance (#149076)
This patch improves the performance of softmax for 2D tensors by:

- using a softmax calculation whose shared memory usage does not grow with the size of the tensor: tensor data is read from global memory, while shared memory is still used for the actual reduction step (the shared memory used for the reduction is constant and does not increase with tensor size);
- replacing, in the final computation, the division by the sum with a multiplication by 1/sum, where 1/sum is computed as the last step of the warp reduction;
- replacing the `exp` function with the `__expf` function.

The impact on numerical accuracy is within 1e-5 for half precision and 1e-7 for full precision.

On MI300X, the performance improvement over current runtimes is between 22% and 50%.
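
The reciprocal trick can be sketched on the CPU in plain Python. This is only an illustration of replacing a per-element division with one reciprocal and per-element multiplications; it is not the ROCm kernel, and `softmax_row` is a name chosen for the example.

```python
import math


def softmax_row(row):
    """Numerically stable softmax over one row, scaling by 1/sum instead of dividing."""
    row_max = max(row)
    exps = [math.exp(v - row_max) for v in row]
    inv_sum = 1.0 / sum(exps)  # computed once per row
    return [e * inv_sum for e in exps]


probs = softmax_row([1.0, 2.0, 3.0])
print(probs, sum(probs))
```

The result can differ from the division form in the last bits, which is consistent with the small accuracy impact noted above.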

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149076
Approved by: https://github.com/jeffdaily
2025-03-13 22:07:28 +00:00
9a3d26cfcd Split up cub-RadixSortPairs.cu to parallelize compilation (#148936)
Summary: `cub-RadixSortPairs.cu` has slow compilation times, especially on Windows. These changes split up the file into smaller components to allow each component to compile in parallel. On Windows, I observed a compile time drop from about 20 minutes to 6 minutes.

Differential Revision: D70539649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148936
Approved by: https://github.com/suo, https://github.com/eqy
2025-03-13 22:02:05 +00:00
4098a229a0 Add back fake class registration to test_torchbind (#149137)
Fixes #149121

Summary: as title, to fix https://github.com/pytorch/pytorch/issues/149121

Test Plan:
```
 python test/export/test_torchbind.py
```

Differential Revision: D71129321

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149137
Approved by: https://github.com/yiming0416
2025-03-13 21:26:37 +00:00
e5fccb2bab [pytorch] Fix duplicated Malloc/Free insertion when using IRBuilderBase::CreateMalloc/CreateFree in LLVM 18+ (#149058)
Summary:
A PyTorch unit test hangs when jitting the Tensor kernel. The problem exists for LLVM versions >= 18 due to this upstream change: 45bb45f2ae

`IRBuilderBase::CreateCall` inserts the instruction into the BasicBlock by default, so we don't need to insert the instruction explicitly when compiling the tensor kernel.

Test Plan:
## Test with the release toolchain
```
buck test 'mode/dev' //caffe2/test:jit -- --exact 'caffe2/test:jit - test_concat_invariant (test_jit_fuser_te.TestTEFuserDynamic)'
```
## Test with the Buckified toolchain
Apply this D71046097 to select the LLVM libraries.
```
# Build tests
buck build 'mode/dev-asan' //caffe2/test:jit --show-output
```
```
# Run test (Change HASH and paths accordingly)
HASH="b755f1c435832a1e"

ENABLE_FLATBUFFER=0 FB_OVERRIDE_PYBIND11_GIL_INCREF_DECREF_CHECK=1 MKL_NUM_THREADS=1 NO_MULTIPROCESSING_SPAWN=0 OMP_NUM_THREADS=1 PYTORCH_TEST=1 PYTORCH_TEST_FBCODE=1 PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_DEV_DBG_ASAN=1 PYTORCH_TEST_WITH_TSAN=0 PYTORCH_TEST_WITH_UBSAN=1 SKIP_TEST_BOTTLENECK=1 TENSORPIPE_TLS_DATACENTER=test_dc TEST_PILOT=True TPX_IS_TEST_EXECUTION=true TPX_TIMEOUT_SEC=6000 \
buck-out/v2/gen/$HASH/caffe2/test/__jit__/jit.par --test-filter test_jit_fuser_te.TestTEFuserDynamic.test_concat_invariant
```

Differential Revision: D71046799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149058
Approved by: https://github.com/dcci, https://github.com/Skylion007
2025-03-13 20:37:47 +00:00
38e81a5332 [ROCm] Use generated CK config.h rather than system (#147993)
Prevents PyTorch from potentially using the system version of `config.h`; the CK submodule's version is prioritized instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147993
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-13 20:04:12 +00:00
4f8391db55 [ROCm] Input vectorization in elementwise kernels for tensors with heterogeneous types (#147527)
This patch demonstrates input vectorization for input tensors with types (float, bfloat16) when the functor type is float(float, float).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147527
Approved by: https://github.com/jeffdaily

Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
2025-03-13 19:56:26 +00:00
0dcd482e54 [SDPA] Respect sdpa_kernel's priority_order setting in torch.compile (#147768)
[https://github.com/pytorch/pytorch/pull/140467](https://github.com/pytorch/pytorch/pull/140467) added the option to specify a priority order for SDPA but the `torch.compile` path silently ignored this setting as I wasn't aware of the separate context manager handling on `torch.compile`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147768
Approved by: https://github.com/drisspg
2025-03-13 18:52:34 +00:00
5e1b715dda BC fix for AOTIModelPackageLoader() constructor defaults (#149082)
The default value for `run_single_threaded` was wrongly specified in the .cpp file instead of the header, breaking C++-side instantiation of `AOTIModelPackageLoader` with no arguments. This PR fixes this and adds a test for the use case of running with `AOTIModelPackageLoader` instead of `AOTIModelContainerRunner` on the C++ side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149082
Approved by: https://github.com/desertfire
2025-03-13 18:40:53 +00:00
cyy
970fefcc53 Remove outdated skipCUDAIfCudnnVersionLessThan decoration (#148940)
Test conditions for CUDNN 7 and 8 were removed because we have moved to CUDNN 9.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148940
Approved by: https://github.com/mikaylagawarecki
2025-03-13 18:02:50 +00:00
c73c72b1e1 ci: Update linux_job references to v2 (#149102)
This is probably a bit overdue but trying to update these so we can
finally get rid of all the remnants that rely on non-manylinux2_28 stuff
and conda stuff

Signed-off-by: Eli Uriegas <github@terriblecode.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149102
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/malfet
ghstack dependencies: #149104
2025-03-13 17:31:55 +00:00
77ea66695a ci: Fix check_binary gcc abi check (#149104)
All of our binaries should be built with the cxx11-abi now, so let's fix
this check to reflect reality.

I also noticed that this particular script is not used widely since this
issue should've been caught in nightlies a long time ago.

Maybe worth an investigation to just remove this script if it's not
actually being used.

Signed-off-by: Eli Uriegas <github@terriblecode.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149104
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/malfet
2025-03-13 17:31:55 +00:00
7c87ec1b50 [ca] always do initial trace with dynamic shapes (#148801)
HUD: https://fburl.com/wzvx6tax no regressions (ignore the pass rate improvements, those come from #149030)
<img width="864" alt="image" src="https://github.com/user-attachments/assets/d7598f98-b378-4abb-a0c7-e4311162f681" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148801
Approved by: https://github.com/jansel
ghstack dependencies: #148799, #149030
2025-03-13 17:30:29 +00:00
b263b272fa [ca] fix lazily compiled aot bwd (#149030)
FIXES https://github.com/pytorch/pytorch/issues/137372

Sometimes the AOT backward is lowered lazily, so the bw_module we saved in CompiledFunction._lazy_backward_info hasn't gone through the post-grad passes, specifically the view_to_reshape pass. Running it directly will then sometimes error, because the AOT forward has already changed its views to reshapes, and that is reflected in the gradients we see in CA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149030
Approved by: https://github.com/bdhirsh
ghstack dependencies: #148799
2025-03-13 17:30:29 +00:00
e6f560a262 [ca] support for dynamic shapes CopySlices (#148799)
I'm changing the CA initial trace to always trace as dynamic, which fixes these errors:
```python
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.2139s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_autograd_python_custom_function_inplace - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_autograd_python_custom_function_inplace
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.0057s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_copy_slices_graph_task_updates - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_copy_slices_graph_task_updates
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.9662s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_inplace_on_view_weak_grad_fn - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_inplace_on_view_weak_grad_fn
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.0077s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_leaf_assignment - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_leaf_assignment
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [5.0485s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_setitem_mask - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_setitem_mask
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.0102s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_tensor_hooks_inplace_over_view - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_tensor_hooks_inplace_over_view
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148799
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-03-13 17:30:20 +00:00
e84cc4c052 Update Kineto Submodule (#149089)
Summary: We have made a lot of changes in Kineto this month. It is a good idea to update the submodule now, especially since the roctracer-sdk change will be very large.

Test Plan: CI

Differential Revision: D71082829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149089
Approved by: https://github.com/Skylion007
2025-03-13 17:18:16 +00:00
6856d81c60 [BE]: Update CU128 cudnn to 9.8.0.87 (#148963)
Also, cu12.6 is on an old CUDNN version; we may want to upgrade it for performance reasons, as I don't see a manywheel Linux reason to stay back on the old 9.5 release. I might split that into its own PR. This one just updates CU128 to the latest and greatest.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148963
Approved by: https://github.com/jansel, https://github.com/eqy, https://github.com/nWEIdia, https://github.com/tinglvv, https://github.com/atalman
2025-03-13 16:59:12 +00:00
b9803a5c81 [AOTI] Re-enable AOTI cpp unit test (#149085)
Summary: test_inductor_aoti was removed by accident previously. Add it back.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149085
Approved by: https://github.com/jbschlosser
2025-03-13 16:00:38 +00:00
3e605fe46d [CUDAGraph] Graph Partition (#147648)
This PR implements cudagraph partition, following previous PR on inductor graph partition (#147038). Since there are many ops that cudagraph cannot support, this PR focuses on `cpu ops` and will add more partition rules in the next PR.

## Example
```python
import torch

torch._inductor.config.graph_partition = True

def f(x, y):
    x1 = x + 1
    y1 = y + 1
    y_cpu = y1.cpu() + 1
    z = x @ y
    return x1 + y1 + z + y_cpu.cuda()

x, y = [torch.ones(2, 2, device="cuda") for _ in range(2)]
x_cloned, y_cloned = [tmp.clone() for tmp in [x,y]]
eager_out = f(x, y)

f_compiled = torch.compile(f, mode="reduce-overhead")

for _ in range(5):
    compiled_out = f_compiled(x_cloned, y_cloned)
    assert torch.allclose(eager_out, compiled_out)
```

w/o graph partition, we will skip cudagraph:
```
skipping cudagraphs due to skipping cudagraphs due to cpu device (device_put). Found from :
   File "/home/boyuan/playground/cudagraph/graph_partition/graph_partition.py", line 9, in f
    y_cpu = y1.cpu() + 1 # 3
```

w/ graph partition, we can see two cudagraphify under the same torch-compiled region:
![image](https://github.com/user-attachments/assets/4e22d428-2687-433d-b92a-0814a2201b25)

## Design

PR #147038 splits the `def call(args)` function into multiple `def partition_id(args)` functions. In this PR, we use `recursively_apply_fns()` to wrap each `partition_id()` function with `cudagraphify`. One major design point is that `cudagraphify` takes metadata such as `static_input_idxs`, and we need to provide such metadata for each graph partition. However, this metadata was previously available only for the original graph, not for the graph partitions.

The [idea](https://github.com/pytorch/pytorch/pull/147038#discussion_r1964124800) is:
- compute a mapping from the partition metadata (e.g., input/output idx) to the graph metadata, stored in `GraphPartitionMap`.
- during post_compile, get the `CudagraphMetadata` for each partition based on the graph-level metadata and `GraphPartitionMap`, via `get_partition_cudagraph_metadata()`.
- finally, in `cudagraph_partition_post_compile`, we compute the `CudagraphMetadata` and apply cudagraphify for each graph via `recursively_apply_fns`.
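
The index translation in the steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the PR's implementation: `PartitionMap`, `input_index_mapping`, and `partition_static_input_idxs` are hypothetical names, and the real `GraphPartitionMap` carries more information.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartitionMap:
    # Hypothetical stand-in for GraphPartitionMap: for each partition input,
    # the index of the matching graph input, or None when the input is an
    # intermediate produced by an earlier partition.
    input_index_mapping: list[Optional[int]]


def partition_static_input_idxs(
    graph_static_input_idxs: set[int], pmap: PartitionMap
) -> list[int]:
    """Translate graph-level static input indices into partition-level ones."""
    return [
        part_idx
        for part_idx, graph_idx in enumerate(pmap.input_index_mapping)
        if graph_idx is not None and graph_idx in graph_static_input_idxs
    ]


# Graph inputs 0 and 2 are static (e.g. parameters).
# The partition takes graph input 2, an intermediate buffer, and graph input 1.
pmap = PartitionMap(input_index_mapping=[2, None, 1])
result = partition_static_input_idxs({0, 2}, pmap)
print(result)  # [0]
```

Only partition input 0 maps back to a static graph input, so it is the only index reported as static for this partition.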

#### Q: How does it work with codecache?

While we have multiple graph partitions, we still have 1 file and 1 `call` function for 1 dynamo graph. The major difference is we need to additionally load a `recursively_apply_fns()` for graph partition. We also add `partition_maps: Optional[list[GraphPartitionMap]]` to `CompiledFxGraph` so it will be serialized and could be deserialized later.

## Edge Case 1
PyTorch has an assumption on input/output orders. For example, backward inputs take saved tensors first and then tangents. In graph partition, we respect such orders via `graph_partition_signature_reorder`.

## Edge Case 2
Cudagraphifying `call` function gives 2 cudagraph managed tensors `buf0` and `primals_1`. However, cudagraphifying `partition_0` gives only 1 cudagraph managed tensor `buf0`. This leads to a semantic difference between cudagraph w/ and w/o graph partition. [full code comparison](https://www.internalfb.com/intern/diffing/?paste_number=1747654420)

![image](https://github.com/user-attachments/assets/03d08ce0-f1d1-4d1d-8432-805a07e1dd40)

To achieve the same semantics, we return an input tensor as an output if it is not freed in a graph partition. This allows more cudagraph-managed tensors and is important for handling saved tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147648
Approved by: https://github.com/eellison
2025-03-13 16:00:21 +00:00
65d19a5699 Remove runtime dependency on packaging (#149092)
It looks like, after https://github.com/pytorch/pytorch/pull/148924, we are seeing this error in the nightly tests:
https://github.com/pytorch/pytorch/actions/runs/13806023728/job/38616861623

```
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/pattern_matcher.py", line 79, in <module>
    from .lowering import fallback_node_due_to_unsupported_type
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/lowering.py", line 7024, in <module>
    from . import kernel
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/kernel/__init__.py", line 1, in <module>
    from . import mm, mm_common, mm_plus_mm
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/kernel/mm.py", line 6, in <module>
    from packaging.version import Version
ModuleNotFoundError: No module named 'packaging'
```

Hence this PR removes the runtime dependency on `packaging`, since it may not be installed by default.
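
One common way to compare versions without `packaging` is a small tuple-based parser. The sketch below is an assumption about how such a check could look, not the code from this PR; `parse_release` and `at_least` are hypothetical helpers, and pre-release suffixes such as `rc1` are deliberately ignored.

```python
import re


def parse_release(version: str) -> tuple[int, ...]:
    """Return the leading numeric release segment, e.g. '3.2.0+git1' -> (3, 2, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric release segment in {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def at_least(version: str, minimum: str) -> bool:
    """True if `version` is >= `minimum`, comparing numeric segments only."""
    a, b = parse_release(version), parse_release(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '3.2' and '3.2.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


print(at_least("3.2.0+git1", "3.2"))  # True
print(at_least("2.1", "2.1.1"))       # False
```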

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149092
Approved by: https://github.com/drisspg, https://github.com/davidberard98
2025-03-13 14:53:13 +00:00
f59064f2b7 [FIX] remove the duplicate key in DEFAULT_STATIC_QUANT_MODULE_MAPPINGS (#149043)
nn.Dropout appeared at line 81
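
Python accepts a dict literal with a repeated key and silently keeps the last value, so duplicates like this are easy to miss. A minimal sketch of a checker using only the standard `ast` module follows; `duplicate_dict_keys` is a hypothetical helper and the sample mapping is illustrative, not copied from the PyTorch source.

```python
import ast


def duplicate_dict_keys(source: str) -> list[str]:
    """Return the source text of keys repeated within any dict literal."""
    duplicates = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Dict):
            seen = set()
            for key in node.keys:
                if key is None:  # `**other` unpacking has no key node
                    continue
                text = ast.unparse(key)
                if text in seen:
                    duplicates.append(text)
                seen.add(text)
    return duplicates


sample = """
MAPPING = {
    nn.Dropout: nnq.Dropout,
    nn.Linear: nnq.Linear,
    nn.Dropout: nnq.Dropout,
}
"""
print(duplicate_dict_keys(sample))  # ['nn.Dropout']
```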
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149043
Approved by: https://github.com/jingsh
2025-03-13 12:42:33 +00:00
bdf57fb8f7 [AOTI][refactor] Split MiniArrayRef into a separate header (#149073)
Summary: MiniArrayRef is a common utility and will be used by the libtorch-free AOTI.

Differential Revision: [D71064657](https://our.internmc.facebook.com/intern/diff/D71064657)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149073
Approved by: https://github.com/yushangdi
2025-03-13 11:57:32 +00:00
a8b1767ae5 [DTensor] Fix local_map with multi-threading (#149070)
Using `nonlocal device_mesh` is not safe with multi-threading
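
A variable rebound through `nonlocal` lives in a single closure cell shared by every thread that calls the function. One thread-safe alternative, shown here as a general sketch and not necessarily what this PR does, is to keep the state per thread with `threading.local`. The names `resolve_mesh` and `worker` are hypothetical.

```python
import threading

_state = threading.local()  # each thread gets its own attribute namespace


def resolve_mesh(candidate: str) -> str:
    # Hypothetical helper: cache the first mesh seen, separately per thread.
    if getattr(_state, "mesh", None) is None:
        _state.mesh = candidate
    return _state.mesh


results = {}


def worker(name: str) -> None:
    results[name] = resolve_mesh(f"mesh-for-{name}")


threads = [threading.Thread(target=worker, args=(f"t{i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
```

Each thread resolves its own mesh, whereas a shared closure variable would let the first thread's value leak into the others.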

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149070
Approved by: https://github.com/wanchaol
2025-03-13 10:58:59 +00:00
df60500ab8 Fix too big to optimize in test, actually use O0 when aot_inductor.compile_wrapper_with_O0 is set (#148714)
Summary:
1. Check against the "0" char instead

2. We got the following error when using anything other than O0 flag: `error: Function ZN5torch12aot_inductorL22__check_inputs_outputsEPP16AtenTensorOpaqueS3 is too big to optimize [-Werror,-Wignored-optimization-argument]` So we use O0 flag in wrapper code when `aot_inductor.compile_wrapper_opt_level` is set to `O0`.

Test Plan:
```
 buck run  'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/cpu/test:ads_second_stage_dsnn_models_aoti_lowering_test -- -r AdsSecondStageDSNNModelsAOTILoweringTest
```

Differential Revision: D70670957

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148714
Approved by: https://github.com/desertfire
2025-03-13 10:22:06 +00:00
96a6a71ac7 skip test_torch_dynamo_codegen_pow if CPU backend is not cpp (#146595)
The test asserts that `aten.pow` is not present in the generated kernel code. When using a CPU backend other than cpp, the kernel contains comments referencing the aten ops that produced the kernel, in this case `aten.pow`.

This PR skips that test case if the CPU backend is not cpp.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146595
Approved by: https://github.com/williamwen42
2025-03-13 10:03:29 +00:00
d90f9e9a34 [inductor] Fix issue with set_linter, improve linter framework (#144620)
### `set_linter` only

* Fix gnarly [bug](dbed747aae/tools/test/set_linter_testdata/python_code.py.txt.python (L42)) which would have garbled Python files involving sets contained in sets.
* Better handling of new Python3.12 token types

### Both linters.

* Recover from and report on unparseable Python files
* Remove `ParseError.check()` (it made it harder to read the code)
* FileLinter is now generic on `PythonFile`

### Notes

As I started working on new docstring features, I found a nasty bug and an edge case bug in set linter, and realized both the linters crash when there is a badly-formed Python file in the repo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144620
Approved by: https://github.com/amjames, https://github.com/jansel
2025-03-13 09:49:40 +00:00
f4bffb7461 [docs] fix autograd description on convex function case (#148658)
The sub-gradient of minimum norm is the least steep descent direction.

```python
import torch

x = torch.tensor([-2, -1, 0, 1, 2.], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad) # tensor([0., 0., 0., 1., 1.])

y = torch.tensor([-2, -1, 0, 1, 2.], requires_grad=True)
torch.abs(y).sum().backward()
print(y.grad) # tensor([-1., -1.,  0.,  1.,  1.])
```

(How can I request a reviewer? I don't have the button on the right)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148658
Approved by: https://github.com/lezcano
2025-03-13 09:06:15 +00:00
75c8b7d972 [Profiler][HPU] Fix incorrect availabilities for HPU (#148663)
Fixes #148661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148663
Approved by: https://github.com/jeromean, https://github.com/albanD
2025-03-13 08:03:52 +00:00
ec93aa7f84 fix cuDNN SDPA meta registration (#148921)
Update the `cuDNN SDPA` meta registration to match the memory layout behavior in: https://github.com/pytorch/pytorch/pull/138354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148921
Approved by: https://github.com/drisspg, https://github.com/jbschlosser
2025-03-13 07:33:16 +00:00
2a7d583452 Consolidate torchbind fake class registration (#149063)
Summary: Remove duplicated fake class registration

Test Plan: CI

Differential Revision: D71052419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149063
Approved by: https://github.com/angelayi
2025-03-13 06:57:13 +00:00
c208f21791 [Dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/base.py (#148177)
Part of #147913

Replace `unimplemented` with `unimplemented_v2` in `torch/_dynamo/variables/base.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148177
Approved by: https://github.com/williamwen42
2025-03-13 06:35:51 +00:00
037d7af778 [Inductor UT] Enable PYTORCH_TESTING_DEVICE_ONLY_FOR test case filter for test_torchinductor.py (#149023)
The environment variable `PYTORCH_TESTING_DEVICE_ONLY_FOR` controls the devices
in `get_desired_device_type_test_bases`, so we add `RUN_CPU` and `RUN_GPU` to
make sure cases are only enabled for the devices specified in `PYTORCH_TESTING_DEVICE_ONLY_FOR`.
For example, when only GPU is specified, only GPU cases are enabled; CPU cases stay disabled even if `HAS_CPU` is true.
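
The gating described above can be sketched as follows. This is a simplified illustration, not the code in `test_torchinductor.py`: `device_enabled` is a hypothetical helper, and the sketch assumes the variable holds a comma-separated list of device names.

```python
import os


def device_enabled(device: str, has_device: bool, env=None) -> bool:
    """A device runs only if it is present AND allowed by the filter."""
    env = os.environ if env is None else env
    only_for = env.get("PYTORCH_TESTING_DEVICE_ONLY_FOR", "")
    if not only_for:
        return has_device  # no filter set: fall back to availability
    allowed = {name.strip() for name in only_for.split(",")}
    return has_device and device in allowed


# Simulate a run restricted to CUDA on a machine that has both devices.
env = {"PYTORCH_TESTING_DEVICE_ONLY_FOR": "cuda"}
RUN_CPU = device_enabled("cpu", has_device=True, env=env)
RUN_GPU = device_enabled("cuda", has_device=True, env=env)
print(RUN_CPU, RUN_GPU)  # False True
```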

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149023
Approved by: https://github.com/jansel, https://github.com/cyyever
2025-03-13 05:15:28 +00:00
7cdbb913e7 [logging] Set compile_id in the CachingAutotuner during compilation so we have it for dynamo_timed logging (#148693)
Summary: This is a simpler alternative to https://github.com/pytorch/pytorch/pull/146455, where we can stick the compileId (and forward/backward bool) in the CachingAutotuner so that we have it for logging `benchmark_all_configs`. Recall that the first attempt put the compileId in the inductor_meta and that interfered with caching.

Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
* tlparse: https://fburl.com/e71yn6uc
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/4ageghhv
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/4fgv1itq

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148693
Approved by: https://github.com/eellison
2025-03-13 03:50:58 +00:00
3646d4dbc8 [partitioner] always ban compiler-driven recompute of collectives by default (#147561)
This should fix the hang in https://fb.workplace.com/groups/1075192433118967/permalink/1603268720311333/

The argument here is that:

(1) in general, it is not safe for the partitioner to sometimes choose to recompute collectives in the backward. Why? If we are running a distributed job, where many ranks are compiling at the same time, we need every rank to make a consistent decision about which collectives are recomputed for backward. If we let each compiler instance make its own choice without any cross-rank communication, they can make different choices and cause NCCL hangs (see the link above)

(2) later on, we'll want an `spmd_mode` flag that causes the compiler to issue collectives and communicate info across ranks. Once we have such a config, then turning it on should make it safe for the partitioner to potentially choose to recompute collectives (and agree on the binary "recompute-or-save" choice across all ranks)

(3) even without an `spmd_mode`, users can override this choice by using `torch.utils.checkpoint()` in their user code. User checkpointing generally always overrides the partitioner, and this should be safe because we expect the user to apply checkpointing consistently across ranks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147561
Approved by: https://github.com/zou3519
2025-03-13 03:36:13 +00:00
420a9be743 [regression] Fix pin_memory() when it is called before device lazy initialization. (#149033)
PR #145752 added a check in `isPinnedPtr` that verifies a device is initialized before checking whether the tensor is pinned. That PR also added a lazy initialization trigger when `at::empty` is called with the pinned parameter set to true. However, when a tensor is first created and then pinned in a separate call to `pin_memory()`, lazy device initialization is not triggered, so `is_pinned` always returns false.

With this PR, the lazy initialization is moved to the `getPinnedMemoryAllocator` function, which ensures the device is initialized before we pin a tensor.
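
The effect of moving the initialization can be modelled with a toy class. This is a pure-Python sketch of the control flow only; `FakeDevice` and its methods are hypothetical and do not correspond to the ATen API.

```python
class FakeDevice:
    """Toy model of a device with lazy initialization (hypothetical)."""

    def __init__(self) -> None:
        self.initialized = False
        self.pinned: set[int] = set()

    def lazy_init(self) -> None:
        self.initialized = True

    def get_pinned_allocator(self) -> set[int]:
        # The fix described above: initialize here, so every pinning
        # path (creation-time or a later pin call) is covered.
        self.lazy_init()
        return self.pinned

    def pin(self, tensor_id: int) -> None:
        self.get_pinned_allocator().add(tensor_id)

    def is_pinned(self, tensor_id: int) -> bool:
        # Mirrors the earlier check: an uninitialized device reports "not pinned".
        return self.initialized and tensor_id in self.pinned


device = FakeDevice()
before = device.is_pinned(1)
device.pin(1)
after = device.is_pinned(1)
print(before, after)  # False True
```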

Fixes #149032

@ngimel @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149033
Approved by: https://github.com/ngimel, https://github.com/albanD
2025-03-13 02:56:24 +00:00
f2d43d866c [cutlass backend] switch layout for cutlass backend benchmark (#149009)
```
python benchmarks/inductor_backends/cutlass.py
```

logs:
```
Experiment group: mm (1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 13.059554621577263 |  1.580178506206721   |         NA          |
|        triton         | 10.245470330119133 | 0.04118620231747627  | -21.54808776410064  |
| triton_persistent_tma | 10.388538241386414 | 0.04225084185600281  | -20.45258400908819  |
|  cutlass_lvl_default  | 12.882896699011326 |  231.14990583620965  | -1.3527101626732294 |
|   cutlass_lvl_1111    | 11.362981051206589 |  126.41650272067636  | -12.99105229490415  |
|   cutlass_lvl_2222    | 11.107578873634338 |  555.8380545829423   | -14.946725248331441 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (1024x1024, 1024x1024) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 14.037585817277431 | 0.21587548777461052  |         NA          |
|        triton         | 10.571777820587158 |  78.15654796129093   | -24.68948750735019  |
| triton_persistent_tma | 10.761583223938942 |  1.3195342738181353  | -23.337364672110443 |
|  cutlass_lvl_default  | 12.872588820755482 |  237.0100042372942   | -8.299126443010406  |
|   cutlass_lvl_1111    | 11.08622644096613  |  137.55013868492097  | -21.02469338195443  |
|   cutlass_lvl_2222    | 11.044904589653015 |   551.265836935956   | -21.319059178545007 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 30.483894050121307 | 0.27990864124149084  |         NA          |
|        triton         | 29.567627236247063 |  99.87172158574685   | -3.005740711366232  |
| triton_persistent_tma | 29.66325916349888  |  1.3695051120594144  | -2.692027748401006  |
|  cutlass_lvl_default  | 29.82821688055992  |  72.61214569816366   | -2.150897022812533  |
|   cutlass_lvl_1111    | 29.476772993803024 |   67.7428645719774   | -3.303780857728953  |
|   cutlass_lvl_2222    | 30.113255605101585 |  233.84051702311262  | -1.2158500630212203 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 30.58255836367607  | 0.058386584743857384 |         NA          |
|        triton         | 29.799651354551315 |  100.18178300186992  | -2.559978795150901  |
| triton_persistent_tma | 29.362043365836143 |  1.534341821912676   | -3.990885861562106  |
|  cutlass_lvl_default  |  29.4346883893013  |  73.68858492700383   | -3.7533484305817093 |
|   cutlass_lvl_1111    | 29.164200648665428 |  75.44329373072833   | -4.637799421958348  |
|   cutlass_lvl_2222    | 29.13798950612545  |  227.33327346481383  |  -4.7235056020244   |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.float16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 1656.6237211227417 |  0.0549461180344224  |         NA         |
|        triton         | 1892.8285837173462 |  2.3174119112081826  | 14.258208401997386 |
| triton_persistent_tma | 1665.332317352295  |  2.7922237082384527  | 0.525683419747917  |
|  cutlass_lvl_default  | 1705.5492401123047 |  108.31571159465238  | 2.9533272019312116 |
|   cutlass_lvl_1111    | 1714.9059772491455 |  17.64627545280382   | 3.518134829489478  |
|   cutlass_lvl_2222    | 1680.4152727127075 |  306.9972395859659   | 1.4361469829637354 |
+-----------------------+--------------------+----------------------+--------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 1621.416687965393  | 0.06300561130046844  |         NA         |
|        triton         | 1782.3902368545532 |  2.318530729971826   | 9.927956834535548  |
| triton_persistent_tma | 1586.0934257507324 |  2.7931175641715527  | -2.178543151605614 |
|  cutlass_lvl_default  | 1657.4617624282837 |  43.31810224894434   | 2.2230605328307784 |
|   cutlass_lvl_1111    | 1641.5367126464844 |  17.648567833006382  | 1.2408916739557292 |
|   cutlass_lvl_2222    | 1645.8417177200317 |  249.33647010894492  | 1.5064005407078918 |
+-----------------------+--------------------+----------------------+--------------------+
```
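
The `perf_over_aten (%)` column is consistent with the relative change in forward time against the aten row. The formula below is inferred from the numbers in the tables, not taken from the benchmark script, so treat it as an assumption; negative values mean the backend is faster than aten.

```python
def perf_over_aten(forward_time_us: float, aten_time_us: float) -> float:
    """Relative change in forward time versus aten, in percent."""
    return (forward_time_us / aten_time_us - 1.0) * 100.0


# Values from the first table: mm (1024x1024, 1024x1024) torch.float16.
aten = 13.059554621577263
triton = 10.245470330119133
print(round(perf_over_aten(triton, aten), 2))  # -21.55
```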

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149009
Approved by: https://github.com/chenyang78, https://github.com/jingsh
2025-03-13 01:57:47 +00:00
4a12777ffe [Partitioner] Remove unnecessary upstream nodes in dependency viewer (#146580)
We iterate over upstream nodes to update the partition map, but this actually does nothing: we visit nodes in reversed topological order (https://github.com/pytorch/pytorch/pull/136608/files#diff-f2f9dd3903fd99955732eb694941fea0cb7301a58d59554787f3311d417e5615L193), so no upstream node has an assignment yet. Remove the loop to reduce the for-loop overhead, which is up to O(N * N).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146580
Approved by: https://github.com/Skylion007, https://github.com/jerome-habana
2025-03-13 01:42:10 +00:00
1e37e5b836 Update nightly PyTorch version to 2.8.0 (#149038)
Branch for 2.7: https://github.com/pytorch/pytorch/tree/release/2.7
Same as https://github.com/pytorch/pytorch/pull/135916

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149038
Approved by: https://github.com/ZainRizvi
2025-03-12 23:51:04 +00:00
e51615cb73 Revert "[Profiler][HPU] Fix incorrect availabilities for HPU (#148663)"
This reverts commit 28b78800b92a4d847a2360ab0e0b87d3e00a6138.

Reverted https://github.com/pytorch/pytorch/pull/148663 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @albanD, could you please help get this relanded? See D71052806 for more details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148663#issuecomment-2719297055))
2025-03-12 22:52:11 +00:00
b1980b2405 Revert "Make dynamism code robust to NotImplementedException (#148823)"
This reverts commit 60576419a2a5cc09e4a92be870fda8f3fc305ddc.

Reverted https://github.com/pytorch/pytorch/pull/148823 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D71042206 for details. To validate your fixes internally before relanding, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148823#issuecomment-2719287467))
2025-03-12 22:45:39 +00:00
38c5cf99b3 [CI] Don't clean workspace when fetching repo (#147994)
Tested on https://github.com/pytorch/pytorch/pull/148995
Do two checkouts: the first attempts to reuse an existing checkout if possible. If the first fails, the second removes the workspace and re-pulls everything.

This is probably not going to be useful if we switch entirely to ephemeral runners but w/e

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147994
Approved by: https://github.com/malfet, https://github.com/atalman
2025-03-12 22:29:52 +00:00
3f1769f785 Add ninja to requirements-ci for all arch (#148778)
So I can get ninja_logs for the builds
No negative consequences afaik
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148778
Approved by: https://github.com/malfet, https://github.com/atalman
2025-03-12 22:07:46 +00:00
0c8ec26d3b [ROCm][TunableOp] hipblaslt tf32 support (#145946)
TF32 is supported by hipblaslt. Support added by #143549.  This PR expands integration to the TunableOp feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145946
Approved by: https://github.com/pruthvistony, https://github.com/echen4096, https://github.com/yoyoyocmu

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2025-03-12 21:17:11 +00:00
ab45aaca97 Set non-strict export as default mode (#148790)
Summary:
- Flip the default value of strict argument in torch.export.export from True to False
- Update test infra to cope with the change; some tests assumed strict mode was the default
- Disabled some tests that fail in non-strict mode

Test Plan: Sandcastle

Differential Revision: D70228628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148790
Approved by: https://github.com/angelayi
2025-03-12 21:10:58 +00:00
e3ebf61589 Create and send full_tensor on ProcessGroup-supported device in _broadcast_tensors (#148865)
Fixes #138842

`device` is always the device of the `local_state_dict`, which may be CPU, and CPU is not supported by the NCCL backend.

Instead, create broadcasted tensors on one of `pg._device_types` and then move the tensors back if `local_state_dict`'s `device` was not supported by the `ProcessGroup`.
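
The device choice can be sketched with plain strings standing in for device types. `pick_broadcast_device` is a hypothetical helper, and using the first supported type as the fallback is an assumption made for this illustration.

```python
def pick_broadcast_device(state_device: str, pg_device_types: list[str]) -> str:
    """Choose a device type the process group can actually communicate on."""
    if not pg_device_types:
        raise ValueError("process group reports no supported device types")
    if state_device in pg_device_types:
        return state_device  # already supported: no move needed
    return pg_device_types[0]  # assumption: fall back to the first supported type


state_device = "cpu"
comm_device = pick_broadcast_device(state_device, ["cuda"])
needs_move_back = comm_device != state_device
print(comm_device, needs_move_back)  # cuda True
```

When `needs_move_back` is true, the broadcast tensors would be moved back to the state dict's device after the collective completes.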

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148865
Approved by: https://github.com/mori360
2025-03-12 20:56:31 +00:00
b5191b9312 [codemod][lowrisk] Fix deprecated use of 0/NULL in caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/src/fc-unpack.cc + 1 (#148996)
Summary:
`nullptr` is typesafe. `0` and `NULL` are not. In the future, only `nullptr` will be allowed.

This diff helps us embrace the future _now_ in service of enabling `-Wzero-as-null-pointer-constant`.

Test Plan: Sandcastle

Reviewed By: dtolnay

Differential Revision: D70939306

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148996
Approved by: https://github.com/Skylion007
2025-03-12 20:06:19 +00:00
b90698f5ba [CUDA] try to abate some flakiness in test_stream_event_nogil (#148796)
Threshold twiddling, as roughly one in a few dozen runs tends to fail the current threshold.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148796
Approved by: https://github.com/Skylion007
2025-03-12 19:12:50 +00:00
215f856142 Add XPU device to nested_layer_norm (#148593)
Work with https://github.com/intel/torch-xpu-ops/pull/1416 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148593
Approved by: https://github.com/guangyey, https://github.com/jbschlosser
2025-03-12 19:07:08 +00:00
66300d3d55 [cutlass backend] try make cutlass backend benchmark more robust (#149015)
Differential Revision: [D71006269](https://our.internmc.facebook.com/intern/diff/D71006269/)

I want to make sure the benchmark can still print most of the results even if some experiments fail.

```
Experiment group: mm (3x3, 3x3) torch.bfloat16
+-----------------------+-------------------+----------------------+---------------------+
|         name          | forward_time (us) | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+-------------------+----------------------+---------------------+
|         aten          | 6.175220478326082 |  0.5982149520423263  |         NA          |
|        triton         | 5.326753947883844 |  3.2067150759976357  | -13.739858089605114 |
| triton_persistent_tma | 5.340870004147291 |  3.279932268196717   | -13.51126615004617  |
|  cutlass_lvl_default  |        inf        |         inf          |         inf         |
|   cutlass_lvl_1111    |        inf        |         inf          |         inf         |
|   cutlass_lvl_2222    |        inf        |         inf          |         inf         |
|   cutlass_lvl_3333    |        inf        |         inf          |         inf         |
+-----------------------+-------------------+----------------------+---------------------+
```
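
The `inf` rows above suggest that a failing experiment is recorded and skipped, not allowed to abort the run. A minimal sketch of that pattern follows; `run_experiments` and `failing_backend` are hypothetical names, not functions from the benchmark script.

```python
import math


def run_experiments(experiments: dict) -> dict:
    """Run each experiment; record inf when one fails so the run continues."""
    results = {}
    for name, fn in experiments.items():
        try:
            results[name] = fn()
        except Exception as exc:  # keep going so the remaining rows still print
            print(f"{name} failed: {exc}")
            results[name] = math.inf
    return results


def failing_backend() -> float:
    raise RuntimeError("no viable kernel for this shape")


results = run_experiments(
    {"aten": lambda: 6.18, "cutlass_lvl_default": failing_backend}
)
print(results)
```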
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149015
Approved by: https://github.com/chenyang78, https://github.com/jingsh
2025-03-12 18:59:49 +00:00
86bc154d61 [scan] Flattened output of HOP scan (#148955)
This is required because downstream operations expect HOPs to return a flattened list of output elements.
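
As an illustration of what "flattened" means here, the sketch below turns a nested output structure into a flat list of leaves. It is a stdlib-only stand-in for pytree flattening; `flatten` is a hypothetical helper and the string leaves stand in for tensors.

```python
def flatten(tree) -> list:
    """Depth-first flatten of nested tuples/lists/dicts into a list of leaves."""
    if isinstance(tree, (list, tuple)):
        return [leaf for child in tree for leaf in flatten(child)]
    if isinstance(tree, dict):
        return [leaf for child in tree.values() for leaf in flatten(child)]
    return [tree]


nested = ("carry", ["out0", {"a": "out1", "b": "out2"}])
flat = flatten(nested)
print(flat)  # ['carry', 'out0', 'out1', 'out2']
```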

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148955
Approved by: https://github.com/ydwu4
2025-03-12 18:27:27 +00:00
fb0e9cb0a0 Remove warnings on non-buffer tensor constants (#148483)
Export already registers tensor constants directly in the graph, and this is also true for Torchbind objects. This removes the warning, which pollutes the output.

Differential Revision: [D70577856](https://our.internmc.facebook.com/intern/diff/D70577856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148483
Approved by: https://github.com/zhxchen17, https://github.com/zou3519
ghstack dependencies: #148364
2025-03-12 18:20:04 +00:00
29fd875bc1 Automate stable CUDA update and linter using min Python version (#148912)
1. Fixes: https://github.com/pytorch/pytorch/issues/145571 . The stable CUDA version is the one published to PyPI; it is also used to set the Metadata section in the rest of the whl scripts and to tag the Docker releases with the `latest` tag.
2. Updates min python version used in linter
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148912
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-03-12 18:12:34 +00:00
01e9036bd2 skip torchbind in constant folding (#148993)
Summary:
Do not fold torchbind objects in constant folding

Any operation on these torchbind objects can have arbitrary side effects, so we can't effectively constant fold anything torchbind-obj-related anyway.

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile_constant_folding
```

Reviewed By: angelayi

Differential Revision: D69946541

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148993
Approved by: https://github.com/angelayi
2025-03-12 18:08:08 +00:00
923ce10f6c [while_loop] require stride to be the same as input for body_fn (#148002)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148002
Approved by: https://github.com/zou3519
2025-03-12 17:15:10 +00:00
28b78800b9 [Profiler][HPU] Fix incorrect availabilities for HPU (#148663)
Fixes #148661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148663
Approved by: https://github.com/jeromean, https://github.com/Skylion007, https://github.com/EikanWang, https://github.com/albanD
2025-03-12 17:06:57 +00:00
b040dc3a53 Reland: [inductor] Simplify grid handling (#148305)
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583

Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg.  This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```

This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.
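The `grid_0` expression in the launcher above is a ceiling division by 1024 written as a shift. A minimal sketch, assuming a block size of 1024 as read off the example; `grid_for` is a hypothetical helper, not inductor code.

```python
def grid_for(xnumel: int, xblock: int = 1024) -> tuple[int, int, int]:
    # Ceiling division: number of blocks of size xblock needed to cover xnumel.
    grid_0 = (xnumel + xblock - 1) // xblock
    return (grid_0, 1, 1)


for n in (1, 1024, 1025, 4096):
    # For a block size of 1024 the shift form from the launcher gives the same value.
    assert grid_for(n)[0] == (n + 1023) >> 10
```

Specializing this expression per config means the launcher can use a constant block size instead of calling a generic grid closure.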

It also allows us to unify the handling of grids between the Python and C++ wrapper code.  Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.

This unification allows this PR to be a net deletion of code.

Differential [disconnected] Revision: D70471332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-03-12 15:52:16 +00:00
626a5e22eb Revert "[CI] Don't clean workspace when fetching repo (#147994)"
This reverts commit e5fef8a08ebb8548e8413ae54ef0ad9a11f1f4c0.

Reverted https://github.com/pytorch/pytorch/pull/147994 on behalf of https://github.com/clee2000 due to broke checkout on xpu, probably lack of sudo? ([comment](https://github.com/pytorch/pytorch/pull/147994#issuecomment-2718335186))
2025-03-12 15:50:38 +00:00
9a0f65d3d3 [TD] test_cpp_extensions_aot_ninja corresponds to things in test/cpp_extensions (#148992)
Manually map test_cpp_extensions_aot_ninja to files in test/cpp_extensions since test_cpp_extensions_aot_ninja isn't an actual file you can edit, but a wrapper for files in test/cpp_extensions.

Idk if this is a good idea, feels very manual.  Maybe it would be better to classify this the same as any other TD failure where TD simply can't figure out the tests it needs to run
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148992
Approved by: https://github.com/malfet, https://github.com/seemethere, https://github.com/janeyx99
2025-03-12 15:40:06 +00:00
488c4480f9 [inductor] Fix profiler tests with latest Triton (#149025)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149025
Approved by: https://github.com/yanboliang
2025-03-12 15:34:26 +00:00
5ada4e6a53 Revert "Reland: [inductor] Simplify grid handling (#148305)"
This reverts commit 8d08b4901586f230353a558ee00c16ad57f95178.

Reverted https://github.com/pytorch/pytorch/pull/148305 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/148305#issuecomment-2718177044))
2025-03-12 14:58:43 +00:00
cyy
8fa81a6066 Enable misc-use-internal-linkage check and apply fixes (#148948)
Enables clang-tidy rule [`misc-use-internal-linkage`](https://clang.llvm.org/extra/clang-tidy/checks/misc/use-internal-linkage.html). This new check was introduced in Clang-Tidy 18 and is available due to recent update of Clang-Tidy 19.

The check marks functions and variables used only in the translation unit as static. Therefore undesired symbols are not leaked into other units, more link time optimisations are possible and the resulting binaries may be smaller.

The detected violations were mostly fixed by using static. In other cases, the symbols were indeed consumed by other files, so their declaring headers were included. Some declarations were still wrong and have been fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148948
Approved by: https://github.com/Skylion007
2025-03-12 14:22:56 +00:00
f349304c08 [Inductor][CPP] Fix expr issue in loop split (#148882)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/148058. In this case, there is an `indexing_expr` as an integer which doesn't have the method of `find`.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_issue_148058
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148882
Approved by: https://github.com/jgong5
2025-03-12 11:08:07 +00:00
81aee3c9c4 [Partitioner] Reduce time consuming of partitions merger (#146582)
This patch optimizes `maybe_merge_partition` in three ways:

- Remove an unnecessary copy (https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/infra/partitioner.py#L99). The number of copied nodes is large if all of the nodes of the graph can be merged into one partition.
- Record the users of each partition to avoid repeated iteration over nodes (https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/infra/partitioner.py#L133). The trip count of this loop may be very large.
- The number of nodes in each partition may be unbalanced (https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/infra/partitioner.py#L145). A common case is one partition with n nodes and another with a single node. Merging the smaller partition into the larger one reduces the time spent.
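The third point is the usual small-into-large merge. The sketch below uses plain sets and a hypothetical `merge_partitions` helper; it does not mirror the data structures in `partitioner.py`.

```python
def merge_partitions(a: set[str], b: set[str]) -> set[str]:
    # Always move nodes out of the smaller partition, so the work done is
    # proportional to min(len(a), len(b)) rather than to the larger size.
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    large.update(small)
    small.clear()
    return large


big = {f"n{i}" for i in range(1000)}
one = {"extra"}
merged = merge_partitions(one, big)  # moves 1 node instead of 1000
```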

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146582
Approved by: https://github.com/jerome-habana, https://github.com/Skylion007
2025-03-12 09:24:38 +00:00
d547a56668 [AMD] Various fixes for mem efficient attention on CK backend (#148986)
Summary: Decouple aotriton vs. ck for mem efficient attention. Also fixed HW check.

Reviewed By: henryhu6

Differential Revision: D70872677

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148986
Approved by: https://github.com/jianyuh, https://github.com/houseroad
2025-03-12 07:36:46 +00:00
924a247fbb [MPS] Enable angle and atan2 for torch.long (#149017)
This check was added by https://github.com/pytorch/pytorch/pull/85817, that introduced no unit-tests and its content seems to be totally unrelated to title/subject of that PR. Anyway, right now it seems to be working fine on MacOS-13+

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149017
Approved by: https://github.com/dcci
2025-03-12 04:48:52 +00:00
7b78a2c415 [MPSInductor] Fix argmin/argmax long reductions (#149021)
By adding an additional indexes array for aggregates and populating it when performing partial reductions.
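As a rough illustration of the idea (plain Python rather than Metal, with hypothetical names), each simulated thread keeps both its best value and the index where it occurred, and the final step reduces over those pairs:

```python
def two_stage_argmax(values: list[float], num_threads: int) -> int:
    # Stage 1: each thread scans a strided slice, tracking value and index.
    partial_vals = [float("-inf")] * num_threads
    partial_idxs = [0] * num_threads
    for tid in range(num_threads):
        for i in range(tid, len(values), num_threads):
            if values[i] > partial_vals[tid]:
                partial_vals[tid] = values[i]
                partial_idxs[tid] = i
    # Stage 2: reduce over the per-thread aggregates and return the stored index.
    best = max(range(num_threads), key=lambda t: partial_vals[t])
    return partial_idxs[best]


result = two_stage_argmax([3.0, 9.0, 1.0, 7.0, 8.0], num_threads=2)
```

Without the separate index array, the second stage would only know the winning thread, not the position of the winning element in the input.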

And with that I can finally `torch.compile` TinyStories and get 600+ tokens/sec vs <200 on eager

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149021
Approved by: https://github.com/jansel
ghstack dependencies: #148969, #148975, #149004, #149020
2025-03-12 04:39:29 +00:00
758522d56a [MPSInductor][EZ] Fix argmin/max signatures (#149020)
threadgroup_argmin used to return input type, which is wrong, it should have returned `int` or `long`

Change signatures of both threadgroup_argmin and threadgroup_argmax to return int; as group size is small, there is no need to carry over large integers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149020
Approved by: https://github.com/jansel
ghstack dependencies: #148969, #148975, #149004
2025-03-12 04:39:29 +00:00
fe22db9cc3 [MPSInductor] Fix min/max reductions over large dims (#149004)
Simple followup after sum/prod

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149004
Approved by: https://github.com/jansel
ghstack dependencies: #148969, #148975
2025-03-12 04:39:19 +00:00
clr
2a7e997b3f test/dynamo/test_utils: Fix one broken test on different python versions (#148987)
We correctly handled different python versions in the explicit ir_nodes test, but
didn't handle it in the dynamo_timed test. Just explicitly deleting the fields
there so the dynamo_timed test passes on all python versions.

(I noticed it breaking on 3.13).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148987
Approved by: https://github.com/jansel
2025-03-12 02:11:08 +00:00
e40a9e602b Add the max_autotune tests in the periodic jobs. (#143560)
To promptly detect issues with max_autotune, such as [#143102](https://github.com/pytorch/pytorch/issues/143102), add the max_autotune tests to the periodic CI to track the accuracy regularly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143560
Approved by: https://github.com/leslie-fang-intel, https://github.com/desertfire
2025-03-12 01:47:46 +00:00
60576419a2 Make dynamism code robust to NotImplementedException (#148823)
In prod many models have `@property` methods that raise
NotImplementedError. This PR updates our dynamism code to be more robust
to these types of models.
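A minimal sketch of the pattern, assuming a hypothetical `safe_getattr` helper (the actual dynamism code is not shown in this log):

```python
class Model:
    @property
    def unsupported(self):
        raise NotImplementedError("not available on this model")

    @property
    def hidden_size(self):
        return 128


def safe_getattr(obj, name, default=None):
    # Treat properties that raise NotImplementedError as absent instead of crashing.
    try:
        return getattr(obj, name)
    except NotImplementedError:
        return default


m = Model()
attrs = {n: safe_getattr(m, n) for n in ("hidden_size", "unsupported")}
```

Note that `getattr(obj, name, default)` alone would not help here, since it only swallows `AttributeError`.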

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148823
Approved by: https://github.com/laithsakka
2025-03-12 01:01:57 +00:00
46f096bba6 Explicitly set use-ephemeral runners for windows nightly cpu test jobs (#149001)
This PR migrated windows builds to use ephemeral runners: https://github.com/pytorch/pytorch/pull/134463 however missed test jobs.

Explicitly set use-ephemeral runners for windows nightly cpu tests.
Please note we should already be using ephemeral runners for these after: https://github.com/pytorch/test-infra/pull/6377 (recently migrated)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149001
Approved by: https://github.com/malfet
2025-03-11 23:51:39 +00:00
5b60749e9e [cudagraph] add log for skip reasons (#148797)
Summary: Add skip reasons to dynamo_compile so we can know popular skip reasons for cudagraph

Test Plan: {F1975906635}

Differential Revision: D70820791

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148797
Approved by: https://github.com/masnesral
2025-03-11 23:31:48 +00:00
98a2d905bf [MPSInductor] Fix large prod and sum reductions (#148975)
After this change, if reduction dimension is larger than `max_threadgroup_size`,  emit a `for` loop from `codegen_iteration_ranges_entry` and wrap it up in `codegen_body()`
I.e. after this change, the following command
```
% TORCH_LOGS=output_code python -c "import torch;print(torch.compile(lambda x:(x[0::2].sin()+(x[1::2] + .4).cos()).sum(dim=0) - 3.14)(torch.rand(4096, device='mps')))" 2>&1|cut -c 86-
```
will emit the following shader
```metal
#include <c10/metal/random.h>
#include <c10/metal/special_math.h>
#include <c10/metal/utils.h>
#include <c10/metal/reduction_utils.h>
kernel void generated_kernel(
    device float* out_ptr1,
    constant float* in_ptr0,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    threadgroup float tmp_acc_0[1024];
    tmp_acc_0[r0_index] = 0;
    for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) {
        int r0_0 = 2 * r0_index + r0_0_cnt;
        if (r0_0 >= 2047) break;
        auto tmp0 = in_ptr0[2*r0_0];
        auto tmp2 = in_ptr0[1 + 2*r0_0];
        auto tmp1 = metal::precise::sin(tmp0);
        auto tmp3 = 0.4;
        auto tmp4 = tmp2 + tmp3;
        auto tmp5 = metal::precise::cos(tmp4);
        auto tmp6 = tmp1 + tmp5;
        tmp_acc_0[r0_index] += tmp6;
    }
    auto tmp7 = c10::metal::threadgroup_sum(tmp_acc_0, 1024);
    auto tmp8 = 3.14;
    auto tmp9 = tmp7 - tmp8;
    out_ptr1[0] = static_cast<float>(tmp9);
}
```
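The loop structure of the shader can be mimicked in plain Python. This is only a simulation under stated assumptions: `threadgroup_sum_sim` and its parameters are hypothetical, and the per-thread chunking follows the `2 * r0_index + r0_0_cnt` indexing shown above.

```python
import math


def threadgroup_sum_sim(data: list[float], max_threadgroup_size: int = 1024) -> float:
    # Number of elements each simulated thread accumulates in its for loop.
    per_thread = math.ceil(len(data) / max_threadgroup_size)
    tmp_acc = [0.0] * max_threadgroup_size
    for r0_index in range(max_threadgroup_size):
        for cnt in range(per_thread):
            r0 = per_thread * r0_index + cnt
            if r0 >= len(data):
                break  # bounds check, as in the generated kernel
            tmp_acc[r0_index] += data[r0]
    # Final stage: reduce the per-thread partial sums.
    return sum(tmp_acc)


data = [float(i) for i in range(2048)]
total = threadgroup_sum_sim(data)
```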
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148975
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #148969
2025-03-11 22:46:41 +00:00
2dcdb4ba78 [ez] include config as part of __all__ in torch.compiler (#148978)
Right now we are susceptible to a race condition: if torch.compiler.config is not implicitly imported via dynamo/builder.py, we throw an error when trying to set compiler configs. This fixes it by including config in `__all__`.

Previous
```
>>> import torch
>>> torch.compiler.config.dynamic_sources = "L['kwargs']['float_features']"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'torch.compiler' has no attribute 'config'
```

Now
```
>>> import torch
>>> torch.compiler.config.dynamic_sources = "L['kwargs']['float_features']"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148978
Approved by: https://github.com/bdhirsh, https://github.com/laithsakka
2025-03-11 21:58:38 +00:00
a6459afb0e [dynamic shapes] add backed_size_oblivious option (#148696)
Adds option `torch.fx.experimental._config.backed_size_oblivious = True` to allocate `[0, inf]` instead of `[2, inf]` ranges for size-backed symbols, and to opt into size-oblivious semantics for them.

Helps in a number of cases like
- Keeps `[0, inf]` bounds for unbacked symbols, when we make an unbacked -> backed replacement
- More sound handling for 0/1 inputs at runtime when we lower from export
- Avoids ends-of-bounds, sys.maxsize constraint violations for exporting with named Dims (https://github.com/pytorch/pytorch/issues/146315, https://github.com/pytorch/pytorch/issues/146046)

May look towards turning this on globally for export.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148696
Approved by: https://github.com/bobrenjc93
2025-03-11 21:52:34 +00:00
53a1a022a9 [WIP] Initial implementation of Grouped Gemm API (#148531)
This PR provides initial cutlass implementation of grouped gemm api as described in this [document](https://docs.google.com/document/d/1985La6wUUVH1AGBkNhaGKUXzx-9ybtbUp567-vYVOM4/edit?tab=t.0#heading=h.g8lzbjnyzzx9). Any combination of 2d and 3d inputs is supported, with 2d input being jagged, and the offsets of the jagged input being given by device tensor `offs`. Only H100 is supported, and only fp8_e4m3 with bf16 output and rowwise scaling. All the dimensions of each individual gemm have to be multiple of 16, that's cutlass limitation.
I'll need to add those checks, for dynamic dimensions unfortunately the checks will have to be a device assert.
I had to copy-paste cutlass's `Sm90RowBroadcast` and `Sm90ColBroadcast` structs with minor changes to enable scales given as pointer arrays, ideally those should be part of cutlass itself.
I copied the schedules from the similar grouped gemm in FBGEMM, but there's a lot of room to improve perf, especially for `fast_accum=False`.
Next steps would be perf tuning and increasing coverage to B100, I don't know how cutlass grouped gemm example handles blockwise scaling on B100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148531
Approved by: https://github.com/drisspg
2025-03-11 21:49:46 +00:00
b98af95401 Fix DCP link (#148974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148974
Approved by: https://github.com/svekars
2025-03-11 21:26:37 +00:00
6119ffc711 [ROCm][TunableOp] Fix TunableOp BLAS logging for online tuning case. (#148979)
In a previous PR https://github.com/pytorch/pytorch/pull/147034, there was a bad merge at the last minute.
BLAS logging works for offline tuning, but does not currently work for online tuning.

This PR fixes BLAS logging for online tuning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148979
Approved by: https://github.com/jeffdaily
2025-03-11 21:20:04 +00:00
e5fef8a08e [CI] Don't clean workspace when fetching repo (#147994)
Tested on 874c5dc4c98cc63a06bfc900d03683b02f110d7c'
Also tested on https://github.com/pytorch/pytorch/actions/runs/13798178199/job/38594767529?pr=148995#step:4:12

Don't remove the workspace when fetching.  The checkout action performs git clean -ffdx to remove untracked files and files in gitignore

This is probably not going to be useful if we switch entirely to ephemeral runners but w/e

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147994
Approved by: https://github.com/malfet, https://github.com/atalman
2025-03-11 21:10:56 +00:00
72d9f88ef2 [release] Move triton pin to latest triton release/3.3.x (#148971)
This branch contains latest AMD cherry-picks:
https://github.com/triton-lang/triton/pull/6171
https://github.com/triton-lang/triton/pull/6165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148971
Approved by: https://github.com/danzimm
2025-03-11 21:10:42 +00:00
e6ef0620cc Add shim.h C API to call dispatcher on our own aten ops (#148832)
This PR still needs testing through some cpp extension

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148832
Approved by: https://github.com/albanD, https://github.com/atalman
ghstack dependencies: #148124
2025-03-11 21:02:04 +00:00
cf19efd3d9 Support basic TorchBind in aot_compile and aoti_compile_and_package (#148506)
Summary:
**Codegen**

- Skip some codegen parts for torchbind (such as arg declaration) because they are loaded in proxy executor, so we do not need to declare torchbind args in cpp code
- Added a helper method to get the schema of CallTorchBind HOP. The returned schema is only the schema of `obj.method()`.

**Serialization**
Add support for torchbind object in serialization

- For CallTorchBind HOP, we need to handle it specially because of its schema. The output serialized args are in the format of `(obj, method, *args, **kwargs)`.
- it.TorchBindObject inputs are serialized to `as_custom_obj` Argument.

**Packaging**

Add torchbind objects file and `custom_objs_config.json` file to generated files output of `aot_compile`.

The json file is stored in the `data/aotinductor/<model_name>` folder in pt2 archive.

The torchbind objects are stored in data/constants/ folder in pt2 archive.
The format of torchbind objects are `f"{CUSTOM_OBJ_FILENAME_PREFIX}{custom_obj_idx}"`. e.g. `custom_obj_0`.
CustomClassHolder objects implement their own pickle methods.

Note that this `custom_objs_config.json` file is different from the `model_constants_config.json` file produced in package_sigmoid(). The keys in `custom_objs_config` directly correspond to the arg name in extern nodes json.
The key in `model_constants_config.json` produced by `package_sigmoid` is the attribute name in the user mode code.

This is required for both internal and OSS torchbind support.
For OSS torchbind support, we also need to package torchbind_constants into the .pt2 output.

**Work Left**
We still need to add torchbind support in ProxyExecutor for inductor.aoti_load_package to work. See other diffs in the stack.

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r schema
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile
```

Differential Revision: D69490718

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148506
Approved by: https://github.com/angelayi
2025-03-11 20:55:18 +00:00
f69e58e8e8 [CI] Update crossvit_9_240 as pass (#148989)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148989
Approved by: https://github.com/ZainRizvi
2025-03-11 20:54:39 +00:00
b54cf1a281 Revert "[logging] Set compile_id in the CachingAutotuner during compilation so we have it for dynamo_timed logging (#148693)"
This reverts commit 73c8068cf889829fb811fc75baac03163c9a42ee.

Reverted https://github.com/pytorch/pytorch/pull/148693 on behalf of https://github.com/ZainRizvi due to This is breaking lint on trunk. Please rebase these changes before merging them back in. [GH job link](https://github.com/pytorch/pytorch/actions/runs/13796723235/job/38590020554) [HUD commit link](73c8068cf8) ([comment](https://github.com/pytorch/pytorch/pull/148693#issuecomment-2715671875))
2025-03-11 20:50:23 +00:00
c18858d633 [MPS] Make torch.mps.compile_shader public (#148972)
It was a private method in 2.6, but nothing changed in its API for 2.7
and it will likely remain the same in 2.8, so time to remove underscore
from its name

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148972
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/seemethere, https://github.com/albanD, https://github.com/dcci
2025-03-11 20:20:58 +00:00
abcec55532 gracefully handle tokenize.TokenError in funcname parser. Adds support for non-Python source (#148737)
This change allows defining python functions in non-python source and having them be compiled by torch.compile. The existing implementation already returns None for the case where the file couldn't be read, so returning None (by making an empty funcname cache) makes sense for the case of non-python source code too.
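A minimal sketch of the fallback, using only the standard `tokenize` module. `build_funcname_cache` is a hypothetical stand-in for the real funcname parser and only records `def` names by line:

```python
import io
import tokenize


def build_funcname_cache(source: str) -> dict[int, str]:
    """Map the line of each `def` to its function name; empty if untokenizable."""
    try:
        tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    except tokenize.TokenError:
        # Non-Python source (or truncated Python): fall back to an empty cache.
        return {}
    cache = {}
    for prev, cur in zip(tokens, tokens[1:]):
        if prev.type == tokenize.NAME and prev.string == "def" and cur.type == tokenize.NAME:
            cache[prev.start[0]] = cur.string
    return cache


python_cache = build_funcname_cache("def f(x):\n    return x * x\n")
lisp_cache = build_funcname_cache("(defn f [x]\n  (* x x")  # unbalanced on purpose
```

With an empty cache every lookup misses, which matches the existing "file could not be read" behavior described above.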

Example [basilisp](https://github.com/basilisp-lang/basilisp):
```clojure
(import torch)
(import [torch.nn.functional :as F])
(torch/rand 10)

(defn f {:decorators [torch/compile]} [x]
  (* (F/relu x) x))

(f (-> (torch/randn 100)
       (.cuda)))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148737
Approved by: https://github.com/williamwen42
2025-03-11 19:49:28 +00:00
73c8068cf8 [logging] Set compile_id in the CachingAutotuner during compilation so we have it for dynamo_timed logging (#148693)
Summary: This is a simpler alternative to https://github.com/pytorch/pytorch/pull/146455, where we can stick the compileId (and forward/backward bool) in the CachingAutotuner so that we have it for logging `benchmark_all_configs`. Recall that the first attempt put the compileId in the inductor_meta and that interfered with caching.

Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
* tlparse: https://fburl.com/e71yn6uc
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/4ageghhv
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/4fgv1itq

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148693
Approved by: https://github.com/eellison
2025-03-11 19:38:40 +00:00
5b8da17681 [cutlass backend] Add addmm and bmm tests for AOTI (#148929)
To do:
1. Expand addmm tests to cover all 4 shapes
2. Add dynamic shape support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148929
Approved by: https://github.com/jingsh, https://github.com/ColinPeppler
2025-03-11 19:38:24 +00:00
7b2ecb80eb [Codemod][AddExplicitStrictExportArg] caffe2/test/inductor (#148928)
Differential Revision: D70908557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148928
Approved by: https://github.com/angelayi
2025-03-11 19:36:30 +00:00
61f9b50e09 [ROCm] Fix TORCH_CHECK for hdim 512 support added in AOTriton 0.9b (#148967)
Fixes #148850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148967
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-03-11 19:21:10 +00:00
971606befa Add a stable TORCH_LIBRARY to C shim (#148124)
This PR adds two main parts:
- shim.h stable C APIs into torch::Library APIs
- a higher level API in torch/csrc/stable/library.h that calls into this shim.h + otherwise is self contained

Goal: custom kernel writers should be able to call the apis in the directories above in order to register their library in a way that allows their custom extension to run with a different libtorch version than it was built with.

Subplots resolved:

- Do we want a whole separate StableLibrary or do we want to freeze torch::Library and add `m.stable_impl(cstring, void (*fn)(void **, int64_t, int64_t))` into it?
    - Yes, we want a separate StableLibrary. We cannot freeze Library and it is NOT header only.
- Should I use uint64_t as the common denominator instead of void* to support 32bit architectures better?
    -  Yes, and done
- Should I add a stable `def` and `fragment` when those can be done in python?
    - I think we do want these --- and now they're done
- Where should library_stable_impl.cpp live? -- no longer relevant
- I need some solid test cases to make sure everything's going ok. I've intentionally thrown in a bunch of random dtypes into the signature, but I still haven't tested returning multiple things, returning nothing, complex dtypes, etc.
    - Have since tested all the torch library endpoints. The others can be tested in a followup to separate components that need to be in shim.h vs can be added later

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148124
Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/atalman
2025-03-11 19:12:46 +00:00
4d10da731b [ROCm] CK Memory-Efficient Attention (attention bias support) (#147778)
Implements CK as the backend for memory efficient attention with a couple caveats:

- Still enabled via `torch.backends.cuda.preferred_rocm_fa_library("ck")`
- Does NOT support Nested Tensors

Using the mem_eff path allows us to use attention bias with a CK sdpa backend

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147778
Approved by: https://github.com/houseroad
2025-03-11 19:02:59 +00:00
a1cb67b69e [ROCm] Improve backwards indexing when stride is not one (#147630)
Improve backwards indexing when stride is not one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147630
Approved by: https://github.com/jeffdaily
2025-03-11 19:02:48 +00:00
daff65d671 Correctly propagate exception to parent tx (#146502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146502
Approved by: https://github.com/anijain2305, https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #146504, #146499
2025-03-11 18:55:45 +00:00
fb53e9e514 Add __context/cause/suppress_context/traceback__ to Exception (#146499)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146499
Approved by: https://github.com/zou3519, https://github.com/anijain2305
ghstack dependencies: #146504
2025-03-11 18:55:45 +00:00
4e7d264cf8 Introduce UserDefinedExceptionClassVariable (#146504)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146504
Approved by: https://github.com/anijain2305
2025-03-11 18:55:45 +00:00
8d08b49015 Reland: [inductor] Simplify grid handling (#148305)
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583

Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg.  This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```

This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.

It also allows us to unify the handling of grids between the Python and C++ wrapper code.  Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.

This unification allows this PR to be a net deletion of code.

Differential Revision: D70471332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-03-11 18:51:06 +00:00
c916a8efc5 Revert "Use the device interface for detecting Triton availability (#139171)"
This reverts commit 940b60db974f08a31c746eec2f9c399fc8a861ee.

Reverted https://github.com/pytorch/pytorch/pull/139171 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @jansel can you please help get these changes working? See D70946254 for more details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/139171#issuecomment-2715392451))
2025-03-11 18:49:21 +00:00
57ee821a41 fix dynamo ide (#148849)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148849
Approved by: https://github.com/bobrenjc93
2025-03-11 18:43:30 +00:00
883fb78c7e Update jinja2 version in requirements-gha-cache.txt
As previous version is vulnerable  to CVE-2025-27516
This closes Dependabot report
2025-03-11 11:42:38 -07:00
5ee9dbc0a1 Bump jinja2 from 3.1.5 to 3.1.6 in /.ci/docker (#148812)
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.5 to 3.1.6.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.1.5...3.1.6)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-11 11:39:55 -07:00
cyy
a5f6b24d87 Remove outdated skipIfRocmVersionLessThan decorations (#148941)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148941
Approved by: https://github.com/jeffdaily
2025-03-11 18:37:40 +00:00
ef6296e7f2 [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line as compute kernels.
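Points 2 to 4 can be pictured with a toy model. Everything here is hypothetical (`Work`, `stash`, and `watchdog_poll` are illustrative names, not `ProcessGroupNCCL` APIs); it only shows how holding references on the CPU side keeps buffers alive until `wait()` or the watchdog releases them.

```python
class Work:
    def __init__(self, tensors, async_op):
        # Only async ops stash references; sync ops run on the current stream
        # and need no extra lifetime management.
        self.stash = list(tensors) if async_op else []
        self.completed = False

    def wait(self):
        # Waiting releases the stashed references.
        self.stash.clear()


def watchdog_poll(works):
    # Safety net: unstash tensors of completed work the user never waited on.
    for w in works:
        if w.completed:
            w.stash.clear()


forgotten = Work([object()], async_op=True)  # user never calls wait()
forgotten.completed = True
watchdog_poll([forgotten])
```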

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70937982](https://our.internmc.facebook.com/intern/diff/D70937982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-11 18:36:12 +00:00
b366f33606 [MPSInductor] Prep for multistage reductions (#148969)
----

- Move reduction variable initialization from `loads` to `indexing_code`
- Move barriers from `codegen_kernel` to `reduction` and only use them for `any` reductions (as other reduction ops do barriers explicitly inside the respective reduction functions)
- Use `self.compute` instead of `self.body` for all compute operations

Checked that number of before/after failures stays at `164 failed, 616 passed, 53 skipped`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148969
Approved by: https://github.com/dcci
2025-03-11 18:35:23 +00:00
dcc502f376 [ROCm][TunableOp] Add bias data type to params signature. (#146227)
Add bias vector data type in TunableOp params signature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146227
Approved by: https://github.com/jeffdaily
2025-03-11 18:31:22 +00:00
52acc1f955 [DSD] Update the document to mention the limitation of set_optimizer_state_dict (#148918)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/140898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148918
Approved by: https://github.com/fduwjj, https://github.com/mori360
ghstack dependencies: #148825
2025-03-11 18:24:12 +00:00
e0d4c43ad1 Add env for disabling meta reference on functionalization. (#148822)
Fix: https://github.com/pytorch/xla/issues/8755

This PR introduces `TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE`
environment variable. Setting this variable makes it so the
functionalization kernels won't run the meta reference, which is used to
propagate expected sizes and strides.

Currently, PyTorch/XLA doesn't actually propagate the correct strides
to its tensors. It was also shown that calling these meta functions may
incur significant overhead.

Running the provided minimal reproducer (see issue), we see a speedup
close to 4.3x:

- Baseline: 0.0747s
- `XLA_DISABLE_FUNCTIONALIZATION=1`: 0.0159s
- `TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1`: 0.0175s
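
The "close to 4.3x" figure follows from dividing the baseline time by the time measured with the new variable set. A quick check, using only the three timings listed above (the helper name `speedup` is just for this sketch):

```python
# Timings quoted in the commit message, in seconds.
baseline = 0.0747
timings = {
    "XLA_DISABLE_FUNCTIONALIZATION=1": 0.0159,
    "TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1": 0.0175,
}


def speedup(before, after):
    """Ratio of the old running time to the new one."""
    return before / after


for name, seconds in timings.items():
    print(f"{name}: {speedup(baseline, seconds):.2f}x")
```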

In summary, this PR:

- Creates the `disable_meta_reference()` function, which checks whether
  the environment variable is set
- Modifies codegen for functionalization kernels, adding the call to
  `disable_meta_reference()` function to the appropriate conditions
- Creates a new bash function for running `lazy/test_ts_opinfo.py` with
  the environment variable set
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148822
Approved by: https://github.com/bdhirsh
2025-03-11 16:13:35 +00:00
09029010e5 [inductor] Fix create_specialize_impl error in latest Triton (#148933)
```py
$ python test/inductor/test_triton_kernels.py KernelTests.test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_inductor_grid_type_1
WARNING:torch._dynamo:Encountered an exception in identify_mutated_tensors, assuming every input is mutated
Traceback (most recent call last):
  File "/home/jansel/pytorch/torch/_higher_order_ops/triton_kernel_wrap.py", line 715, in identify_mutated_tensors
    ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_higher_order_ops/triton_kernel_wrap.py", line 289, in generate_ttir
    specialization = _get_specialization(ordered_args.values())
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_higher_order_ops/triton_kernel_wrap.py", line 262, in _get_specialization
    specialize_impl = triton.runtime.jit.create_specialize_impl()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: create_specialize_impl() missing 1 required positional argument: 'specialize_extra'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148933
Approved by: https://github.com/yanboliang, https://github.com/davidberard98
2025-03-11 15:54:47 +00:00
16560d4e8f Revert "Refactor test/test_torch.py by moving testcase to test_indexing.py (#148875)"
This reverts commit 0fa0a740958ffc474843ceb1d19ee43c4bff4c09.

Reverted https://github.com/pytorch/pytorch/pull/148875 on behalf of https://github.com/ZainRizvi due to That torch.version failure you got in CI was a legitimate failure and is now breaking trunk. [GH job link](https://github.com/pytorch/pytorch/actions/runs/13778023702/job/38534207536) [HUD commit link](0fa0a74095) ([comment](https://github.com/pytorch/pytorch/pull/148875#issuecomment-2714757288))
2025-03-11 15:27:25 +00:00
3945954741 Bump triton pin. Add aarch64 triton build (#148705)
1. Bumps pin for triton to release/3.3.x branch
2. Bump pin for triton-xpu
3. Remove ROCm xfail tests
4. Add aarch64 triton build:
	* Depends on: https://github.com/pytorch/pytorch/pull/148768
	* Fixes: https://github.com/pytorch/pytorch/issues/130558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148705
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/EikanWang
2025-03-11 15:12:21 +00:00
c983e1124c Revert "[WIP] Initial implementation of Grouped Gemm API (#148531)"
This reverts commit ff29791ed8f815bdbca1a5606de046380baca69d.

Reverted https://github.com/pytorch/pytorch/pull/148531 on behalf of https://github.com/janeyx99 due to Sorry but this broke ROCm jobs on trunk ([comment](https://github.com/pytorch/pytorch/pull/148531#issuecomment-2714577498))
2025-03-11 14:40:58 +00:00
f1787ee0f7 [dynamo] Remove L scoping for recompilation messages (#148917)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148917
Approved by: https://github.com/williamwen42
2025-03-11 14:26:26 +00:00
992838e702 [dynamo][guards] Do not ID_MATCH on numpy tensors (#148923)
Might help with https://github.com/pytorch/pytorch/issues/148535

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148923
Approved by: https://github.com/jansel
2025-03-11 14:20:26 +00:00
ee21ccc816 Skip ao_sparsity TestComposability for missing FBGEMM (#144146)
Those tests (from test_ao_sparsity) require FBGEMM which may not be available. So add the skip decorator.

Fixes #87364
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144146
Approved by: https://github.com/jerryzh168, https://github.com/jcaip
2025-03-11 13:02:18 +00:00
da4bb72a71 Backout D70075331 (#148824)
Summary:
The AOTI lowering for model 699109736 and other new models worked before D70075331, but failed after with error "RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 4096 n 10 k 7936 mat1_ld 7936 mat2_ld 7936 result_ld 4096 abcType 2 computeType 68 scaleType 0"

So we revert D70075331 as a workaround now.

Test Plan: The model could be lowered and published successfully. e.g. 702869739_16

Differential Revision: D70823254

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148824
Approved by: https://github.com/eqy
2025-03-11 12:51:17 +00:00
9ad64ce795 [triton 3.3] Forward-fix mm template selection logic (#148924)
Follow-up from https://github.com/pytorch/pytorch/pull/148662.

The logic from https://github.com/pytorch/pytorch/pull/148662 is incorrect; what we want is "choose the second template 'AMD-specific template' only if we're on hip AND triton version < 3.3" - negating it, the code should be "choose the first template if we're NOT on hip OR triton version >= 3.3".
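
The negation described here is an application of De Morgan's law. A minimal sketch, assuming the Triton version is available as a tuple (the function names are hypothetical and are not the ones used in Inductor):

```python
def use_amd_template(on_hip: bool, triton_version: tuple) -> bool:
    """Intended condition: AMD-specific template only on HIP with Triton < 3.3."""
    return on_hip and triton_version < (3, 3)


def use_first_template(on_hip: bool, triton_version: tuple) -> bool:
    """Negation via De Morgan: NOT on HIP, OR Triton >= 3.3."""
    return (not on_hip) or triton_version >= (3, 3)


# The two predicates must be exact complements for every combination.
for on_hip in (False, True):
    for version in ((3, 2), (3, 3), (3, 4)):
        assert use_first_template(on_hip, version) != use_amd_template(on_hip, version)
```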

Tested locally to verify that it fixes the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148924
Approved by: https://github.com/drisspg, https://github.com/atalman, https://github.com/eellison
2025-03-11 09:05:44 +00:00
2bcc3acb90 Update low prec codegen for div/mod (#142350)
Div/mod in fp16/bf16 requires a downcast to preserve its inputs' dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142350
Approved by: https://github.com/blaine-rister
2025-03-11 08:02:30 +00:00
41e4728f74 update types on dynamo configs (#146873)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146873
Approved by: https://github.com/williamwen42
2025-03-11 05:33:48 +00:00
1fcc4bc109 Don't look at TESTING_ONLY in fuzzer (#146870)
Lots of configs aren't meant to be set because they're testing only

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146870
Approved by: https://github.com/masnesral
2025-03-11 05:32:25 +00:00
bed92a8523 [Window][Inductor UT] Fix for tempfile.NamedTemporaryFile(delete=True) not work on Windows. (#148632)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148632
Approved by: https://github.com/jansel
2025-03-11 05:05:15 +00:00
ecfbfe1603 [AOTI] Remove aoti_torch_cpu__weight_int4pack_mm_cpu_tensor (#148907)
Summary: shim.h is only meant for generic tensor util shim functions. We should switch to use the auto fallback generation, but it will need some extra care on the op schema.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148907
Approved by: https://github.com/janeyx99
2025-03-11 04:41:05 +00:00
940b60db97 Use the device interface for detecting Triton availability (#139171)
This allows for each device type to check current devices for Triton compatibility and ensure their Triton backend is present.

This PR replaces the `has_triton()` global method which was previously used for this task, and moves the initial check for each Inductor backend on to their associated `BaseScheduler` subclass. This means that other backends, such as Halide, can also implement their own availability checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139171
Approved by: https://github.com/jansel
2025-03-11 03:56:11 +00:00
ff29791ed8 [WIP] Initial implementation of Grouped Gemm API (#148531)
This PR provides initial cutlass implementation of grouped gemm api as described in this [document](https://docs.google.com/document/d/1985La6wUUVH1AGBkNhaGKUXzx-9ybtbUp567-vYVOM4/edit?tab=t.0#heading=h.g8lzbjnyzzx9). Any combination of 2d and 3d inputs is supported, with 2d input being jagged, and the offsets of the jagged input being given by device tensor `offs`. Only H100 is supported, and only fp8_e4m3 with bf16 output and rowwise scaling. All the dimensions of each individual gemm have to be multiple of 16, that's cutlass limitation.
I'll need to add those checks, for dynamic dimensions unfortunately the checks will have to be a device assert.
I had to copy-paste cutlass's `Sm90RowBroadcast` and `Sm90ColBroadcast` structs with minor changes to enable scales given as pointer arrays, ideally those should be part of cutlass itself.
I copied the schedules from the similar grouped gemm in FBGEMM, but there's a lot of room to improve perf, especially for `fast_accum=False`.
Next steps would be perf tuning and increasing coverage to B100, I don't know how cutlass grouped gemm example handles blockwise scaling on B100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148531
Approved by: https://github.com/drisspg
2025-03-11 02:41:09 +00:00
621dadd4ca partitioner: when materializing unbacked tensor intermediates, apply hint to symbol, not expr (#144097)
Fixes https://github.com/pytorch/pytorch/issues/144095

open to suggestions: the `hint_int(..., fallback=...)` API feels like a bit of a footgun, because:

(1) we use the same guess for every unbacked symint (both symbols, and compound expressions)
(2) the user may have established some relationship between some unbacked symints that we are not taking into account.

I'm not sure how real of an issue (2) is - is it common to e.g. generate two unbacked symints, and then add a runtime assert that they are unequal?

Instead I did something simpler that's just enough to fix the linked issue: if we have a sympy expression containing an unbacked symbol (e.g. `u0 + 1`), then the partitioner will now fill in the symbol with our guess instead of the expression (plugging in `u0=4096` gets us 4097). This was important for an internal custom op, that had some logic like this:
```
def custom_op(x: [u0], y: [u0 + 1]):
    assert x.shape[0] == y.shape[0] - 1
    ...
```
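
To make the difference concrete, here is a small plain-Python sketch (no sympy, and not the partitioner's code). Sizes are modelled as functions of the symbol values, which is a hypothetical encoding chosen for this example; `hint_per_expression` and `hint_per_symbol` are likewise made-up names for the old and new behaviour as described above.

```python
FALLBACK_HINT = 4096  # guess used for unbacked symbols, as in the example above

# Each size is a function of the symbol environment (hypothetical encoding).
sizes = {
    "x": lambda env: env["u0"],
    "y": lambda env: env["u0"] + 1,
}


def hint_per_expression(sizes):
    """Old behaviour as described: every unbacked expression gets the same guess."""
    return {name: FALLBACK_HINT for name in sizes}


def hint_per_symbol(sizes):
    """New behaviour as described: guess the symbol, then evaluate the expression."""
    env = {"u0": FALLBACK_HINT}
    return {name: expr(env) for name, expr in sizes.items()}


old = hint_per_expression(sizes)
new = hint_per_symbol(sizes)
print(old["x"] == old["y"] - 1)  # False: both sizes were guessed as 4096
print(new["x"] == new["y"] - 1)  # True: y evaluates to 4097
```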

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144097
Approved by: https://github.com/laithsakka
2025-03-11 02:11:57 +00:00
8c45d44abb Skip distributed subprocess test internally as they don't work (#148909)
Follow up from https://github.com/pytorch/pytorch/pull/146098

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148909
Approved by: https://github.com/janeyx99
2025-03-11 02:07:45 +00:00
457ff9b7ae [reland][ca] side-effect free initial trace: compiled_args (#148376)
This reverts commit ea12fc8a9ff7da808e0b661ca07e9d4ce75d04bc.
Reland https://github.com/pytorch/pytorch/pull/147804, there was a bad import inserted by my linter.

Differential Revision: [D70582747](https://our.internmc.facebook.com/intern/diff/D70582747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148376
Approved by: https://github.com/jansel
2025-03-11 01:57:36 +00:00
9fddbf3417 Update the comment (#148726)
Differential Revision: D70747931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148726
Approved by: https://github.com/yf225
2025-03-11 01:19:14 +00:00
0fa0a74095 Refactor test/test_torch.py by moving testcase to test_indexing.py (#148875)
Fix `FIXME` in `test_torch.py` by moving test-cases to `test_indexing.py`

```python
# FIXME: move to test indexing
# FIXME: move to indexing test suite
```

- Move tests in `test/test_torch.py` to `test_indexing.py`
- Remove `FIXME` comments

## TestResult

```bash
pytest test/test_torch.py -k TestTorchDeviceType -vv
pytest test/test_indexing.py  -k TestIndexing -vv
```

![image](https://github.com/user-attachments/assets/49a80985-e74a-4da6-a063-476e87e6aa8a)

![image](https://github.com/user-attachments/assets/77afa936-5dba-480c-b293-eb1f7bc74420)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148875
Approved by: https://github.com/soulitzer
2025-03-11 01:01:59 +00:00
c297c09a37 Fix invalid nested int guarding in broadcast_shapes() (#145957)
Fixes #145874

This PR takes the approach of updating the logic determining whether multiple shapes broadcast together to handle nested ints specially.

Possible alternative approach: don't update `broadcast_shapes()` + indicate that e.g. `Ne(j0, 1)` should statically evaluate to False. I briefly tried this but it wasn't straightforward. Is it better?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145957
Approved by: https://github.com/bobrenjc93

Co-authored-by: bobrenjc93 <bobren@meta.com>
2025-03-11 00:53:13 +00:00
cyy
295f2ed4d1 Fix "invalid application of 'sizeof' to an incomplete type" (#148854)
Fixes with C++23 and constexpr std::unique_ptr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148854
Approved by: https://github.com/Skylion007
2025-03-11 00:40:00 +00:00
cyy
a6e71dbc88 Enable ASAN on inductor CUDA tests (#148749)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148749
Approved by: https://github.com/jansel
2025-03-10 23:53:40 +00:00
b215841ebb [MM] Add sm carveout to lowerings (#148793)
# Summary

See https://github.com/pytorch/pytorch/issues/145115 for more details. I have been using
the following to verify, need to figure out how to do proper guarding

This does the correct thing if we compile with the SM carveout already set, but since we don't guard on it yet, we don't recompile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148793
Approved by: https://github.com/lw, https://github.com/eellison
2025-03-10 23:49:26 +00:00
492f3fd5cf replace usages of upload_graph in inductor with tlparse (v2) (#148720)
Reland of https://github.com/pytorch/pytorch/pull/148703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148720
Approved by: https://github.com/mengluy0125
2025-03-10 22:47:58 +00:00
5bbca7d328 [ROCm][Windows] Fix OpenMP Flags for clang-cl (#148097)
When clang-cl parses its command-line arguments, it expects MSVC-style arguments (beginning with `/`, such as `/WX`, `/MD`, etc.) and requires clang-style arguments to be preceded by `-Xclang`; otherwise, the clang-style arguments are ignored because they are interpreted as unrecognized compiler options.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148097
Approved by: https://github.com/jeffdaily
2025-03-10 22:47:15 +00:00
a95eb0c0a7 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit 2149f6c6845d00711ffab648132b7377e8cd3edb.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/ZainRizvi due to Breaking internally, see D70873275. Discussed reverting this with Ke. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2712001270))
2025-03-10 22:38:40 +00:00
12a95390ae [Minimizer] allow overriding of ShapeProp logic by subclasses of _MinimizerBase (#148784)
Summary:
The changes contained in this diff
- allow subclass Minimizer implementations to override the default shape propagation logic with custom logic
- copy over the meta attribute on get_attr graph nodes during the graph splitting step
- for both changes, behavior for existing classes does not change

Test Plan: CI

Differential Revision: D70799942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148784
Approved by: https://github.com/blaine-rister
2025-03-10 22:22:16 +00:00
fcb633fafa Introduce TORCH_ABI_VERSION and a runtime aoti_torch_abi_version C shim ABI (#148892)
Importable https://github.com/pytorch/pytorch/pull/148836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148892
Approved by: https://github.com/albanD
2025-03-10 22:22:10 +00:00
98b3f1db9f [Flex Attention] support num_heads > 1 in block_mask (#148857)
Previously, flex decoding raised an error when the block mask had num_heads > 1, so users had to use num_heads=1 or explicitly set `kernel_options={"FORCE_USE_FLEX_ATTENTION": True}`.

This PR fixes this issue. When not using grouped query attention (GQA, i.e., Hq == Hkv), we support block mask with num_heads = 1 and num_heads = num_query_heads (i.e., Hq). This is the same setting as flex attention kernel.

When using GQA (i.e., Hq != Hkv), we support block mask with num_heads = 1. When num_heads = Hq, we fall back to the flex attention kernel, so users don't need to explicitly set `kernel_options={"FORCE_USE_FLEX_ATTENTION": True}` anymore.

Why fall back? In the current flex decoding triton kernel, grouped query heads for the same kv head are handled by the same thread block. Supporting num_heads = Hq with GQA requires supporting different kv num blocks for different query heads in the same thread block, leading to lots of redundant workload. So it is better to use the main flex_attention kernel, where each query head is handled by a separate block.
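
The resulting dispatch rules can be summarized in a short sketch. This is only a restatement of the description above, not the actual lowering code: `pick_kernel` is a hypothetical name, and raising `ValueError` for other head counts is an assumption of this example.

```python
def pick_kernel(hq: int, hkv: int, mask_heads: int) -> str:
    """Return which kernel handles a decoding call, per the rules described above."""
    if mask_heads not in (1, hq):
        # Assumption for this sketch: other head counts are rejected.
        raise ValueError("block mask must have num_heads == 1 or num_heads == Hq")
    is_gqa = hq != hkv
    if is_gqa and mask_heads == hq:
        # Query heads sharing a kv head would need different kv block counts
        # inside one thread block, so the main kernel is used instead.
        return "flex_attention"
    return "flex_decoding"


print(pick_kernel(hq=8, hkv=8, mask_heads=8))  # no GQA, per-head mask
print(pick_kernel(hq=8, hkv=2, mask_heads=1))  # GQA, shared mask
print(pick_kernel(hq=8, hkv=2, mask_heads=8))  # GQA, per-head mask
```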

Fixes #148527
Fixes #147267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148857
Approved by: https://github.com/drisspg
2025-03-10 22:02:50 +00:00
6ef15c7f46 [pytorch] Update flexattention bwd config generation (#148600)
Summary: Currently, the `flex_attention` template's backward config generation returns values for every case. This change instead stores intermediate values in `bwd_config`, which is returned at the end.

Test Plan: CI. Existing tests.

Differential Revision: D70649316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148600
Approved by: https://github.com/Skylion007
2025-03-10 22:00:56 +00:00
8701b302cc setuptools pinning (#148879)
Fixes #148877

---

On 9 March 2025, [setuptools](https://pypi.org/project/setuptools/#history) published a new version, and it is causing an issue on `pytorch` with the following error:

```
AttributeError: module 'distutils' has no attribute '_msvccompiler'. Did you mean: 'ccompiler'?
```

The last known working version is [75.8.2](https://pypi.org/project/setuptools/75.8.2/)

Currently it affects the Windows ARM64 nightly build, but it might soon affect Windows x64 builds as well (the conda version is not updated yet: [setuptools conda](https://anaconda.org/anaconda/setuptools)).

Locally, both `Windows ARM64` and `Windows x64` have the same problem with the latest `setuptools` (>75.8.2)

---

This PR pins the `setuptools` version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148879
Approved by: https://github.com/seemethere
2025-03-10 21:29:32 +00:00
c652772af7 [aarch64] install ninja for docker to build triton on arm (#148768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148768
Approved by: https://github.com/atalman, https://github.com/Skylion007

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-03-10 21:28:53 +00:00
b706044cca [ROCm][Windows] Enable hipblaslt for Windows (#148563)
This PR adds hipblaslt library as one of the Windows' dependencies. `rocBLAS` is added too, since certain symbols aren't detected with `hipblas` alone on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148563
Approved by: https://github.com/jeffdaily
2025-03-10 21:07:16 +00:00
2a1eeaeed8 Remove 12.4 x86 builds and 12.6 sbsa builds from nightly (#148895)
https://github.com/pytorch/pytorch/issues/145570

redo https://github.com/pytorch/pytorch/pull/148625

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148895
Approved by: https://github.com/atalman
2025-03-10 20:55:09 +00:00
4a2173d9a0 [cutlass backend][ez] Incorporate AOTI dynamic shape test into main test of MM (#148786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148786
Approved by: https://github.com/jingsh
2025-03-10 20:35:10 +00:00
e9c12e819d Update torch-xpu-ops commit pin (#148881)
Update the torch-xpu-ops commit to [026b2c8c7c92a7b2cec5d26334006e3423251cc6](026b2c8c7c), includes:

- Enable AOT for LNL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148881
Approved by: https://github.com/EikanWang
2025-03-10 20:31:51 +00:00
ed969d1236 [DSD] Fix the shared parameter mismatch for optimizer state_dict when flattening FQNs are used (#148825)
Summary:
As title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148825
Approved by: https://github.com/fduwjj, https://github.com/mori360
2025-03-10 20:04:36 +00:00
494abeff8a CUDACachingAllocator,c10d: fixes for IPC release performance (#148805)
This has two fixes to improve IPC tensor release performance when using torchft's BabyProcessGroupNCCL.

1. release the IpcMutex when deleting the `ExpandableSegements` object to avoid synchronizing under the lock
2. release the GIL in WorkNCCL destructor since the shared tensor will be destructed there

Test plan:

Run with torchft + torchtitan

```
REPLICA_GROUP_ID=0 NGPU=2 CUDA_VISIBLE_DEVICES=0,1 CONFIG_FILE=./torchtitan/models/llama/train_configs/llama3_8b.toml ./run_train.sh --training.data_parallel_shard_degree=2 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=0 --metrics.log_freq=1 --training.seq_len 4096

...

[rank0]:[titan] 2025-03-07 17:51:31,387 - root - INFO - step: 61  loss:  7.4825  memory: 79.73GiB(83.89%)  tps: 317  tflops: 16.34  mfu: 1.65%
```

Check py-spy to verify no bottleneck on IPC lock when creating new shared tensors

![20250307_17h50m10s_grim](https://github.com/user-attachments/assets/fa8b359f-e337-4ed5-be22-a42ab2bee03d)
![20250307_17h50m00s_grim](https://github.com/user-attachments/assets/206f869a-f07e-4fbd-9e28-89b3da95ef6e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148805
Approved by: https://github.com/Skylion007, https://github.com/fegin, https://github.com/zdevito
2025-03-10 19:47:04 +00:00
2e4874e48d Update RELEASE.md with latest changes to release process and release 2.7 information (#148888)
1. Update for Release 2.7 compatibility matrix
2. Remove mention of builder project, the scripts for release management were migrated to test-infra

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148888
Approved by: https://github.com/albanD, https://github.com/ZainRizvi
2025-03-10 19:20:27 +00:00
clr
6b0fd741d1 dynamo: Count number of opcodes processed (#147149)
This gives us a decent proxy for how big of a graph we functionally had to parse.

Note that this is a cumulative counter. If people feel strongly, I can either write into the dynamo_timed datasets with metrics contexts, or clear the counters / write a counter per frame id as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147149
Approved by: https://github.com/jansel
2025-03-10 19:20:09 +00:00
3129faf8be Optimize shard_dim_alltoall to use alltoall_single (#148868)
As titled. Previously, `shard_dim_alltoall` used `all_to_all`, which could incur many copies if the tensor became non-contiguous during splits; alltoall itself also incurs copies.

This PR uses `alltoall_single` instead, to minimize tensor copies.

Tested on all the shard dim change tests, and it works properly:
```
pytest test/distributed/tensor/test_redistribute.py -s -k shard_dim_alltoall
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148868
Approved by: https://github.com/tianyu-l
2025-03-10 18:38:12 +00:00
ed7e964f2b codecache.py: use str.format rather than % formatting (#148691)
Additionally, swaps over a fixed length `std::vector` used by `cpp_wrapper` for a `std::array`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148691
Approved by: https://github.com/desertfire
2025-03-10 18:33:58 +00:00
d1f21d8ec3 Enable Direct Use of Arm Compute Library (ACL) in ATen (#148584)
ACL is already built with PyTorch as a shared library when USE_MKLDNN_ACL is set.
Currently, it is only used indirectly in ATen via oneDNN for AArch64 targets. However, there are cases where it makes sense to use ACL directly without oneDNN as an intermediary, e.g. quantization. See #145942, #147337, #146620.
This patch enables such use cases by exposing ACL to ATen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148584
Approved by: https://github.com/malfet
2025-03-10 18:29:51 +00:00
00cabd4235 [Inductor][Windows] add env_var switch to turn all Windows inductor UTs. (#148733)
Because of timeouts, we can't turn on all Windows Inductor UTs in CI: https://github.com/pytorch/pytorch/issues/135927
Without the UTs, we can't ensure Windows inductor quality.

The Intel team will run some local tests for Windows inductor, but we still need a switch to turn on the full set of Windows inductor UTs.

The switch is an environment variable:
```cmd
set TORCHINDUCTOR_WINDOWS_TESTS=1
```
After setting this environment variable, all Windows inductor UTs are turned on. It does not affect PyTorch CI.
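
An opt-in switch of this kind is commonly wired up with `unittest.skipUnless`. The sketch below only illustrates the pattern: the variable name comes from the commit message, but how PyTorch actually reads it is not shown in the source, and `windows_inductor_test` and `ExampleTests` are hypothetical names.

```python
import os
import unittest

# Variable name taken from the commit message; the exact check is an assumption.
RUN_ALL_WINDOWS_TESTS = os.environ.get("TORCHINDUCTOR_WINDOWS_TESTS") == "1"


def windows_inductor_test(fn):
    """Hypothetical decorator: skip the test unless the opt-in variable is set."""
    return unittest.skipUnless(
        RUN_ALL_WINDOWS_TESTS, "set TORCHINDUCTOR_WINDOWS_TESTS=1 to run"
    )(fn)


class ExampleTests(unittest.TestCase):
    @windows_inductor_test
    def test_placeholder(self):
        self.assertTrue(True)
```

With the variable unset, the test is reported as skipped rather than failed, so a default CI run stays unaffected.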

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148733
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-03-10 18:25:29 +00:00
4c13a859e5 Workaround no triton float8_e8m0fnu support in inductor (#148722)
Triton doesn't support actual float8_e8m0fnu yet, so we can't currently codegen any arithmetic on them. But we can support bitcasting, and view/memory operators and treat them as uint8 for now. Fix for https://github.com/pytorch/pytorch/issues/147873.

The one question I'm not sure of is whether we need to explicitly disable Triton template fusion, since it would fuse in these dtypes as uint8.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148722
Approved by: https://github.com/vkuzo
ghstack dependencies: #148450
2025-03-10 17:37:39 +00:00
cyy
203dd18c5c Bump Clang-tidy to 19.1.4 (#148648)
Clang-tidy 19 has more powerful clang-analyzer checks that detect subtle bugs. New checks such as misc-use-internal-linkage can help identify variables or functions that could be static, thus reducing binary sizes.

Some new checks are disabled temporarily for later enabling. Additional warnings have been fixed or suppressed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148648
Approved by: https://github.com/Skylion007
2025-03-10 17:32:30 +00:00
ebd087e4b5 Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)"
This reverts commit f08146b67bab331f7bdc9fa247f526f6e60a7190.

Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2711299830))
2025-03-10 17:19:21 +00:00
2ec9aceaeb Revert "Move aoti_torch_cpu__weight_int4pack_mm_cpu_tensor to not be mangled (#148834)"
This reverts commit 3680e666d8ceaa43069555f821d1e8a5de01d5ab.

Reverted https://github.com/pytorch/pytorch/pull/148834 on behalf of https://github.com/janeyx99 due to sorry I don't think I want this PR in before the branch cut, as it'd freeze the API in the file when it should really be in a different header ([comment](https://github.com/pytorch/pytorch/pull/148834#issuecomment-2711162193))
2025-03-10 16:29:40 +00:00
9dbc2527dc Disable some SVE autovec (#148489)
Summary: autovec miscompiles on patterns of the type:
```cpp
for (const auto i : c10::irange())
```
Same issue as described in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001 and addressed by https://github.com/pytorch/pytorch/pull/137795 for gcc, but not clang

Test Plan:
buck2 build //caffe2/caffe2/fb/transforms:sigrid_interface

Differential Revision: D70422723

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148489
Approved by: https://github.com/malfet
2025-03-10 16:25:00 +00:00
a60b4ed623 [fx] Optimize TracerBase.create_arg and Graph._gen_python_code (#148292)
Before: 19502951 function calls (18702776 primitive calls) in 8.533 seconds
After: 16402551 function calls (15602452 primitive calls) in 7.701 seconds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148292
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261, #148288
2025-03-10 16:06:19 +00:00
8f858e226b [fx] Optimizations for node name generation (#148288)
Before:
![image](https://github.com/user-attachments/assets/3a9ed22b-ae33-41ec-a0db-01f4f3ca2ffe)

After:
![image](https://github.com/user-attachments/assets/44c6e578-c63e-4a43-b3e0-d11d4bdbb6db)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148288
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261
2025-03-10 16:06:19 +00:00
5d4e7d58b4 [fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```
after:
```
20003454 function calls (19203257 primitive calls) in 8.936 seconds
```
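
Summary lines of the form "N function calls (M primitive calls) in S seconds" are what Python's `cProfile` and `pstats` modules print. A minimal sketch of producing one is shown below; it uses a stand-in workload so that it runs without PyTorch, and `profile_summary` is a hypothetical helper, not something from the PR.

```python
import cProfile
import functools
import io
import operator
import pstats


def profile_summary(workload) -> str:
    """Run `workload` under cProfile and return the pstats summary text."""
    profiler = cProfile.Profile()
    profiler.enable()
    workload()
    profiler.disable()
    out = io.StringIO()
    # A restriction of 0 rows keeps only the header with the call totals.
    pstats.Stats(profiler, stream=out).print_stats(0)
    return out.getvalue()


# Stand-in workload; the commit profiled fx.symbolic_trace(...) on a similar
# functools.reduce expression instead.
summary = profile_summary(lambda: functools.reduce(operator.add, range(100000)))
print(summary)
```

Absolute numbers from such a run depend on the machine, so only before/after comparisons on the same setup are meaningful.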

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148261
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260
2025-03-10 16:06:11 +00:00
bf752c36da [fx] Move Node._update_args_kwargs to C++ (#148260)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```
after:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148260
Approved by: https://github.com/oulgen
ghstack dependencies: #148243
2025-03-10 16:06:02 +00:00
bec7bdad47 [fx] Move map_aggregate to C++ (#148243)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
30603618 function calls (29403419 primitive calls) in 13.744 seconds
```
after:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```
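
The "function calls (primitive calls) in N seconds" lines have the shape of a `cProfile`/`pstats` summary. A minimal sketch of collecting such a summary follows; `workload` is a stand-in (an assumption, so that the snippet runs without torch) for the `fx.symbolic_trace(...)` call quoted above, and `profile_summary` is a hypothetical helper name.

```python
import cProfile
import functools
import io
import operator
import pstats


def workload(n: int = 10_000) -> int:
    # Stand-in for the fx.symbolic_trace(...) microbenchmark; swap in the
    # real call when torch is available.
    return functools.reduce(operator.add, range(n))


def profile_summary(fn) -> str:
    """Run `fn` under cProfile and return the pstats report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    fn()
    profiler.disable()
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
    return buf.getvalue()


summary = profile_summary(workload)
# Print only the headline, i.e. the line that reports the call counts.
print(next(line.strip() for line in summary.splitlines() if "function calls" in line))
```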

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148243
Approved by: https://github.com/oulgen
2025-03-10 16:05:53 +00:00
cyy
b8b1b364c9 Fix invalid format string in libfmt calls (#148855)
Wrap shaderSource inside fmt::runtime because the format string is not a string literal and can't pass libfmt's compile time check in C++23
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148855
Approved by: https://github.com/Skylion007
2025-03-10 14:47:52 +00:00
a81751d8b7 [CD] Annotate linux/arm64 cuda wheels with consistent nvidia dependencies (#145021)
This resolves issues installing torch nightly wheels into a `uv sync`-generated `.venv`

The root cause is that the x64 and arm64 cuda nightly wheels have inconsistent metadata. This can be seen comparing `generated-linux-aarch64-binary-manywheel-nightly.yml` and `generated-linux-binary-manywheel-nightly.yml`

`uv` expects consistency:

https://github.com/astral-sh/uv/issues/10693
>Frankly, it's really not ideal that they change their dependencies from wheel to wheel.
>They could still put the dependencies there with the same platform markers they're using in the other wheel though... 🤷‍♀

https://github.com/astral-sh/uv/issues/10119#issuecomment-2559898792
>I think this is something that basically has to be solved by PyTorch. The issue is that the wheels for `2.6.0.dev20241222+cu126` don't have consistent metadata, and it's a fundamental assumption of uv that the metadata for a given version _is_ consistent.
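
The consistency requirement quoted above can be illustrated with a small check. This is a sketch only: the requirement strings are placeholders rather than the real nvidia package pins, and `inconsistent_requirements` is a hypothetical helper, not something uv or PyTorch ships.

```python
def inconsistent_requirements(wheels: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each wheel to the requirements it lacks relative to the union."""
    union = set().union(*wheels.values())
    return {name: union - reqs for name, reqs in wheels.items() if union - reqs}


# Placeholder metadata for two wheels of the same version. The x64 wheel
# lists a marker-guarded dependency, the arm64 wheel lists nothing.
marker_dep = "example-cuda-runtime; platform_machine == 'x86_64'"
before_fix = {"x86_64": {marker_dep}, "aarch64": set()}
# After the fix both wheels carry the same marker-guarded entries.
after_fix = {"x86_64": {marker_dep}, "aarch64": {marker_dep}}

print(inconsistent_requirements(before_fix))
print(inconsistent_requirements(after_fix))
```

Because the dependency carries a platform marker, listing it in both wheels does not change what gets installed on either platform; it only makes the metadata identical.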

To resolve this, I modified the arm64 nightly build workflow to add two new `PYTORCH_EXTRA_INSTALL_REQUIREMENTS` entries, under `manywheel-py3_11-cuda-aarch64-build` and `manywheel-py3_12-cuda-aarch64-build`. These are based on their equivalents in the x64 workflow for the corresponding python versions.

I used the cuda 12.6 dependencies versions for the nvidia packages, to match the `DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.6-main` being used by these jobs.

(The arm64 workflow file already had several `PYTORCH_EXTRA_INSTALL_REQUIREMENTS` entries, under various cpu wheels. I'm not sure why these are there, but I left them as-is.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145021
Approved by: https://github.com/seemethere, https://github.com/atalman

Co-authored-by: Eli Uriegas <eliuriegas@meta.com>
Co-authored-by: Andrey Talman <atalman@fb.com>
2025-03-10 14:39:39 +00:00
4fdd076907 [CD] Add triton xpu as dependency of torch xpu windows whl (#148755)
Depends on PR #147637 landing first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148755
Approved by: https://github.com/atalman
2025-03-10 14:04:30 +00:00
31625b08b8 Add ccode for FloorDiv (#148727)
Summary: Add ccode for FloorDiv

Test Plan: CIs

Differential Revision: D70749021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148727
Approved by: https://github.com/bobrenjc93
2025-03-10 14:00:18 +00:00
2068235c0a Add timm_efficientnet to flaky models after cuda 12.6 update in CI/CD (#148788)
After https://github.com/pytorch/pytorch/pull/148612, this model has become flaky.

Tracking this regression in an issue : https://github.com/pytorch/pytorch/issues/148699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148788
Approved by: https://github.com/izaitsevfb, https://github.com/malfet
2025-03-10 13:40:41 +00:00
68c12ecfe2 Move get accelerator to use build time flags when possible (#146098)
This PR does two main things (they are in a single PR to show how the newly added APIs are used).

- Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See the inline docs for their exact semantics
- Use the newly added isBuilt for accelerator check to ensure it does not poison fork

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang, https://github.com/jeromean

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-10 13:17:58 +00:00
098494e9cb [dynamo] allow global import from collections import deque in user code (#148676)
See https://github.com/pytorch/pytorch/pull/148669#discussion_r1983462218 for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148676
Approved by: https://github.com/jansel
2025-03-10 13:14:05 +00:00
59f14d19ae Implement gradient for the residuals of torch.linalg.lstsq (#148526)
Fixes #147543.

I have written some tests in python using `gradcheck`. Please advise where I should put these tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148526
Approved by: https://github.com/lezcano
2025-03-10 12:35:09 +00:00
ea86b8d315 Fix redistribution cost for all-reduce (#148761)
This issue seems to have been introduced in https://github.com/pytorch/pytorch/pull/119897. With the current implementation, it might be more favorable to perform a reduce_scatter followed by an all-gather than simply an all-reduce.
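
To see why a cost model should not prefer the two-step path, here is a sketch using the textbook ring-algorithm estimates. This is an assumption for illustration, not the formula DTensor's redistribution cost actually uses; `LAUNCH_OVERHEAD` and all function names are hypothetical.

```python
LAUNCH_OVERHEAD = 1.0  # hypothetical fixed cost per collective, arbitrary units


def reduce_scatter_cost(size: float, n: int) -> float:
    # Ring reduce-scatter: each rank sends (n - 1) chunks of size / n.
    return LAUNCH_OVERHEAD + (n - 1) / n * size


def all_gather_cost(size: float, n: int) -> float:
    # Ring all-gather moves the same volume as reduce-scatter.
    return LAUNCH_OVERHEAD + (n - 1) / n * size


def all_reduce_cost(size: float, n: int) -> float:
    # Ring all-reduce is a reduce-scatter followed by an all-gather,
    # issued as a single collective.
    return LAUNCH_OVERHEAD + 2 * (n - 1) / n * size


size, ranks = 1024.0, 8
two_step = reduce_scatter_cost(size, ranks) + all_gather_cost(size, ranks)
one_step = all_reduce_cost(size, ranks)
print(f"all_reduce={one_step:.1f}, reduce_scatter+all_gather={two_step:.1f}")
```

Under these assumptions the single all-reduce is never more expensive than the two-step sequence, which is the ordering the fix restores.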

Thanks @lw for the helpful discussions on getting this PR out!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148761
Approved by: https://github.com/Skylion007, https://github.com/lw, https://github.com/tianyu-l, https://github.com/fegin
2025-03-10 12:13:11 +00:00
526524b489 Update slow tests (#148873)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148873
Approved by: https://github.com/pytorchbot
2025-03-10 11:46:30 +00:00
74da76f67c [ROCm][Windows] Fix ROCm/HIP version header (#148560)
On Windows, ROCm libraries do not have a `<rocm-core/rocm_version.h>` header, which causes the compilation to fail. This PR resolves the problem by using `<hip/hip_version.h>` from the HIP SDK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148560
Approved by: https://github.com/jeffdaily
2025-03-10 11:28:13 +00:00
00199acdb8 [inductor][triton] Block ptr analysis fix assert on matched index expression (#148446)
If dynamic shapes are enabled, then block analysis may create new precomputed size replacements from the index which can lead to an assertion failure when the matched index is compared with the original index. For example the below assertion fails, despite the expressions being equivalent (ps2 = 3 * ps0). This can be resolved by updating the original index with the replacements, or simply removing the replacements when the expressions are tested to be equal - the latter option is implemented in this PR.

```
       torch._inductor.exc.InductorError: AssertionError:
E       Invalid match!
E       Index: 3*ps0*((yindex//3)) + (ModularIndexing(yindex, 1, 3))
E       Matched expression: ps2*((yindex//3)) + (ModularIndexing(yindex, 1, 3))
E
```
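
The equivalence of the two expressions in the assertion message can be checked numerically. The sketch below assumes that `ModularIndexing(base, divisor, modulus)` means `(base // divisor) % modulus`; the function names are illustrative and this is not Inductor code.

```python
def modular_indexing(base: int, divisor: int, modulus: int) -> int:
    # Assumed semantics of ModularIndexing: (base // divisor) % modulus.
    return (base // divisor) % modulus


def original_index(yindex: int, ps0: int) -> int:
    return 3 * ps0 * (yindex // 3) + modular_indexing(yindex, 1, 3)


def matched_index(yindex: int, ps2: int) -> int:
    return ps2 * (yindex // 3) + modular_indexing(yindex, 1, 3)


ps0 = 5
ps2 = 3 * ps0  # the precomputed size replacement from the description
same = all(original_index(y, ps0) == matched_index(y, ps2) for y in range(64))
print(same)
```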

This PR fixes the test below when `config.triton.use_block_ptr=True`:
```
python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesCpuTests.test_conv3d_channels_last_dynamic_shapes_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148446
Approved by: https://github.com/jansel
2025-03-10 05:26:55 +00:00
3680e666d8 Move aoti_torch_cpu__weight_int4pack_mm_cpu_tensor to not be mangled (#148834)
I noticed that this op was likely intended to be in the `extern "C"` portion of the file, but it was not added as such in https://github.com/pytorch/pytorch/pull/145250 which means this function is actually not stable/would get mangled by C++.

Following the thread there I am thinking there are two possible solutions:
(1) Since this op was never stable to begin with, and @Xia-Weiwen already landed the fallback, maybe this op is deletable + should get deleted before the 2.7 branch cut
(2) Or we could just move the op to the right portion of the code. While I like just deleting the op, I am hesitant to do in case there's something I haven't considered, so this PR does option 2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148834
Approved by: https://github.com/desertfire
2025-03-10 03:23:48 +00:00
7ae0ce6360 [cutlass backend] fix assertion that prevent self multiplication (#148233)
# Problem:
In a matmul, sometimes some of the nodes are the same. Say `A @ A`. In that case, when writing the stride of node B, we have to figure out if we want lda or ldb, which points to the same node, and we have no way to differentiate which one.

# Solution
Just use whichever. Since they are the same.

# Question
What if we compile with `A @ A`, and then pass in `A @ B`? Well inductor guards will raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148233
Approved by: https://github.com/ColinPeppler
2025-03-10 00:21:36 +00:00
b47d81682d [cutlass backend] Forward fix for less aligned gemm shapes (#148521)
Differential Revision: [D70600093](https://our.internmc.facebook.com/intern/diff/D70600093/)

1. Check if config name filtering still works.
Tested, it works

2. Do we get a C++ compile error?
Yes, potentially we need to filter them out manually.

Here we get this.
```
static_assert(threads_minor == 0 || (TileSizeK % threads_minor == 0));
```
We need to move some assertions to gemm_template.py
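
A Python-side pre-filter mirroring that `static_assert` could look like the sketch below. How `threads_minor` is derived for a given config is not shown in this message, so it is taken as an input here; the function name and the candidate list are hypothetical.

```python
def passes_threads_minor_check(tile_size_k: int, threads_minor: int) -> bool:
    """Python mirror of the static_assert quoted above."""
    # `or` short-circuits, so the modulo is never evaluated with a zero divisor.
    return threads_minor == 0 or tile_size_k % threads_minor == 0


# Hypothetical candidates as (TileSizeK, threads_minor) pairs.
candidates = [(64, 8), (64, 0), (48, 32)]
kept = [c for c in candidates if passes_threads_minor_check(*c)]
print(kept)
```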

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148521
Approved by: https://github.com/ColinPeppler
2025-03-10 00:21:24 +00:00
cyy
aac230a511 [MPS] Fix Wreorder-init-list (#148839)
Fixes the following warning:
```
 warning: ISO C++ requires field designators to be specified in declaration order; field 'value' will be initialized after field 'size' [-Wreorder-init-list]
    662 |       return {.value.cf = scalar.to<c10::complex<float>>(), .size = sizeof(int64_t), .type = type};
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148839
Approved by: https://github.com/Skylion007
2025-03-09 23:45:46 +00:00
b95889042c [MPS] Introduce strides unary op (#148468)
By adding following template
```metal
template <typename T, typename F>
kernel void unary_strided(
    device result_of<F, T>* output [[buffer(0)]],
    constant T* input [[buffer(1)]],
    constant long* sizes [[buffer(2)]],
    constant long* input_strides [[buffer(3)]],
    constant long* output_strides [[buffer(4)]],
    constant uint& ndim,
    uint index [[thread_position_in_grid]]) {
  F f;
  int pos[max_ndim];
  pos_from_thread_index(int(index), pos, sizes, ndim);
  const auto input_offs = offset_from_coord(pos, input_strides, ndim);
  const auto output_offs = offset_from_coord(pos, output_strides, ndim);
  output[output_offs] = f(input[input_offs]);
}
```
and instantiating it for all existing unary shaders, which eliminates the need for any intermediate copies.
No extra testing is needed, as those cases are already covered by `test_output_grad_match_corrcoef_cpu_float32` as well as `test_unary_ops_storage_offset_strided`.
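
The indexing scheme of the kernel can be mimicked in plain Python, one loop iteration per GPU thread. The helper bodies are assumptions (the Metal `pos_from_thread_index` and `offset_from_coord` are not shown in this message); in particular, decomposing the index with the last dimension varying fastest is a guess.

```python
def pos_from_index(index: int, sizes: list[int]) -> list[int]:
    """Split a linear index into coordinates, last dimension fastest."""
    pos = [0] * len(sizes)
    for dim in reversed(range(len(sizes))):
        pos[dim] = index % sizes[dim]
        index //= sizes[dim]
    return pos


def offset_from_coord(pos: list[int], strides: list[int]) -> int:
    """Dot product of coordinates and strides."""
    return sum(p * s for p, s in zip(pos, strides))


def unary_strided(f, inp, sizes, input_strides, output_strides):
    numel = 1
    for s in sizes:
        numel *= s
    out = [None] * numel  # assumes the output strides address a dense buffer
    for index in range(numel):  # one iteration per thread in the kernel
        pos = pos_from_index(index, sizes)
        out[offset_from_coord(pos, output_strides)] = f(
            inp[offset_from_coord(pos, input_strides)]
        )
    return out


# A 2x3 contiguous buffer viewed as its 3x2 transpose (strides 1 and 3),
# squared directly into a contiguous 3x2 output without an intermediate copy.
result = unary_strided(lambda v: v * v, [0, 1, 2, 3, 4, 5], [3, 2], [1, 3], [2, 1])
print(result)
```
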
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148468
Approved by: https://github.com/dcci
2025-03-09 22:30:51 +00:00
275a7c5dbb Revert "Add a stable TORCH_LIBRARY to C shim (#148124)"
This reverts commit 327e07ac1dc3351bb5f0ad436760b83590c400aa.

Reverted https://github.com/pytorch/pytorch/pull/148124 on behalf of https://github.com/malfet due to Sorry for reverting your PR, but somehow it caused test failures in newly introduced tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=pull%20%2F%20linux-focal-cuda12.6-py3.10-gcc11-sm89%20%2F%20test%20(default%2C%201&mergeLF=true ([comment](https://github.com/pytorch/pytorch/pull/148124#issuecomment-2709057833))
2025-03-09 20:44:56 +00:00
19a39a7a06 Revert "[dynamo] allow global import from collections import deque in user code (#148676)"
This reverts commit 685fb377131cc684633dc5471e77038988db53f6.

Reverted https://github.com/pytorch/pytorch/pull/148676 on behalf of https://github.com/malfet due to Looks like it broke ROCM, see f1444f006c/1(default%2C%201&mergeLF=true ([comment](https://github.com/pytorch/pytorch/pull/148676#issuecomment-2709057326))
2025-03-09 20:42:03 +00:00
f1444f006c [caffe2/torch] Fixup upstream LLVM (major version 21) API changes (#148833)
The latest LLVM introduced two changes related to `Triple` usage that cause build failures when building pytorch.

## Failure in llvm_codegen.cpp:
Triple is stored in Modules instead of the string: 979c275097

## Failure in llvm_jit.cpp:
Triple argument is removed from LLJITBuilder::... : b18e5b6a36

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148833
Approved by: https://github.com/Skylion007
2025-03-09 18:58:36 +00:00
9a1a2e1516 Better log message to update pr_time_benchmarks/expected_results.csv (#148303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148303
Approved by: https://github.com/Skylion007
2025-03-09 17:12:47 +00:00
a8e3d1984a [Inductor UT][XPU] Skip test case test_cat_max_autotune_triton for known issue. (#148734)
The mm triton template/configs have not been tuned for XPU, and we observe that epilogue fusion does not yield a speedup on XPU because of register spills. As a result, XPU fails the case `test_cat_max_autotune_triton`, which checks the fusion. We'll remove the skip after #146568 is resolved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148734
Approved by: https://github.com/jansel
2025-03-09 15:09:43 +00:00
bb9c426024 Typo Errors fixed in multiple files (#148262)
# Fix typo errors across PyTorch codebase

This PR fixes various spelling errors throughout the PyTorch codebase to improve documentation quality and code readability.

## Changes Made

### Documentation Fixes
- Changed "seperate" to "separate" in multiple files:
  - `setup.py`: Build system documentation
  - `torch/_library/triton.py`: AOT compilation comments
  - `torch/csrc/dynamo/compiled_autograd.h`: Node compilation documentation
  - `torch/export/_unlift.py`: Pass population comments
  - `torch/export/exported_program.py`: Decomposition table notes

### Code Comments and Error Messages
- Changed "occured" to "occurred" in:
  - `test/mobile/test_lite_script_module.py`: Exception handling comments
  - `torch/export/_draft_export.py`: Error message text
  - `aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp`: MAGMA bug comment
  - `torch/csrc/utils/python_numbers.h`: Overflow handling comment
  - `torch/csrc/jit/OVERVIEW.md`: Graph compilation documentation
  - `torch/_dynamo/symbolic_convert.py`: Error explanation

### API Documentation
- Changed "fullfill" to "fulfill" in `torch/distributed/checkpoint/state_dict_loader.py`
- Changed "accross" to "across" in:
  - `torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp`
  - `torch/distributed/distributed_c10d.py`

## Motivation
These changes improve code readability and maintain consistent spelling throughout the codebase. No functional changes were made; this is purely a documentation and comment improvement PR.

## Test Plan
No testing required as these changes only affect comments and documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148262
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-09 12:21:40 +00:00
327e07ac1d Add a stable TORCH_LIBRARY to C shim (#148124)
This PR adds two main parts:
- shim.h stable C APIs into torch::Library APIs
- a higher level API in torch/csrc/stable/library.h that calls into this shim.h + otherwise is self contained

Goal: custom kernel writers should be able to call the apis in the directories above in order to register their library in a way that allows their custom extension to run with a different libtorch version than it was built with.

Subplots resolved:

- Do we want a whole separate StableLibrary or do we want to freeze torch::Library and add `m.stable_impl(cstring, void (*fn)(void **, int64_t, int64_t))` into it
    - Yes, we want a separate StableLibrary. We cannot freeze Library and it is NOT header only.
- Should I use uint64_t as the common denominator instead of void* to support 32bit architectures better?
    - Yes, and done
- Should I add a stable `def` and `fragment` when those can be done in python?
    - I think we do want these --- and now they're done
- Where should library_stable_impl.cpp live? -- no longer relevant
- I need some solid test cases to make sure everything's going ok. I've intentionally thrown in a bunch of random dtypes into the signature, but I still haven't tested returning multiple things, returning nothing, complex dtypes, etc.
    - Have since tested all the torch library endpoints. The others can be tested in a followup to separate components that need to be in shim.h vs can be added later

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148124
Approved by: https://github.com/albanD, https://github.com/zou3519
2025-03-09 10:07:25 +00:00
685fb37713 [dynamo] allow global import from collections import deque in user code (#148676)
See https://github.com/pytorch/pytorch/pull/148669#discussion_r1983462218 for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148676
Approved by: https://github.com/jansel
2025-03-09 09:35:29 +00:00
6566d67bd3 [dynamo] show stack above dynamo in graph break user tracebacks (#148401)
Also show the line of code relevant to a dynamo-compiled frame, instead of just the first line (this was broken for data-dependent jump graph breaks and for 3.11+).

Also collapses resume frames together (use config.verbose to see full stack trace - for developers).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148401
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-03-09 07:37:38 +00:00
2149f6c684 [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode would look different -- collective kernels would show up in the same line as compute kernels.
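
Items 2 to 4 can be pictured with a toy model of CPU-side stashing. Everything below is hypothetical (class and method names included) and only illustrates the lifetime rule; the real logic lives in the C++ `ProcessGroupNCCL` and is not reproduced here.

```python
class StashingWork:
    """Toy stand-in for a collective's work handle."""

    def __init__(self, tensors, async_op: bool):
        # Item 3: only async ops keep their tensors alive via the stash.
        self._stash = list(tensors) if async_op else []
        self._completed = False

    def mark_completed(self) -> None:
        # In this sketch, completion is signalled by hand.
        self._completed = True

    def wait(self) -> None:
        # Normal path: the user waits, so the references can be dropped.
        # (A real wait() would first synchronize with the collective.)
        self._stash.clear()

    def watchdog_poll(self) -> None:
        # Item 4: safety net for users who never call wait().
        if self._completed:
            self._stash.clear()

    def stashed(self) -> int:
        return len(self._stash)


work = StashingWork(["t0", "t1"], async_op=True)
work.watchdog_poll()          # collective still running: keep references
held_while_running = work.stashed()
work.mark_completed()
work.watchdog_poll()          # completion detected: references released
held_after_completion = work.stashed()
print(held_while_running, held_after_completion)
```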

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-09 07:32:23 +00:00
85fe576ee3 [set_linter] allow x in {...} (#148422)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148422
Approved by: https://github.com/Skylion007
2025-03-09 06:43:11 +00:00
9cb25f0ea2 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit 17dbeb11db7afbab792ad76c24840c1552a0e76d.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/janeyx99 due to PR break backward compat test ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2708641172))
2025-03-09 03:01:55 +00:00
17dbeb11db [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode would look different -- collective kernels would show up in the same line as compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-08 20:00:12 +00:00
5245304f1e Update decompositions_for_jvp.py (#148821)
small typo thing that got my eye

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148821
Approved by: https://github.com/Skylion007
2025-03-08 19:08:42 +00:00
148eb735ee Change nvcc arch flags for sm100 (#148774)
### Summary
- Addressing this comment https://github.com/pytorch/pytorch/pull/148274#discussion_r1984944012

### Test plan
- Verified building from source w/ B200s is successful
- Verified B200 tensorcores are still being utilized properly via benchmarking script

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148774
Approved by: https://github.com/Skylion007
2025-03-08 19:05:53 +00:00
7ffadff286 c10d/ProcessGroup: cleanup abort and shutdown (#148798)
This adds `abort` and `shutdown` to `Backend` and `ProcessGroup` objects. This simplifies the logic in `distributed_c10d.py` by having a default noop implementation for all PGs.

This will be useful for torchft and upcoming versions of NCCL which will handle abort correctly. Currently `torchft` would have to call internal methods `_abort` on the PGNCCL object directly but with this change we can now just call `.abort()` and have it work for any PG implementation.
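
The default no-op pattern described above can be sketched in a few lines. The class names mirror the description but this is an illustration only, not the c10d implementation; `AbortableBackend` and `abort_all` are made up.

```python
class Backend:
    """Base class with no-op defaults, so callers need no special cases."""

    def abort(self) -> None:
        pass

    def shutdown(self) -> None:
        pass


class AbortableBackend(Backend):
    """Hypothetical backend that actually supports abort."""

    def __init__(self) -> None:
        self.aborted = False

    def abort(self) -> None:
        self.aborted = True


def abort_all(backends) -> None:
    # No isinstance checks or private-method calls are needed.
    for backend in backends:
        backend.abort()


plain, abortable = Backend(), AbortableBackend()
abort_all([plain, abortable])
print(abortable.aborted)
```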

Test plan:

```
pytest distributed/test_backends.py distributed/test_c10d_common.py distributed/test_c10d_pypg.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148798
Approved by: https://github.com/kwen2501
2025-03-08 18:33:18 +00:00
9841f0ddcf Add support for non functional collectives under FakeTensorMode and fake_pg for memory tracking (#147566)
This PR adds support for non-functional collectives under `FakeTensorMode` and `fake_pg`. It helps eliminate the patching of collectives for memory and runtime estimation.

It also modifies the `ModTracker` to enable the post-backward hook call for modules whose inputs don't require gradients but parameters do.

For the memory tracking, we now enable tracking DTensor dispatcher for custom dispatch functions like `entropy_loss`.
Dispatcher is only enabled for the memory tracking part and disabled as soon as it is done.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147566
Approved by: https://github.com/weifengpy
2025-03-08 18:00:49 +00:00
439782960c Fix typos in SpectralOps.cpp (#148818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148818
Approved by: https://github.com/Skylion007
2025-03-08 17:34:59 +00:00
eqy
849cc058ee [CUDA][TF32] Account for tf32 in test_efficient_conv_bn_eval (#148802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148802
Approved by: https://github.com/Skylion007
2025-03-08 16:17:04 +00:00
c3b05c4a27 [triton 3.3] support both specialize_impl and create_specialize_impl (#148806)
After https://github.com/triton-lang/triton/pull/6099, we sometimes need to do `from triton.runtime.jit import specialize_impl` and sometimes do `triton.runtime.jit.create_specialize_impl()`. This should fix a bunch of the new errors that appeared with the triton 3.3 / pytorch 2.7 integration (e.g. `python test/inductor/test_aot_inductor.py -k test_triton_kernel_equal_to_1_float_arg_dynamic_False_cuda`, failing at https://hud.pytorch.org/pr/pytorch/pytorch/148684#38392501220)
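
One way to support both shapes is to probe the module for whichever attribute exists. This is a sketch under assumptions: `resolve_specialize_impl` is a hypothetical helper, the probing order is a guess, and the two `SimpleNamespace` objects stand in for the older and newer `triton.runtime.jit` modules so the snippet runs without Triton.

```python
from types import SimpleNamespace


def resolve_specialize_impl(jit_module):
    """Return a specialize_impl callable from either API shape."""
    if hasattr(jit_module, "specialize_impl"):
        return jit_module.specialize_impl
    if hasattr(jit_module, "create_specialize_impl"):
        # Newer shape: a factory that builds the function.
        return jit_module.create_specialize_impl()
    raise ImportError("no specialize_impl entry point found")


# Stand-ins for the two module layouts (not real Triton objects).
old_style = SimpleNamespace(specialize_impl=lambda *args: "old")
new_style = SimpleNamespace(create_specialize_impl=lambda: (lambda *args: "new"))

print(resolve_specialize_impl(old_style)(), resolve_specialize_impl(new_style)())
```
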
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148806
Approved by: https://github.com/drisspg
2025-03-08 09:31:52 +00:00
118c9e501a [ONNX] Remove inaccurate test comment (#148813)
Remove the comment that says jit trace strategy doesn't support dynamic shapes as dict because it does support it (which is what the test is testing)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148813
Approved by: https://github.com/cyyever, https://github.com/titaiwangms
2025-03-08 08:55:56 +00:00
3745da18f4 [AOTI] Switch to local cpp compile for fbcode (#148592)
Summary: as title, otherwise we can not find lamdhip64

Test Plan: https://www.internalfb.com/phabricator/paste/view/P1747104431

Differential Revision: D70637798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148592
Approved by: https://github.com/hl475
2025-03-08 08:38:26 +00:00
666508eb17 [aot cache][ca] remove restriction on caching ca's aot inference graph (#148491)
but still can't cache CA's aot inference graph yet: the CA functional ops aren't serializable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148491
Approved by: https://github.com/jamesjwu
ghstack dependencies: #148381
2025-03-08 06:08:26 +00:00
c16cd25cf5 [ca] remove compiled_autograd_tracing (#148381)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148381
Approved by: https://github.com/jansel
2025-03-08 06:08:26 +00:00
5f1c79ba2b [CD] Enable triton xpu windows build (#147637)
Depends on #147727, which introduces triton xpu windows support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147637
Approved by: https://github.com/atalman
2025-03-08 05:28:46 +00:00
cyy
f7c0c230b0 Fix compile errors (#148758)
Fix
```
  /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:91:16: error: invalid application of 'sizeof' to an incomplete type 'torch::jit::AliasDb::WriteRegistry'
     91 |         static_assert(sizeof(_Tp)>0,
        |                       ^~~~~~~~~~~
  /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:399:4: note: in instantiation of member function 'std::default_delete<torch::jit::AliasDb::WriteRegistry>::operator()' requested here
    399 |           get_deleter()(std::move(__ptr));
        |           ^
  ../torch/csrc/jit/ir/alias_analysis.cpp:200:10: note: in instantiation of member function 'std::unique_ptr<torch::jit::AliasDb::WriteRegistry>::~unique_ptr' requested here
    200 | AliasDb::~AliasDb() = default;
        |          ^
  ../torch/csrc/jit/ir/alias_analysis.cpp:200:23: note: in defaulted destructor for 'torch::jit::AliasDb' first required here
    200 | AliasDb::~AliasDb() = default;
        |                       ^
  ../torch/csrc/jit/ir/alias_analysis.h:298:10: note: forward declaration of 'torch::jit::AliasDb::WriteRegistry'
    298 |   struct WriteRegistry;
        |          ^
  1 error generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148758
Approved by: https://github.com/Skylion007
2025-03-08 04:56:42 +00:00
75179fd6e6 [Codemod][AddExplicitStrictExportArg] caffe2/test/inductor (#148781)
Differential Revision: D70575053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148781
Approved by: https://github.com/SherlockNoMad
2025-03-08 04:43:32 +00:00
8f71d4563e Fix rms_norm in fp16/bf16 (#147203)
Fixes #134106. This PR moves the `upcasted_result` down-casting after all computation is done.

Since the multiplication with the weight_opt input is not done in half precision, the current code path does the following: fp16 -> fp32 -> fp16 -> fp32 -> fp16. What we want, though, is to avoid the intermediate down-cast, so this PR proposes: fp16 -> fp32 -> fp16. This results in better accuracy as it avoids the intermediate truncation.
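
The effect of the extra down-cast can be reproduced with the standard library alone. In this sketch Python floats (doubles) stand in for the fp32 intermediate, `struct`'s `"e"` format performs the fp16 rounding, and all function names are illustrative; it is not the ATen implementation.

```python
import math
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


def rms(x, eps=1e-6):
    return math.sqrt(sum(v * v for v in x) / len(x) + eps)


def rms_norm_reference(x, w):
    # High-precision result with no rounding at all.
    r = rms(x)
    return [v / r * wi for v, wi in zip(x, w)]


def rms_norm_old(x, w):
    # Old path: round the normalized value to fp16, scale, round again.
    r = rms(x)
    return [to_fp16(to_fp16(v / r) * wi) for v, wi in zip(x, w)]


def rms_norm_new(x, w):
    # New path: a single down-cast after all computation is done.
    r = rms(x)
    return [to_fp16(v / r * wi) for v, wi in zip(x, w)]


x = [to_fp16(v) for v in (0.1, -1.7, 2.3, 0.05)]
w = [to_fp16(v) for v in (1.3, 0.7, 1.1, 0.9)]
ref = rms_norm_reference(x, w)
err_old = [abs(a - b) for a, b in zip(rms_norm_old(x, w), ref)]
err_new = [abs(a - b) for a, b in zip(rms_norm_new(x, w), ref)]
print(max(err_old), max(err_new))
```

Rounding once returns the fp16 value nearest to the reference, so in this model the new path's error can never exceed the old path's error.
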
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147203
Approved by: https://github.com/eqy
2025-03-08 04:43:18 +00:00
85467ed063 Fix for AOTI + CUDAGraphs when calling from Python (#148601)
**Background**: I've been comparing performance of torch.compile vs. torch.export + AOTI (specifically, loaded from Python) on the Flux model and found a ~1.4% performance decrease with the latter. The trace shows that CUDAGraphs are not utilized for torch.export + AOTI, leading to higher overhead.

When trying to manually CUDAGraph the loaded, previously exported + AOTIed model (thanks to @eellison for the logic here), I get:
```
Error: operation not permitted when stream is capturing
```

@desertfire confirms that this is due to multi-threading logic on the AOTI runtime side (in `AOTIModelContainer` / `AOTIModel`) conflicting with the use of CUDAGraphs.

**Fix**: This PR takes the approach of providing an alternate, single-threaded method for running loaded models with the AOTI runtime. Details:
* Python side introduces a new flag to enable this behavior (needs a better name): `torch._inductor.package.load_package(..., run_single_threaded=False)`
    * This flag is passed down to the C++ side's `AOTIModelPackageLoader`, which passes it to the `CreateAOTIModelRunnerFunc` during `AOTIModelContainerRunner` construction.
* C++ side introduces single-threaded alternatives to model running and model container running:
    * `AOTIModelContainer.run_single_threaded()` / `AOTIModel.run_single_threaded()`. The interfaces match those of `run()`, but the synchronization logic has been removed.
    * Introduces `AOTInductorModelContainerRunSingleThreaded` to AOTI's `interface.h`; this is invoked by the `AOTIModelContainerRunner` utility class when `run_single_threaded=true`.

I've verified on both a small repro and my real-world use case that I can manually CUDAGraph a loaded model that was previously exported + AOTIed.

**Future work:**
* Flip default value to `run_single_threaded=True` as Python-side inference doesn't take advantage of the AOTI runtime thread pool
    * There are some BC concerns here - models need to be re-serialized so the .so contains the new `AOTInductorModelContainerRunSingleThreaded` interface func. We can flip the default value and warn (instead of crashing) if the `AOTInductorModelContainerRunSingleThreaded` symbol does not exist.
* Compose with cudagraph trees as opposed to manual cuda graph wrapping

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148601
Approved by: https://github.com/desertfire
2025-03-08 02:44:14 +00:00
9f170d9d13 [Triton 3.3] Remove ROCm specific mm gemm template (#148662)
Fixes: https://github.com/pytorch/pytorch/issues/147121
Since triton 3.3.x fixes the problem

This needs to be handled in a non-BC-breaking way, so we will condition this change on the triton version.
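
Gating on the Triton version might look like the sketch below. Both helper names are hypothetical and the version string is passed in explicitly so the snippet runs without Triton; only the 3.3 threshold comes from the description above.

```python
import re


def version_tuple(version: str) -> tuple[int, ...]:
    """Extract the leading numeric components of a version string."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def use_rocm_specific_template(triton_version: str) -> bool:
    # Keep the ROCm-specific template only for Triton older than 3.3,
    # where the underlying problem is not yet fixed.
    return version_tuple(triton_version) < (3, 3)


for candidate in ("3.2.0", "3.3.0+git1", "3.10.1"):
    print(candidate, use_rocm_specific_template(candidate))
```

Comparing integer tuples rather than raw strings avoids ordering "3.10" before "3.3".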

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148662
Approved by: https://github.com/davidberard98

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
2025-03-08 01:24:40 +00:00
a89e7c2da9 [Upstream] Wrap log_2_e in tl.constexpr for new 3.3 bump (#148785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148785
Approved by: https://github.com/davidberard98
2025-03-08 01:09:28 +00:00
179b7a0abc Do not crash when compiling quantized LORA models (#148435)
Fixes #148072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148435
Approved by: https://github.com/Valentine233, https://github.com/leslie-fang-intel
2025-03-08 00:02:08 +00:00
24085db082 Don't clear feedback_saver_fns after cache clear (#148723)
Summary:
Since feedback_saver_fns are used for logging, I don't think it makes sense to clear them, and this resulted in weird behavior in user code where disabling caches caused logging code to break.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148723
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-03-07 23:43:59 +00:00
d96c85558a [ONNX] Use torch export to get dynamic shapes for JIT convert strategy (#148627)
Use torch export to get dynamic shapes for JIT converted graph. I just realized we can retrace a converted jit graph with `torch.export` and produce dynamic shapes using `torch.export`.

-	**Prior:** The exporter will produce a **static graph silently** even when dynamic_shapes are provided.
-	**Proposed:** When `dynamic_shapes` is provided and when the strategy is able to handle it, it will succeed

## Why are we still keeping the JIT strategy?

It is useful when users want to convert JIT modules or `.pt` files into ONNX via the new path. Sometimes also useful when there are JIT scripted modules in the nn module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148627
Approved by: https://github.com/titaiwangms
2025-03-07 23:41:50 +00:00
26f8d81037 Enable onednn in pytorch for ppc64le architecture (#143743)
This PR enables oneDNN on the PowerPC (ppc64le) architecture, which makes it possible to quantize models via oneDNN on PowerPC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143743
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-07 23:35:47 +00:00
187d5c0eb1 [logging] Log cudagraphify timings to dynamo_timed (#143220)
Summary: this adds some new dynamo_timed calls in cudagraph_trees, primarily with the aim to add cudagraph-related timing to scuba. Things to note:
* Uses the changes in https://github.com/pytorch/pytorch/pull/141919 to log "runtime" entries
* The logging for chromium/tlparse/scuba relies on us providing a compile_id since it's not available in the environment. A lot of the changes here are just passing around the compile_id
* I believe the spirit of the scuba logging is to capture the overheads of `torch.compile`. Therefore, I'm not adding _every_ dynamo_timed to scuba. For example, "run_eager" is the first real execution of the inductor graph -- it's not cudagraph overhead, per se. Watch out for the two instances of `dynamo_compile_runtime_column_us="runtime_cudagraphify_time_us"`. Those are the spots I believe are _extra_ overhead we'd contribute to torch.compile.

Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only dcgan`:
* tlparse: https://fburl.com/21yrdn8h
* scuba: https://fburl.com/scuba/dynamo_compile/sandbox/wt90wnjz

`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
* tlparse: https://fburl.com/r9mp7uiv
* scuba: https://fburl.com/scuba/dynamo_compile/sandbox/1nvx94re

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143220
Approved by: https://github.com/eellison
2025-03-07 23:07:13 +00:00
f2dfe2d99c [Triton 3.3] [ROCm] Enabled split_scan support for ROCm builds (#147619)
Fixes issue https://github.com/pytorch/pytorch/issues/133228

Enabled split_scan support for ROCm builds.

Must be handled in a non-BC-breaking way, so this functionality is enabled conditionally on the Triton version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147619
Approved by: https://github.com/davidberard98

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: David Berard <davidberard98@gmail.com>
2025-03-07 23:06:21 +00:00
0f852641c2 Revert "[cutlass backend] Forward fix for less aligned gemm shapes (#148521)"
This reverts commit d35a4ddae2345e639001bfee58a0932e96597f2d.

Reverted https://github.com/pytorch/pytorch/pull/148521 on behalf of https://github.com/henrylhtsang due to mistakes when writing the tests ([comment](https://github.com/pytorch/pytorch/pull/148521#issuecomment-2707637965))
2025-03-07 22:42:13 +00:00
755965d2e4 [inductor] fix matmul w/ torch.bucketize epilogue (#148769)
See https://github.com/pytorch/pytorch/issues/148764.

Inductor was codegen-ing wrong shapes for bucketize when it was fused as an epilogue: the binary search helper function requested the shape of the input tensor, and Inductor was generating `[XBLOCK]`, when `XBLOCK` doesn't exist.

As a workaround, this PR removes the `BLOCK_SHAPE` parameter from the helper function (and just uses `values.shape`) so that we don't even have to generate the shape.

This PR also introduces `torch._inductor.config.triton.disallow_failing_autotune_kernels_TESTING_ONLY` to test this behavior. This config is needed to enforce that _all_ autotune kernel candidates pass - otherwise, the fused-bucketize exception just gets caught and an `inf` latency is assigned to it.

Differential Revision: [D70794563](https://our.internmc.facebook.com/intern/diff/D70794563)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148769
Approved by: https://github.com/benjaminglass1, https://github.com/aaronenyeshi
2025-03-07 22:34:13 +00:00
67742128b7 [ROCm] Bump AOTriton to 0.9.2b (#148433)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b:

* Optimize these Non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions do not need padding to power-of-two anymore.
* `is_causal=True` cases are now supported with persistent dynamic algorithm, which requires an atomic tensor but does load balance between different CTAs
* `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs
* The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so`
  + However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten.
* The newly added fused backward kernel will be used for smaller workloads, due to less kernel invocation overhead.
* Support gfx1201 (RX 9070XT). Need to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`
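
To give a feel for what the first item saves, the sketch below computes how much padding each listed head dimension would need if it were rounded up to the next power of two. It is illustrative arithmetic only, not AOTriton code, and the helper names are made up.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


HEAD_DIMS = [48, 80, 96, 160, 192, 224]


def padding_report(head_dims):
    """Return (head_dim, padded_dim, wasted_fraction) for each head dimension."""
    report = []
    for d in head_dims:
        padded = next_power_of_two(d)
        report.append((d, padded, (padded - d) / padded))
    return report


for d, padded, wasted in padding_report(HEAD_DIMS):
    print(f"head_dim={d:>3} padded={padded:>3} wasted={wasted:.0%}")
```

For example, a head dimension of 80 would be padded to 128, so more than a third of the padded width carries no data.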

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433
Approved by: https://github.com/jeffdaily
2025-03-07 22:10:07 +00:00
7b79e17275 [BE] Move cuda12.6 builds to gcc11 (#148740)
I.e. `s/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11/`

Which accidentally fixes undefined symbol reference errors, namely
```
/usr/bin/ld: /var/lib/jenkins/cpp-build/caffe2/build/lib/libtorch_cuda.so: undefined reference to `std::__throw_bad_array_new_length()'
```
Which happens because `libmagma.a`, which was built with gcc-11 (after https://github.com/pytorch/pytorch/pull/148135 ), contains symbols which are defined in `/opt/rh/gcc-toolset-11/root/usr/lib/gcc/x86_64-redhat-linux/11/libstdc++_nonshared.a` but missing from the corresponding library bundled with `g++-9`.

Though I could not figure out what flags one must use to trigger generation of those symbols, see https://godbolt.org/z/E9KfdhzzY or
```
$ echo "int* foo(int x) { return new int[x];}"|g++ -std=c++17 -S -O3 -x c++ -o - -
	.file	""
	.text
	.section	.text.unlikely,"ax",@progbits
.LCOLDB0:
	.text
.LHOTB0:
	.p2align 4
	.globl	_Z3fooi
	.type	_Z3fooi, @function
_Z3fooi:
.LFB0:
	.cfi_startproc
	endbr64
	movslq	%edi, %rdi
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	movabsq	$2305843009213693950, %rax
	cmpq	%rax, %rdi
	ja	.L2
	salq	$2, %rdi
	addq	$8, %rsp
	.cfi_def_cfa_offset 8
	jmp	_Znam@PLT
	.cfi_endproc
	.section	.text.unlikely
	.cfi_startproc
	.type	_Z3fooi.cold, @function
_Z3fooi.cold:
.LFSB0:
.L2:
	.cfi_def_cfa_offset 16
	call	__cxa_throw_bad_array_new_length@PLT
	.cfi_endproc
```

Fixes https://github.com/pytorch/pytorch/issues/148728 and https://github.com/pytorch/pytorch/issues/148495
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148740
Approved by: https://github.com/wdvr, https://github.com/atalman, https://github.com/Skylion007, https://github.com/ZainRizvi
2025-03-07 21:21:12 +00:00
08baaa7d63 [Docs][TunableOp] TunableOp documentation update (#148384)
This PR aligns documentation to what is in the README file:
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/tunable/README.md

and removes the prototype NOTE.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148384
Approved by: https://github.com/jeffdaily, https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-03-07 21:02:49 +00:00
bb94b65da7 Revert "[cutlass backend] fix assertion that prevent self multiplication (#148233)"
This reverts commit 2fb654676f6291f6e27c6bab2761f170516598dd.

Reverted https://github.com/pytorch/pytorch/pull/148233 on behalf of https://github.com/henrylhtsang due to mistake in PR  ([comment](https://github.com/pytorch/pytorch/pull/148233#issuecomment-2707440106))
2025-03-07 20:58:28 +00:00
d8dc700e25 Delete duplicate entry from docker-builds.yml (#148782)
Regression introduced by merge conflict of https://github.com/pytorch/pytorch/pull/148612

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148782
Approved by: https://github.com/atalman
2025-03-07 20:55:46 +00:00
99da439d10 Revert "Remove Cuda 12.4 from nightly Binaries (#148625)"
This reverts commit 1239176fe717839ca5612ac03a4806051225f381.

Reverted https://github.com/pytorch/pytorch/pull/148625 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/148625#issuecomment-2707415005))
2025-03-07 20:47:45 +00:00
6602e632cd Suppress build warnings when gcc-11 is used (#148763)
By decorating the header with `C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wmismatched-new-delete")`
that will suppress following (when building against ancient llvm-9)
```
In file included from /var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_codegen.cpp:24:
/opt/llvm/include/llvm/IR/IRBuilder.h: In member function 'llvm::LoadInst* llvm::IRBuilder<T, Inserter>::CreateLoad(llvm::Type*, llvm::Value*, const llvm::Twine&) [with T = llvm::ConstantFolder; Inserter = llvm::IRBuilderDefaultInserter]':
/opt/llvm/include/llvm/IR/IRBuilder.h:1581:19: error: 'static void llvm::User::operator delete(void*)' called on pointer returned from a mismatched allocation function [-Werror=mismatched-new-delete]
 1581 |     return Insert(new LoadInst(Ty, Ptr), Name);
      |                   ^~~~~~~~~~~~~~~~~~~~~
/opt/llvm/include/llvm/IR/IRBuilder.h:1581:19: note: returned from 'static void* llvm::UnaryInstruction::operator new(size_t)'
```

Probably a reasonable followup would be to disable NNC testing altogether, as the project has been in maintenance mode for a while now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148763
Approved by: https://github.com/Skylion007, https://github.com/ZainRizvi, https://github.com/atalman
ghstack dependencies: #148739
2025-03-07 20:43:35 +00:00
d36391307f [ONNX] Handle error in verification interpreter (#148730)
Use a simple try/except to handle ONNX Runtime errors in the verification interpreter when they happen. One example is that ORT will sometimes produce a list of None for some nodes. I am not sure how that happens yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148730
Approved by: https://github.com/titaiwangms
ghstack dependencies: #148706
2025-03-07 20:24:49 +00:00
aebd2e411f [pytree][easy] lock global registry containers properly for thread-safety (#148750)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148750
Approved by: https://github.com/StrongerXi
2025-03-07 20:04:52 +00:00
6b44a91a62 use statically_known_true instead of guard_size_oblivious in pattern matcher (#147557)
We shouldn't add guards here. Use statically_known_true instead. Internal xref: https://fb.workplace.com/groups/1075192433118967/?multi_permalinks=1609560723015466&comment_id=1610040026300869&notif_id=1740082892544333&notif_t=work_feedback_reaction_generic&ref=notif

Differential Revision: [D69950122](https://our.internmc.facebook.com/intern/diff/D69950122/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147557
Approved by: https://github.com/eellison
2025-03-07 19:17:25 +00:00
b246cd7b82 Revert "Move get accelerator to use build time flags when possible (#146098)"
This reverts commit 17302b4bc837af079d2f6480f07ea2c99b93fb4b.

Reverted https://github.com/pytorch/pytorch/pull/146098 on behalf of https://github.com/albanD due to Still fails with cuda build on a non-gpu machine ([comment](https://github.com/pytorch/pytorch/pull/146098#issuecomment-2707191770))
2025-03-07 18:59:58 +00:00
1239176fe7 Remove Cuda 12.4 from nightly Binaries (#148625)
https://github.com/pytorch/pytorch/issues/145570

removes cuda 12.4 nightly builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148625
Approved by: https://github.com/atalman
2025-03-07 18:56:04 +00:00
61c4074df7 Add Windows Arm64 Nightly Builds (#139760)
This PR creates 3 new workflows for the Windows Arm64 target. The workflows and outputs can be reviewed at the following links:
https://github.com/pytorch/pytorch/actions/workflows/generated-windows-arm64-binary-libtorch-release-nightly.yml
https://github.com/pytorch/pytorch/actions/workflows/generated-windows-arm64-binary-libtorch-debug-nightly.yml
https://github.com/pytorch/pytorch/actions/workflows/generated-windows-arm64-binary-wheel-nightly.yml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139760
Approved by: https://github.com/malfet

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
2025-03-07 18:53:56 +00:00
cyy
e839e4f5bd Fix Wc++98-compat-extra-semi (#148757)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148757
Approved by: https://github.com/Skylion007
2025-03-07 18:49:12 +00:00
0a7ccee1e0 [ROCm][Windows] Disable Composable Kernels and Triton for Windows builds (#147334)
Currently, Composable Kernels and Triton aren't available on Windows. This PR ensures that the files relating to these dependencies are not included during the build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147334
Approved by: https://github.com/jeffdaily
2025-03-07 18:40:49 +00:00
eqy
18c6e00c7b [CUDA Graphs][NCCL] Set event queries to happen under thread-local mode in ProcessGroupNCCL.cpp (#148594)
Should mean we don't need to coordinate the watchdog with CUDAGraph captures anymore

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148594
Approved by: https://github.com/kwen2501
2025-03-07 18:39:02 +00:00
9769618d35 [CI] [inductor] Add cu126 inductor jobs and move away cu124 (#148612)
https://github.com/pytorch/pytorch/issues/145570

breaking https://github.com/pytorch/pytorch/pull/140793 into eager and inductor benchmarks to unblock

It seems many inductor yml files were added after the initial change was prepared.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148612
Approved by: https://github.com/nWEIdia, https://github.com/atalman

Co-authored-by: atalman <atalman@fb.com>
2025-03-07 18:30:14 +00:00
da923afdc7 [MPS][BE] Align bitshift behavior with CPU (#148719)
By casting the argument to output type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148719
Approved by: https://github.com/Skylion007
ghstack dependencies: #148685, #148686
2025-03-07 18:28:14 +00:00
f84710aef4 [MPS] Fix scalar to tensors bitshifts (#148686)
By introducing a concept of non-commutative binary op and renaming all op templates from `bitwise_foo_tensor` and `bitwise_foo_scalar` to `bitwise_foo_tensor_tensor` and `bitwise_foo_tensor_scalar`

Add regression tests

Please note, that for some undefined values MPS and CPU behaviors are different, for example
```
>>> import torch
>>> 4095 >> torch.arange(12, device="mps", dtype=torch.uint8)
tensor([255, 255, 255, 255, 255, 127,  63,  31,  15,   7,   3,   1],
       device='mps:0', dtype=torch.uint8)
>>> 4095 >> torch.arange(12, device="cpu", dtype=torch.uint8)
tensor([255, 127,  63,  31,  15,   7,   3,   1,   0,   0,   0,   0],
       dtype=torch.uint8)
```
Because on CPU scalar is cast to output dtype before operation is performed, but on MPS this happens after the op is done
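
The two orderings can be reproduced with plain integer arithmetic. The sketch below is not PyTorch code; it models the uint8 cast as masking with `0xFF` and uses made-up helper names.

```python
UINT8_MASK = 0xFF


def shift_cast_first(scalar, shifts):
    """CPU-like ordering: wrap the scalar to uint8, then shift."""
    wrapped = scalar & UINT8_MASK
    return [(wrapped >> s) & UINT8_MASK for s in shifts]


def shift_cast_last(scalar, shifts):
    """MPS-like ordering (before the fix): shift first, then wrap to uint8."""
    return [(scalar >> s) & UINT8_MASK for s in shifts]


shifts = list(range(12))
print(shift_cast_first(4095, shifts))
print(shift_cast_last(4095, shifts))
```

Casting first discards the upper bits of 4095 before any shift happens, which is why the CPU result reaches zero after eight shifts while the other ordering still sees the high bits.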

Fixes https://github.com/pytorch/pytorch/issues/147889
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148686
Approved by: https://github.com/albanD
ghstack dependencies: #148685
2025-03-07 18:28:14 +00:00
cyy
116c1e42c5 Re-enable tests (#148732)
No UBSAN failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148732
Approved by: https://github.com/Skylion007
2025-03-07 18:11:57 +00:00
8059ead823 [ROCm] Incorporate ROCm triton specific tuning parameters (#148437)
Splitting https://github.com/pytorch/pytorch/pull/147315 into two PRs. This PR adds general support for kpack and waves_per_eu triton kernel args for AMD backend. More detail in the PR above.

A follow up PR will update the configs used by ROCm but this requires https://github.com/pytorch/pytorch/pull/147452 to land first

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148437
Approved by: https://github.com/eellison, https://github.com/jansel
2025-03-07 18:09:47 +00:00
a3b77d434a Subprocess compile (attempt 2) (#148635)
Add a mode to fx_codegen_and_compile() to compile in a separate process. This is to prepare for async compile where we'll compile and run eager in parallel (and also be able to move the compile phase to a remote computer).

Added a test which runs the test_torchinductor tests with subprocess compiling turned on.

Fixed the test which caused the previous version (#146134) to be reverted:
```
$ PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TEST_WITH_SLOW=1 PYTORCH_TEST_SKIP_FAST=1 python test/inductor/test_compile_subprocess.py CpuTests.test_conv_bn_fuse_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148635
Approved by: https://github.com/jamesjwu
2025-03-07 17:50:14 +00:00
50c9f6d83b [Windows][Inductor][XPU] Unload triton pyd files to be able to remove them on Windows. (#148323)
In `fresh_inductor_cache`, removing pyd files raises a permission error
on Windows because they are still in use by the process.
So we clear the references to the loaded pyd library objects and unload them
from the process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148323
Approved by: https://github.com/jansel
ghstack dependencies: #148534, #148538, #147727
2025-03-07 17:19:59 +00:00
d05694807d [XPU][Inductor] Update Intel triton for release 2.7. (#147727)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147727
Approved by: https://github.com/EikanWang, https://github.com/Skylion007
ghstack dependencies: #148534, #148538
2025-03-07 17:19:59 +00:00
136b8165d1 [DCP] Save Plan Caching: Fix the missing all_plans update in the cache. (#148577)
Summary: Save Plan Caching: Fix the missing all_plans update in the cache.

Test Plan:
```
buck2 test //aiplatform/modelstore/experimental/integration_tests/tests/nosan:checkpoint_dist_save_load_test
```

https://www.internalfb.com/intern/testinfra/testrun/17451448626323264

Reviewed By: MeetVadakkanchery

Differential Revision: D70229019

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148577
Approved by: https://github.com/MeetVadakkanchery
2025-03-07 17:00:59 +00:00
abcca2fcbb Revert "Fix torch.nn.functional.hardswish gradients corner case (#148049)"
This reverts commit 29b28e9d9f93d78092099a44a7bcc28cfbae06e3.

Reverted https://github.com/pytorch/pytorch/pull/148049 on behalf of https://github.com/soulitzer due to This may be causing an accuracy failure on inductor ([comment](https://github.com/pytorch/pytorch/pull/148049#issuecomment-2706839169))
2025-03-07 16:05:56 +00:00
17302b4bc8 Move get accelerator to use build time flags when possible (#146098)
This PR does two main things (they are in a single PR to show how the newly added APIs are used).

- Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See inline doc for their exact semantic
- Use the newly added isBuilt for accelerator check to ensure it does not poison fork

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang, https://github.com/jeromean

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-07 15:19:34 +00:00
d54b2b7fa7 [BE] Delete split builds (#148739)
They have been disabled since Oct 2024, so it is perhaps time to remove them from the workflows

See https://github.com/pytorch/pytorch/issues/138750
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148739
Approved by: https://github.com/atalman
2025-03-07 15:10:50 +00:00
372ad7b181 Enable FSDP2 on HPU device (#148667)
The motivation of this PR is to enable FSDP2 collectives for HPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148667
Approved by: https://github.com/wconstab
2025-03-07 14:33:43 +00:00
81847d08cf [Intel GPU][quant] Refine zero-point memory creation (#148640)
# Motivation
This PR skips zero-point GPU memory creation when zero-point=0, as it would not be used by the oneDNN library. This saves the overhead of 1~3 H2D copies per QLinear/QConv kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148640
Approved by: https://github.com/liangan1, https://github.com/EikanWang
2025-03-07 13:49:19 +00:00
f80aad62fa Improve Pareto frontier plot for AutoAC (#148678)
This was added in https://github.com/pytorch/pytorch/pull/126320. It's a very nice feature, which can be used to predict memory usage for different budget values.

However, it had some limitations, notably in terms of resolution (it only sampled 21 points across the whole range thus missed many threshold values) and in distributed settings.

Here I fix those by using recursive binary searches to identify all thresholds (up to a resolution of 1e-3, which can be made configurable) and output them in SVG (to be able to discern different points), plus I add the rank to the filename and store it in a user-defined directory.
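
The recursive binary search can be sketched as follows. This is a simplified stand-in, not the PR's implementation: it assumes the quantity being plotted is a monotone step function of the budget, and `toy_memory` with its numbers is invented for illustration.

```python
def find_thresholds(fn, lo, hi, resolution=1e-3):
    """Locate every point in (lo, hi] where the step function `fn` changes value.

    Assumes `fn` is monotone, so equal values at both ends of an interval
    mean there is no threshold inside it.
    """
    if fn(lo) == fn(hi):
        return []
    if hi - lo <= resolution:
        return [hi]
    mid = (lo + hi) / 2
    return (find_thresholds(fn, lo, mid, resolution)
            + find_thresholds(fn, mid, hi, resolution))


def toy_memory(budget):
    """Hypothetical memory usage (arbitrary units) as a function of budget."""
    if budget < 0.25:
        return 30
    if budget < 0.6:
        return 60
    return 100


thresholds = find_thresholds(toy_memory, 0.0, 1.0)
print(thresholds)
```

Unlike sampling a fixed grid of 21 points, the search only spends evaluations on intervals that actually contain a change, so every threshold is found to within the chosen resolution.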

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148678
Approved by: https://github.com/Chillee, https://github.com/fmassa
2025-03-07 13:22:29 +00:00
d4d7d813fa Update CURL url for manywheel images (#148343)
It looks like it was moved on the site it was downloaded from.
Switch to official site while updating URL.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148343
Approved by: https://github.com/dr4gon01, https://github.com/janeyx99, https://github.com/atalman, https://github.com/seemethere
2025-03-07 11:41:12 +00:00
6cf360be04 fix lost input mutations with export_tracepoint (#148709)
Preserving module call signatures in the presence of input mutation caused incorrect results. The root cause turned out to be that export tracepoints would unwrap / wrap functional args, which would lose mutation info on those args.

Differential Revision: [D70734821](https://our.internmc.facebook.com/intern/diff/D70734821/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148709
Approved by: https://github.com/angelayi
2025-03-07 09:36:18 +00:00
bb84a23c22 [ROCm] [TunableOp] Enable logging of BLAS parameters (#147034)
This PR supports a logging feature that is being requested.
```
PYTORCH_TUNABLEOP_BLAS_LOG=1
```
Enables the logging of BLAS parameters with either offline or online (in-situ) tuning.

The BLAS parameters are written to the CSV file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147034
Approved by: https://github.com/jeffdaily
2025-03-07 09:32:59 +00:00
243b47e2ec [Intel GPU] Fix SDPA dummy LSE output to match meta function (#148652)
To fix XPU patched UTs including
```bash
pytest -vs third_party/torch-xpu-ops/test/xpu/test_meta_xpu.py::TestMetaXPU::test_dispatch_symbolic_meta_outplace_nn_functional_scaled_dot_product_attention_xpu_bfloat16
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148652
Approved by: https://github.com/EikanWang
2025-03-07 08:36:18 +00:00
416ea1c71c Code Clean: Remove unnecessary code (#148735)
As the title stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148735
Approved by: https://github.com/jingsh, https://github.com/cyyever
2025-03-07 08:15:37 +00:00
4075646bd8 Use oneDNN v3.7.1 for Intel GPU (#148403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148403
Approved by: https://github.com/EikanWang

Co-authored-by: majing <jing1.ma@intel.com>
Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
2025-03-07 08:03:49 +00:00
cyy
3d854ea9bd Remove deprecated std::aligned_storage_t (#148660)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148660
Approved by: https://github.com/swolchok
2025-03-07 07:29:42 +00:00
3f069e7679 [mm_logs] enhance the printing for overview info (#148716)
Summary:
Previously the dynamo counters did not print the count information automatically.

Explicitly added a log message, printed after lowering, with overview info for inductor aten mms.

It will look like the line below, where each name has the form `{aten_op_name}_{m}_{n}_{k}`:
```
torch/_inductor/compile_fx.py:832] [0/0] Overview info of inductor aten mms: (aten.addmm_16_6_16: 1), (name: count), xxx
```
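
A rough sketch of how such keys and the overview line could be assembled. The helpers `mm_counter_key` and `format_overview` are hypothetical; only the key format and the `(name: count)` layout come from the message above.

```python
from collections import Counter


def mm_counter_key(aten_op_name, m, n, k):
    """Build a counter key of the form {aten_op_name}_{m}_{n}_{k}."""
    return f"{aten_op_name}_{m}_{n}_{k}"


def format_overview(counts):
    """Render the counters as a single '(name: count), ...' overview line."""
    entries = ", ".join(
        f"({name}: {count})" for name, count in sorted(counts.items())
    )
    return f"Overview info of inductor aten mms: {entries}"


counts = Counter()
counts[mm_counter_key("aten.addmm", 16, 6, 16)] += 1
print(format_overview(counts))
```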

Test Plan:
```
TORCH_LOGS="+inductor" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_cuda
```

Differential Revision: D70739912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148716
Approved by: https://github.com/henrylhtsang
2025-03-07 05:23:49 +00:00
5f392ae560 Throws error when using torch.cuda.MemPool with expandable segments (#148378)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148378
Approved by: https://github.com/ngimel, https://github.com/eqy
ghstack dependencies: #148374
2025-03-07 05:22:03 +00:00
c0f1557285 [FSDP2][doc] highlight equivalence of set_requires_gradient_sync and no_sync (#148715)
We got asked a few times about FSDP2's equivalent of no_sync. Highlight
set_requires_gradient_sync as the equivalent in the docstring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148715
Approved by: https://github.com/mori360
2025-03-07 04:34:46 +00:00
fe4b88f6aa [HPU] Add hpu to fused kernels supported devices (#148666)
This change adds "hpu" to the list of device types that support fused kernels in the optimizer, ensuring
compatibility with HPU backend.

Without this change, when `test_all_gather_extension_outer_size_stride` of `pytorch/test/distributed/_composable/fsdp/test_fully_shard_extensions.py` is run on 'hpu' backend, it fails with:

RuntimeError: fused=True requires all the params to be floating point Tensors
of supported devices: ['mps', 'cuda', 'xpu', 'cpu', 'privateuseone']
but torch.float32 and hpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148666
Approved by: https://github.com/albanD
2025-03-07 04:28:33 +00:00
33f8ab2f58 [ROCm][TunableOp] Add support for rowwise scaling on scaled GEMM. (#148238)
This PR adds support for rowwise scaling versus tensorwise scaling on scaled GEMM.

There are a few other items included in this PR as well:
- Fixes for offline tuning of scaled GEMM
- Simplification of existing offline UT
- Update existing online UT to also test rowwise versus tensorwise scaled GEMM
- New UT for offline scaled GEMM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148238
Approved by: https://github.com/jeffdaily
2025-03-07 04:12:48 +00:00
cdb4fd0d29 Update win-vs2022-cuda12.1-py3 -> win-vs2022-cuda12.6-py3 (#148717)
Should have been migrated long ago
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148717
Approved by: https://github.com/ZainRizvi, https://github.com/malfet
2025-03-07 03:21:29 +00:00
389b496062 [XPU] Add test/kernel.errors.txt to .gitignore. (#148538)
The Intel GPU user mode driver may generate kernel.errors.txt files in
the current working directory in certain scenarios. The file includes diagnostic
information but does not necessarily indicate an issue with an
application. This is a known issue and will be fixed in a newer version of the driver.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148538
Approved by: https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #148534
2025-03-07 03:12:50 +00:00
9c9b05bc4f Expose functions used in custom backend in torch_python dll (#148213)
Fixes #148208. There are solutions for exposing symbols that are used implicitly through inline functions (i.e., inline function A in foo.h calls non-inline function B, so code that includes foo.h has to see the symbol B in the DLL).

Solution 1: tag the entire struct where the inline functions are defined as member functions with TORCH_PYTHON_API --- this PR does this for python_arg_parser.h. An alternative solution exists but will slow down dispatching a lot --- drop inline keyword and move implementation to .cc file.

Solution 2: tag individual functions with TORCH_PYTHON_API. This PR does this for python_tensor.h.

Related discussion about hiding torch_python symbols: https://github.com/pytorch/pytorch/pull/142214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148213
Approved by: https://github.com/malfet
2025-03-07 02:34:37 +00:00
dfb4094b9c Skip buffer in dense update (#148533)
Summary:
as title.

PyTorch Module buffer will not be published in delta publishing.  In Quinn's previous diff, constant type annotations have been introduced.

In addition to skipping constants, we also need to skip a buffer if it is not found in the user-provided delta weights list.

Test Plan: https://docs.google.com/document/d/1wiqUo0PyZ4g6YJIJlL_LE084ZEuE74iu74gZjqGGjWY/edit?tab=t.0#heading=h.dby6cwiw1xrn

Differential Revision: D69553929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148533
Approved by: https://github.com/22quinn, https://github.com/jingsh
2025-03-07 01:59:58 +00:00
00cd6c07b9 [Intel GPU][pt2e] Enable quantized grouped convolution at XPU (#148522)
# Motivation&Details
This PR fixes a bug that previously blocked quantized grouped convolution. The bug was that grouped convolution requires setting the weight scale mask on both the group dimension and the output channel dimension. This PR fixes the wrong mask in the integration and adds grouped conv to the UT.

# UT
`python test/inductor/test_mkldnn_pattern_matcher.py -k test_qconv2d_xpu`

# Runtime exemplification
`onednn_verbose,v1,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src:s8::blocked:acdb::f0 wei:s8::blocked:abcde::f0 bia:f32::blocked:a::f0 dst:f32::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:3:f32 attr-zero-points:src0:0:s32,alg:convolution_direct,g4mb1_ic128oc128_ih4oh2kh3sh1dh0ph0_iw4ow2kw3sw1dw0pw0,0.0529785`
The verbose output shows that we successfully run into quantized convolution, where the weight is in `abcde` format (group conv).
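
The `wei:3` entry in the verbose line is consistent with a bitmask in which bit i marks weight dimension i as having its own scales. The sketch below shows that arithmetic under two assumptions that are not stated in the PR: dense weights are laid out as (oc, ic, kh, kw) and grouped weights as (g, oc/g, ic/g, kh, kw). `weight_scale_mask` is a made-up helper.

```python
def weight_scale_mask(varying_dims):
    """Set bit i for every weight dimension i along which the scale varies."""
    mask = 0
    for dim in varying_dims:
        mask |= 1 << dim
    return mask


# Dense conv, assumed layout (oc, ic, kh, kw): per-channel scales vary along dim 0.
dense_mask = weight_scale_mask([0])

# Grouped conv, assumed layout (g, oc/g, ic/g, kh, kw): the output channels are
# split across dims 0 and 1, so both bits are needed.
grouped_mask = weight_scale_mask([0, 1])

print(dense_mask, grouped_mask)
```

Under these assumptions, reusing the dense mask for grouped weights would describe scales per group instead of per output channel.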

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148522
Approved by: https://github.com/EikanWang, https://github.com/liangan1, https://github.com/jansel
ghstack dependencies: #148423
2025-03-07 01:57:45 +00:00
127bd5a02d Add sparsity (#148513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148513
Approved by: https://github.com/danielvegamyhre
2025-03-07 01:47:52 +00:00
b4430c3a6d [Intel GPU][pt2e]: Collapse 3D input to 2D for matmul in qlinear_pointwise_binary fusion (#148423)
# Motivation
During the `qlinear_pointwise_binary` lowering pass, dimension collapsing only occurs when the post-op is `add`. For the `sum` post-op, the C++ kernels are responsible for handling the dimensions.

# Details
This PR explicitly reshapes the input from 3D to 2D in the `qlinear_pointwise_binary` op. In addition, we refactor the `qlinear_pointwise_binary.tensor` implementation to call `qlinear_pointwise_binary`, removing duplicated code.
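
The shape bookkeeping behind such a collapse can be sketched in plain Python. This is only an illustration of the idea (flatten the leading dimensions, run a 2D matmul, restore the leading dimensions); the helper names are hypothetical and the real kernel code is not shown in this message.

```python
from math import prod


def collapse_to_2d(shape):
    """(d0, ..., dn-1, K) -> (d0 * ... * dn-1, K)."""
    *lead, k = shape
    return (prod(lead), k)


def restore_leading(shape_2d, original_shape):
    """(M, N) -> (d0, ..., dn-1, N), using the leading dims of the original input."""
    *lead, _ = original_shape
    return (*lead, shape_2d[1])


x_shape = (2, 5, 8)   # assumed 3D activation: (batch, seq, in_features)
w_shape = (8, 16)     # assumed 2D weight: (in_features, out_features)

x2d = collapse_to_2d(x_shape)             # shape seen by the 2D matmul
out2d = (x2d[0], w_shape[1])              # result shape of the 2D matmul
out = restore_leading(out2d, x_shape)     # shape handed back to the caller
print(x2d, out2d, out)
```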

# UT testing
`python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlienar_add_xpu`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148423
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-03-07 01:47:33 +00:00
c8cd8f68bd [dynamo] Properly account for non-list instances in list comparison (#148470)
As title; this patch also removes an unused `list_compare` method.

Fixes #148179.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148470
Approved by: https://github.com/anijain2305
2025-03-07 01:29:30 +00:00
a7fe685be8 Add cpp wrapper skip to cudagraph logs (#148700)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148700
Approved by: https://github.com/jbschlosser
2025-03-07 01:02:40 +00:00
e3087f6d76 [ONNX] Improve verify_onnx_program to use VerificationInterpreter (#148706)
I realized we can just extend `verify_onnx_program` to return intermediate values. There is no need for us to expose the VerificationInterpreter to users.

I added a `compare_intermediates` option to `verify_onnx_program`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148706
Approved by: https://github.com/titaiwangms
2025-03-07 00:40:54 +00:00
cyy
50eb4f3990 Enable UBSAN test (#147511)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147511
Approved by: https://github.com/colesbury
2025-03-07 00:35:32 +00:00
33a285379a [codemod] Remove unused-variable in caffe2/torch/csrc/distributed/c10d/cuda/AsyncMM.cu (#148501)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: dtolnay

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148501
Approved by: https://github.com/Skylion007
2025-03-07 00:33:39 +00:00
a0bc6d81bb [CI][CUDA] Move away from cuda12.4, Add cuda12.6 eager CI tests (#148602)
https://github.com/pytorch/pytorch/issues/145570

breaking https://github.com/pytorch/pytorch/pull/140793/ into eager and inductor benchmarks to unblock

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148602
Approved by: https://github.com/atalman, https://github.com/malfet

Co-authored-by: atalman <atalman@fb.com>
2025-03-07 00:15:04 +00:00
e2a0296e80 [dtensor] add CuDNN SDPA op support to DTensor (#148537)
### Summary
This PR adds `_scaled_dot_product_cudnn_attention` and `_scaled_dot_product_cudnn_attention_backward` to DTensor ops

### Test
`pytest test/distributed/tensor/test_attention.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148537
Approved by: https://github.com/drisspg, https://github.com/fegin
2025-03-06 23:44:40 +00:00
3960f97832 Documents torch.cuda.MemPool API (#148374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148374
Approved by: https://github.com/eqy, https://github.com/ngimel
2025-03-06 23:18:43 +00:00
ed9c8a5d13 ROCm: Disable torch check for Multiplication of two Float8_e5m2 matrices (#148228)
ROCm supports Multiplication of two Float8_e5m2 matrices.
Hence disabling the torch check for ROCm.
Test command (on ROCm h/w supporting fp8)
python test/test_matmul_cuda.py TestFP8MatmulCudaCUDA.test_float8_basics_cuda -v

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148228
Approved by: https://github.com/jeffdaily, https://github.com/petrex
2025-03-06 22:12:45 +00:00
e6800bda7f [Test][Linalg][CUDA] Increase niter in test_svd_lowrank_cuda_float64 (#145930)
A recent PR, #143049, attempted to increase tolerances to make the test pass. However, we are still seeing errors like:
```
Traceback (most recent call last):
  File "~git/pytorch/test/test_linalg.py", line 2540, in test_svd_lowrank
    run_subtest(None, size, (), device, torch.svd_lowrank, density=density)
  File "~git/pytorch/test/test_linalg.py", line 2505, in run_subtest
    self.assertEqual(A, a, rtol=1e-7, atol=2e-7)
  File "~git/pytorch/torch/testing/_internal/common_utils.py", line 4044, in assertEqual
    raise error_metas.pop()[0].to_error(  # type: ignore[index]
AssertionError: Tensor-likes are not close!

Mismatched elements: 90 / 1000000 (0.0%)
Greatest absolute difference: 7.795904016052784e-07 at index (176, 930) (up to 2e-07 allowed)
Greatest relative difference: inf at index (6, 179) (up to 1e-07 allowed)
```
Increasing the `niter` parameter actually decreases the numerical differences.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145930
Approved by: https://github.com/ngimel
2025-03-06 22:10:53 +00:00
75d29443e7 [Docs] update bucketize documentation (#148400)
Fixes #144504

Clarify the documentation for `torch.bucketize` by referencing the existing table. The current version includes a somewhat confusing explanation for the `right` kwarg, whereas the existing table is much clearer.
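
As a rough illustration of what the `right` kwarg changes, the standard-library `bisect` functions behave analogously on a sorted list of boundaries. This is a pure-Python sketch for 1D inputs, not the `torch.bucketize` implementation; the local `bucketize` function is a stand-in.

```python
from bisect import bisect_left, bisect_right


def bucketize(values, boundaries, right=False):
    """Pure-Python analogue for 1D inputs; `boundaries` must be sorted ascending."""
    # right=False: returned i satisfies boundaries[i-1] <  v <= boundaries[i]
    # right=True:  returned i satisfies boundaries[i-1] <= v <  boundaries[i]
    find = bisect_right if right else bisect_left
    return [find(boundaries, v) for v in values]


boundaries = [1, 3, 5, 7, 9]
values = [3, 6, 9]

print(bucketize(values, boundaries))              # [1, 3, 4]
print(bucketize(values, boundaries, right=True))  # [2, 3, 5]
```

The two calls only differ for values that sit exactly on a boundary (3 and 9 here); 6 lands in the same bucket either way.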
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148400
Approved by: https://github.com/benjaminglass1, https://github.com/eellison, https://github.com/albanD
2025-03-06 22:07:52 +00:00
2fb654676f [cutlass backend] fix assertion that prevent self multiplication (#148233)
# Problem:
In a matmul, the two operands are sometimes the same node, e.g. `A @ A`. In that case, when writing the stride of node B, we have to decide between lda and ldb, which both refer to the same node, and we have no way to tell them apart.

# Solution
Just use whichever. Since they are the same.

# Question
What if we compile with `A @ A`, and then pass in `A @ B`? Well inductor guards will raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148233
Approved by: https://github.com/ColinPeppler
2025-03-06 22:02:26 +00:00
d35a4ddae2 [cutlass backend] Forward fix for less aligned gemm shapes (#148521)
Differential Revision: [D70600093](https://our.internmc.facebook.com/intern/diff/D70600093/)

1. Check if config name filtering still works.
Tested, it works

2. do we get C++ compile error
Yes, potentially we need to filter them out manually.

Here we get this.
```
static_assert(threads_minor == 0 || (TileSizeK % threads_minor == 0));
```
We need to move some assertions to gemm_template.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148521
Approved by: https://github.com/ColinPeppler
2025-03-06 22:02:19 +00:00
5a5ac98918 [aarch64] add libcufile for cu126 and cu128 (#148465)
seeing `  File "/usr/local/lib/python3.12/site-packages/torch/__init__.py", line 411, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: libcufile.so.0: cannot open shared object file: No such file or directory` with arm cu128 nightly.
related to https://github.com/pytorch/pytorch/pull/148137
need to copy the dependency for arm build as well

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148465
Approved by: https://github.com/atalman, https://github.com/abhilash1910
2025-03-06 21:39:43 +00:00
3d62e81a1e [DCP] fix dcp gather_object/scatter_object_list (#147675)
gather_object/scatter_object_list's dst is `Destination rank on global process group (regardless of group argument)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147675
Approved by: https://github.com/MeetVadakkanchery
2025-03-06 21:20:38 +00:00
1d7fc0c681 [dynamo] Remove dead code path around functools.partial objects (#148683)
This removes the code paths added in #98120, which have since been
superseded by #108846.

More importantly, it makes `EQUALS_MATCH`'s `ok_mutable_types` (added in #134016)
easier to reason about, i.e., no need to worry about `dict` types, which
was only needed for #98120.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148683
Approved by: https://github.com/yanboliang
2025-03-06 21:20:04 +00:00
262411e48b [inductor] online softmax (#127011)
Softmax needs to do some preparation work that accesses the input tensor in two passes:
- compute amax of each row
- compute (x - amax).exp.sum for each row

When the row size is large, the cache cannot hold all the active data, and accessing the input in multiple passes increases execution time since the kernel is memory-bandwidth bound.

Online softmax uses a customized reduction to compute max and sum at the same time by accessing the data in one pass. Check this paper for more details ( https://arxiv.org/abs/1805.02867 ).

Also here is an online softmax kernel generated by inductor as a reference: https://gist.github.com/shunting314/67ae4fffd45d4f2753c781780332fa54
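
For readers who want the idea without reading the generated kernel, here is a minimal pure-Python sketch of the one-pass max/sum reduction. It is not the inductor code: the function names are made up for illustration, and it assumes finite inputs.

```python
import math


def softmax_two_pass(row):
    """Reference: one pass for the max, a second pass for the sum."""
    m = max(row)
    s = sum(math.exp(x - m) for x in row)
    return [math.exp(x - m) / s for x in row]


def online_max_and_sum(row):
    """Compute max(row) and sum(exp(x - max)) in a single pass over the data."""
    running_max = -math.inf
    running_sum = 0.0
    for x in row:
        new_max = max(running_max, x)
        # Rescale the partial sum to the new max before adding the new term.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    return running_max, running_sum


def softmax_online(row):
    m, s = online_max_and_sum(row)
    return [math.exp(x - m) / s for x in row]


row = [1.0, 3.0, 2.0, -4.0]
print(softmax_two_pass(row))
print(softmax_online(row))
```

The rescaling step is what lets the max and the sum be combined into one reduction: whenever a larger max shows up, the sum accumulated so far is multiplied by `exp(old_max - new_max)`.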

## Microbenchmark

- `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=0 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax` : without online softmax
  - eager_ms=6.671296119689941
  - opt_ms=8.06931209564209
- `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=1 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax`: with online softmax
  - eager_ms=6.634047985076904
  - opt_ms=6.230591773986816

Ideally, online softmax should save about 2ms here. We save about 1.84ms in practice.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127011
Approved by: https://github.com/jansel
2025-03-06 21:07:18 +00:00
cf9efbdf16 Revert "Enable onednn in pytorch for ppc64le architecture (#143743)"
This reverts commit d4cf0e5af406239881acfeb4f9e4f62373faca8b.

Reverted https://github.com/pytorch/pytorch/pull/143743 on behalf of https://github.com/davidberard98 due to windows build failures look related [GH job link](https://github.com/pytorch/pytorch/actions/runs/13705127978/job/38329845095) [HUD commit link](d4cf0e5af4) ([comment](https://github.com/pytorch/pytorch/pull/143743#issuecomment-2704903253))
2025-03-06 20:47:57 +00:00
1add61c242 Replace `unimplemented` with `unimplemented_v2` in `codegen.py` (#148069)
Fixes #147913

- replace `unimplemented` in `codegen.py`
- remove unused import `unimplemented`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148069
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
2025-03-06 20:42:37 +00:00
edd640a95a [BE][Ez]: Use itertools.chain.from_iterable when possible (#148190)
Often makes the code more readable, more efficient, and adds support for infinite iterables.
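
A small standard-library example of the difference, including the infinite-iterable case. The `blocks` generator is made up for illustration.

```python
from itertools import chain, count, islice

nested = [[1, 2], [3], [4, 5]]

# Both spellings flatten one level, but chain(*nested) must unpack the whole
# outer iterable into call arguments before chain even starts.
flat_star = list(chain(*nested))
flat_iter = list(chain.from_iterable(nested))


def blocks():
    """Infinite generator of small lists: [0, 1], [2, 3], [4, 5], ..."""
    for start in count(0, 2):
        yield [start, start + 1]


# from_iterable consumes the outer iterable lazily, so an infinite source is fine.
# chain(*blocks()) would never return, because argument unpacking would have to
# exhaust the generator first.
first_five = list(islice(chain.from_iterable(blocks()), 5))

print(flat_star, flat_iter, first_five)
```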

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148190
Approved by: https://github.com/jansel, https://github.com/malfet
2025-03-06 20:37:06 +00:00
65dbc3b454 [BE][MPS] Remove redundant handle_tensor_scalar_binary_op (#148685)
After https://github.com/pytorch/pytorch/pull/143934 `mtl_setBuffer` can handle scalar tensors correctly, so no need to have a specialized function here
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148685
Approved by: https://github.com/dcci
2025-03-06 19:24:46 +00:00
29b28e9d9f Fix torch.nn.functional.hardswish gradients corner case (#148049)
Fixes #147801

## Changes

- Change the hardswish gradient condition to match the definition in [torch.nn.functional.hardswish](https://pytorch.org/docs/stable/generated/torch.nn.functional.hardswish.html)
- Enable cuda for test `test_hardswish_grad_corner`
- Add test case for value=-3
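
The corner case is easiest to see in scalar form. The sketch below assumes the gradient uses the same branch boundaries as the documented forward definition (0 for x <= -3, x for x >= 3, x * (x + 3) / 6 otherwise); it is plain Python written for illustration, not the kernel code touched by this PR.

```python
def hardswish(x: float) -> float:
    """Forward definition as documented for torch.nn.functional.hardswish."""
    if x <= -3.0:
        return 0.0
    if x >= 3.0:
        return x
    return x * (x + 3.0) / 6.0


def hardswish_grad(x: float) -> float:
    """Derivative of each branch, taken with the same boundaries as the forward (assumption)."""
    if x <= -3.0:
        return 0.0
    if x >= 3.0:
        return 1.0
    return (2.0 * x + 3.0) / 6.0


# The boundary points are where the choice of `<` versus `<=` changes the result:
# evaluating the middle-branch formula at x = -3 or x = 3 would give -0.5 or 1.5.
for x in (-3.0, 0.0, 3.0):
    print(x, hardswish(x), hardswish_grad(x))
```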

## Test Result

```bash
pytest test/test_nn.py -k test_hardswish
pytest test/test_unary_ufuncs.py -k test_hardswish
pytest test/inductor/test_torchinductor.py -k test_hardswish
```

![image](https://github.com/user-attachments/assets/000cb5c4-15f5-4bfd-ab45-f52bf810ff3d)
![image](https://github.com/user-attachments/assets/38b08cf8-ea84-47a2-8e37-0a213da3e0c8)
![image](https://github.com/user-attachments/assets/54bc57be-2c57-46cc-ab90-94ea6cbe1c34)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148049
Approved by: https://github.com/soulitzer
2025-03-06 19:04:52 +00:00
f08146b67b [pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)
Changes in this PR:

1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence.
2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types, like `namedtuple` for Named Tuple types.
3. Change `is_namedtuple` to treat subclasses of a namedtuple class as namedtuples. Before this PR, only namedtuple classes created directly by `collections.namedtuple` or `typing.NamedTuple` counted as namedtuple classes, while their subclasses did not. This PR makes `is_namedtuple` return true for subclasses of a namedtuple class.
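
To illustrate the distinction these checks draw, here is a standard-library sketch. The two `*_sketch` functions are hypothetical stand-ins written for this example; they are not the pytree implementation and may differ from it in edge cases.

```python
import time
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])


class Point3D(Point):
    """A subclass of a namedtuple class (the case item 3 is about)."""

    def norm1(self):
        return abs(self.x) + abs(self.y)


def is_namedtuple_class_sketch(cls) -> bool:
    """Hypothetical check: a tuple subclass whose `_fields` is a tuple of strings."""
    if not (isinstance(cls, type) and issubclass(cls, tuple)):
        return False
    fields = getattr(cls, "_fields", None)
    return isinstance(fields, tuple) and all(isinstance(f, str) for f in fields)


def is_structseq_class_sketch(cls) -> bool:
    """Hypothetical check: struct sequence types carry integer n_* attributes."""
    if not (isinstance(cls, type) and issubclass(cls, tuple)):
        return False
    names = ("n_fields", "n_sequence_fields", "n_unnamed_fields")
    return all(isinstance(getattr(cls, name, None), int) for name in names)


print(is_namedtuple_class_sketch(Point), is_namedtuple_class_sketch(Point3D))
print(is_structseq_class_sketch(time.struct_time), is_structseq_class_sketch(Point))
```

Because `Point3D` inherits `_fields` from `Point`, an attribute-based check accepts it, whereas a check that only looks at direct bases would not.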

Resolves #75982. New tests are included in this PR.

- #75982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
2025-03-06 18:59:02 +00:00
96176e32a9 Revert "[ROCm] Bump AOTriton to 0.9.1b (#148433)"
This reverts commit 8af79b7ec816f5c73536a806aa4c7ea1f7bd3867.

Reverted https://github.com/pytorch/pytorch/pull/148433 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/148433#issuecomment-2704638858))
2025-03-06 18:32:48 +00:00
b85ae06bed Update CPU tolerance for f16 triplet margin loss (#147742)
Currently, the `test_torchinductor_opinfo` test for `nn.functional.triplet_margin_loss` fails on AArch64. This PR increases the acceptable ATOL and RTOL for this test when using F16. There is precedent for this, as XPU and CUDA already increase the tolerance. Additionally, the CPU backend increases the tolerance for the `with_distance_loss` variant of `triplet_margin_loss`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147742
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-06 18:09:43 +00:00
d10bacd4ce [AOTI][dashboard] Skip torchbench models not supported by export (#148359)
Summary: Certain models fail in export because of data-dependent ops. Skip them so that oncall can better track the AOTInductor dashboard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148359
Approved by: https://github.com/angelayi, https://github.com/ysiraichi
2025-03-06 18:08:17 +00:00
d91a634edf [c10d] Make getDefaultBackend more fault tolerant (#148596)
This is a forward fix for #135338.
It hits an error like this:
```
"distributed_c10d.py", line 2156, in destroy_process_group
    if type(pg) == ProcessGroup and pg._has_hooks():
RuntimeError: Could not find the default backend type 0 for Process Group with name undefined.
```

When users call `init_process_group(nothing)`, the default backend is not set, or is set to `undefined`. This produces the error above, which is triggered by the `_has_hooks()` call.

The fix wraps `getDefaultBackend` with a try-catch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148596
Approved by: https://github.com/LucasLLC, https://github.com/fduwjj
2025-03-06 18:07:43 +00:00
d4cf0e5af4 Enable onednn in pytorch for ppc64le architecture (#143743)
This PR enables oneDNN on the PowerPC (ppc64le) architecture, which allows model quantization via oneDNN on PowerPC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143743
Approved by: https://github.com/malfet, https://github.com/albanD
2025-03-06 18:00:55 +00:00
097b0d372a [pytree] fix previously failed dynamo tests (#148669)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148669
Approved by: https://github.com/zou3519
2025-03-06 17:59:29 +00:00
28b68b46bc Revert "[cutlass backend] fix assertion that prevent self multiplication (#148233)"
This reverts commit 4aeca28137dcee74b5fcd0c0636d0ee1f113d5fb.

Reverted https://github.com/pytorch/pytorch/pull/148233 on behalf of https://github.com/henrylhtsang due to mistake in PR  ([comment](https://github.com/pytorch/pytorch/pull/148233#issuecomment-2704534995))
2025-03-06 17:45:49 +00:00
3cde4c3069 [BE] Remove onlyCPU decorator from test_local_scalar_dense (#148559)
Followup from https://github.com/pytorch/pytorch/pull/145717, not sure why author thinks those tests should be limited to one architecture.
And fixed similar crashes for CUDA and MPS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148559
Approved by: https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/seemethere
2025-03-06 17:43:02 +00:00
841451af9f Revert "[Inductor] Avoid tensor slice overflow for large step (#147433)"
This reverts commit 1d7397a2d04a4d636559f41511a20f7dadbe5777.

Reverted https://github.com/pytorch/pytorch/pull/147433 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/147433#issuecomment-2704506627))
2025-03-06 17:33:08 +00:00
679e7d257e [mm_logs] follow up to add count info based on shape for inductor aten.mms (#148623)
Summary:
as title.

When `TORCH_LOGS="+inductor"` is enabled, you get logs at the end such as

stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('benchmarking.TritonBenchmarker.benchmark_gpu', 2), **(('aten_addmm', (16, 6, 16)), 1)**, ('extern_calls', 1), ('async_compile_cache_miss', 1)]
graph_break []
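
A counter keyed by operator name and shape tuple produces entries of the same form as the highlighted one. The sketch below is illustrative only: `record_mm` is a hypothetical helper, the second op name is made up, and the meaning of the three shape entries is not stated in this message, so the tuple is treated as opaque.

```python
from collections import Counter

# Hypothetical stand-in for the inductor counters; not the real logging code.
mm_counts = Counter()


def record_mm(op_name, shape):
    """Count one call of `op_name` with the given shape tuple."""
    mm_counts[(op_name, tuple(shape))] += 1


record_mm("aten_addmm", (16, 6, 16))
record_mm("aten_mm", (32, 32, 8))
record_mm("aten_mm", (32, 32, 8))

print(list(mm_counts.items()))
```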

Test Plan: follow up to add proper logging test.

Differential Revision: D70665104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148623
Approved by: https://github.com/henrylhtsang
2025-03-06 16:20:04 +00:00
b160dda743 cpp_wrapper: reduce memory usage by removing unneeded temporaries (#147403)
This PR contains a set of interrelated changes, listed below, with the upshot that compiled model memory usage in `cpp_wrapper` mode is now roughly equivalent to the default inductor mode.

Changes:

1. Refactor `reinterpret_view` calls in `cpp_wrapper` to always return a temporary RAII tensor object, rather than saving off a "temporary" tensor handle that persisted through the end of the function. This matches the behavior of the base Python wrapper class, and is responsible for the majority of the memory usage reductions.
2. Eliminate nearly all other cases where a "temporary" tensor handle was saved off (with the exception of one or two places where the tensor would immediately be destroyed by going out-of-scope). This necessitated some ugly-looking code to handle `Optional[Tensor]` and `Optional[Sequence[Any]]`, since `Optional` is passed by pointer into the C-shim functions (making passing temporary objects difficult). This code is justified by the fact that it only appears in controlled circumstances that we auto-generate, so there are minimal user-facing footguns.
3. Delete the list containing the input tensors to the `cpp_wrapper` main function after casting them to `AtenTensorHandle` objects, which have an internal reference count keeping them alive.

The [TorchInductor benchmark](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Sat%2C%2015%20Feb%202025%2018%3A38%3A08%20GMT&stopTime=Sat%2C%2022%20Feb%202025%2018%3A38%3A08%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/73/head&lCommit=4d5edaf67e80ca9ca36d301af1ded13967a04790&rBranch=main&rCommit=e1bf892d9004a4dba0748d0eda5c3b4eced0ea70) I ran shows the increased memory compression.

Differential Revision: [D70648897](https://our.internmc.facebook.com/intern/diff/D70648897)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147403
Approved by: https://github.com/desertfire
2025-03-06 16:08:16 +00:00
5fb0f45d3b [triton 3.3] test_triton_kernel_constants fix (#148626)
Thanks @FindHao who did the initial version of this PR: https://github.com/pytorch/pytorch/pull/148505

TL;DR is that https://github.com/triton-lang/triton/pull/5961 deprecates `tl.constexpr` annotations - you're supposed to wrap the constexpr value in `tl.constexpr()` instead.

This just updates the tests to wrap with `tl.constexpr()` (and leaves the annotations - that way the old triton versions will still pass).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148626
Approved by: https://github.com/FindHao
2025-03-06 14:18:21 +00:00
d5184901c4 Make torch.serialization.skip_data work with torch.load (#148018)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148018
Approved by: https://github.com/albanD
ghstack dependencies: #147786, #147787, #147788
2025-03-06 12:04:46 +00:00
be0ceee1c3 Make record/storage alignment in torch.save configurable (#147788)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147788
Approved by: https://github.com/albanD
ghstack dependencies: #147786, #147787
2025-03-06 12:04:46 +00:00
209977e6e5 Add information about checkpoint offset to untyped storages when torch.load under FakeTensorMode (#147787)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147787
Approved by: https://github.com/albanD
ghstack dependencies: #147786
2025-03-06 12:04:39 +00:00
bdcc1b579b Allow torch.load under FakeTensorMode to load FakeTensors with correct devices (for plain Tensors) (#147786)
This only fixes _rebuild_tensor_v2 and _rebuild_tensor_v3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147786
Approved by: https://github.com/albanD
2025-03-06 12:04:32 +00:00
79aa17489c [dynamo] ctx_manager.py: replace unimplemented with unimplemented_v2 (#148570)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148570
Approved by: https://github.com/williamwen42
ghstack dependencies: #148454
2025-03-06 07:46:31 +00:00
e7bc1d1791 [ONNX] Update saved exported program in debugging report if the exporting passes run_decomposition() (#148617)
Prior to this PR, if the export passes run_decomposition(), the report still shows the exported_program from before decomposition, which makes it difficult for users to inspect the exported program that is actually translated into the ONNX graph.

The following example is what we see before this PR:

```
# PyTorch ONNX Conversion Report

```
 Obtain model graph with `torch.export.export(..., strict=False)`
 Obtain model graph with `torch.export.export(..., strict=True)`
 Obtain model graph with `torch.jit.trace`
 Decompose operators for ONNX compatibility
 Translate the graph into ONNX
 Run `onnx.checker` on the ONNX model
 Execute the model with ONNX Runtime
 Validate model output accuracy
```

## Error messages

```pytb

Traceback (most recent call last):

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 707, in _translate_fx_graph
    _handle_call_function_node_with_lowering(

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 486, in _handle_call_function_node_with_lowering
    raise _errors.DispatchError(

torch.onnx._internal.exporter._errors.DispatchError: No ONNX function found for <OpOverload(op='aten.slice', overload='Tensor')>. Failure message: No decompositions registered for the complex-valued input

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 1371, in export
    onnx_program = _exported_program_to_onnx_program(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 1007, in _exported_program_to_onnx_program
    values = _translate_fx_graph(
             ^^^^^^^^^^^^^^^^^^^^

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 733, in _translate_fx_graph
    raise _errors.ConversionError(

torch.onnx._internal.exporter._errors.ConversionError: Error when translating node %slice_1 : [num_users=1] = call_function[target=torch.ops.aten.slice.Tensor](args = (%_to_copy, 0, 0, 9223372036854775807), kwargs = {}). See the stack trace for more information.

```

## Exported program

```python
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, x: "f32[3, 4]"):
             # File: /home/titaiwang/pytorch/test_slice_complex.py:6 in forward, code: x_complex = x.to(torch.complex64)
            to: "c64[3, 4]" = torch.ops.aten.to.dtype(x, torch.complex64);  x = None

             # File: /home/titaiwang/pytorch/test_slice_complex.py:8 in forward, code: return x_complex[:, :2]
            slice_1: "c64[3, 4]" = torch.ops.aten.slice.Tensor(to, 0, 0, 9223372036854775807);  to = None
            slice_2: "c64[3, 2]" = torch.ops.aten.slice.Tensor(slice_1, 1, 0, 2);  slice_1 = None
            return (slice_2,)

Graph signature: ExportGraphSignature(input_specs=[InputSpec(kind=<InputKind.USER_INPUT: 1>, arg=TensorArgument(name='x'), target=None, persistent=None)], output_specs=[OutputSpec(kind=<OutputKind.USER_OUTPUT: 1>, arg=TensorArgument(name='slice_2'), target=None)])
Range constraints: {}

```

## Analysis

PyTorch ONNX Conversion Analysis

## Model Information

The model has 0 parameters and 0 buffers (non-trainable parameters).
Number of parameters per dtype:
```python
defaultdict(<class 'int'>, {})
```
Number of buffers per dtype:
```python
defaultdict(<class 'int'>, {})
```

Inputs:
- `x`: `TensorMetadata(shape=torch.Size([3, 4]), dtype=torch.float32, requires_grad=False, stride=(4, 1), memory_format=torch.contiguous_format, is_quantized=False, qparams={})`

Outputs:
- `slice_2`: `TensorMetadata(shape=torch.Size([3, 2]), dtype=torch.complex64, requires_grad=False, stride=(4, 1), memory_format=None, is_quantized=False, qparams={})`

The FX graph has 5 nodes in total. Number of FX nodes per op:
- `placeholder`: 1
- `call_function`: 3
- `output`: 1

Of the call_function nodes, the counts of operators used are:

- `aten.slice.Tensor`: 2
- `aten.to.dtype`: 1

## ONNX Conversion Information

The model contains operators the dispatcher could not find registered ONNX decompositions for. This may be due to missing implementations, decompositions not registered correctly, or a bug in the dispatcher.

Errors grouped by operator:

- `aten.to.dtype`:     No decompositions registered for the real-valued input. Example node: `%to : [num_users=1] = call_function[target=torch.ops.aten.to.dtype](args = (%x, torch.complex64), kwargs = {})`. All nodes: `[to]`
- `aten.slice.Tensor`:     No decompositions registered for the complex-valued input. Example node: `%slice_1 : [num_users=1] = call_function[target=torch.ops.aten.slice.Tensor](args = (%to, 0, 0, 9223372036854775807), kwargs = {})`. All nodes: `[slice_1, slice_2]`

## Decomposition comparison

Ops exist only in the ExportedProgram before decomposition: `['aten.to.dtype']`

Ops exist only in the ExportedProgram after decomposition: `['aten._to_copy.default']`

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148617
Approved by: https://github.com/justinchuby
2025-03-06 07:03:45 +00:00
ae6bb58483 Revert "[cutlass backend] Forward fix for less aligned gemm shapes (#148521)"
This reverts commit ad49cfc9f0a8a4d8881b3734edd8c33a087c8b97.

Reverted https://github.com/pytorch/pytorch/pull/148521 on behalf of https://github.com/davidberard98 due to broke lint: [GH job link](https://github.com/pytorch/pytorch/actions/runs/13690720601/job/38283359447) [HUD commit link](ad49cfc9f0) ([comment](https://github.com/pytorch/pytorch/pull/148521#issuecomment-2702980028))
2025-03-06 06:59:39 +00:00
4dc956a1d8 [Inductor][Triton] Fix test_autotune_inplace_kernel to work with newer Triton version (#148595)
For new Triton version 3.3, constexpr are included as part of the signature. Update failing test to reflect this change, additional context in https://github.com/pytorch/pytorch/pull/145051.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148595
Approved by: https://github.com/davidberard98
2025-03-06 05:37:08 +00:00
1fac47702e [Break XPU][Inductor UT] Generalize device-bias code introduced by #146866. (#148534)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148534
Approved by: https://github.com/nandesuka
2025-03-06 04:39:50 +00:00
f057206fca [ONNX] Support complex comparison when verify=True (#148619)
Previously, the comparison of complex numbers was not supported when `verify=True`.

NOTE: This PR can be extended to support more complex comparison cases if other places in the ONNX codebase need to be changed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148619
Approved by: https://github.com/justinchuby
2025-03-06 04:38:43 +00:00
8b65d522e1 refactor delayed compile to use code context (#148530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148530
Approved by: https://github.com/williamwen42
ghstack dependencies: #148509
2025-03-06 04:02:30 +00:00
ad49cfc9f0 [cutlass backend] Forward fix for less aligned gemm shapes (#148521)
Differential Revision: [D70600093](https://our.internmc.facebook.com/intern/diff/D70600093/)

1. Check if config name filtering still works.
Tested, it works

2. do we get C++ compile error
Yes, potentially we need to filter them out manually.

Here we get this.
```
static_assert(threads_minor == 0 || (TileSizeK % threads_minor == 0));
```
We need to move some assertions to gemm_template.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148521
Approved by: https://github.com/ColinPeppler
2025-03-06 03:42:55 +00:00
02e1580e39 [MPS] fix crash for mse loss with 0 numel inputs (#148608)
Fixes #148589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148608
Approved by: https://github.com/malfet
2025-03-06 03:32:34 +00:00
8728d4b815 Clear triton kernels after parent make_launcher (#148604)
Before, we were clearing the cache only after inductor compile. But inductor may not **always** compile, i.e. on AOTAutogradCache hit.

So instead, we should clear it when the future is consumed. This is a more robust fix for the issue in D69476856

Differential Revision: [D70646281](https://our.internmc.facebook.com/intern/diff/D70646281/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148604
Approved by: https://github.com/masnesral
2025-03-06 03:28:38 +00:00
cyy
1433bc1455 Remove CAFFE2_USE_EXCEPTION_PTR (#147247)
The check is for older compilers and is now always true.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147247
Approved by: https://github.com/janeyx99
2025-03-06 02:56:23 +00:00
43e1284c96 Fix empty matrix handling of addmv in inductor (#143792)
This is a resubmission of my previous PR, which I accidentally deleted. Apologies in advance for any inconvenience. The details of this PR are below.

Fix an issue where torch.addmv behaves inconsistently between torch.compile mode and eager mode. Here is the code to reproduce:

```
import torch
import numpy as np

@torch.compile
def test_optimized(input, mat, vec):
    return torch.addmv(input, mat, vec)

def test(input, mat, vec):
    return torch.addmv(input, mat, vec)

input = torch.tensor([2], dtype=torch.int32)
mat = torch.tensor(np.random.randn(0, 0), dtype=torch.int32)
vec = torch.tensor([])
origin_out = test(input, mat, vec)
optimized_out = test_optimized(input, mat, vec)
print(origin_out)  # tensor([2.])
print(optimized_out)  # tensor([])
```

According to the equation (https://pytorch.org/docs/stable/generated/torch.addmv.html), when the matrix and vector are empty, returning `[2.]` seems more reasonable to me.

Following the cpu implementation of this API:e97b97af56/aten/src/ATen/native/Blas.cpp (L62)

I added an additional branch to handle an empty matrix.
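
The intended semantics can be sketched in plain Python, independent of PyTorch. `addmv_reference` is a hypothetical helper written for illustration (it is neither the Inductor lowering nor the ATen kernel); it evaluates `beta * input + alpha * (mat @ vec)` on lists. It assumes an `n x 0` matrix rather than the `0 x 0` one in the repro, so that the output length is unambiguous: the product term is then an empty sum and the result reduces to `beta * input`.

```python
def addmv_reference(inp, mat, vec, beta=1, alpha=1):
    """Pure-Python reference for beta * inp + alpha * (mat @ vec)."""
    n = len(mat)              # number of rows, which is also the output length
    if len(inp) == 1:         # broadcast a single-element input
        inp = inp * n
    if len(inp) != n:
        raise ValueError("input is not broadcastable to the output length")
    if any(len(row) != len(vec) for row in mat):
        raise ValueError("mat columns must match vec length")
    out = []
    for i, row in enumerate(mat):
        dot = sum(a * b for a, b in zip(row, vec))  # an empty row sums to 0
        out.append(beta * inp[i] + alpha * dot)
    return out


# A 2x0 matrix times an empty vector contributes nothing to the result.
print(addmv_reference([2, 3], [[], []], []))            # [2, 3]
print(addmv_reference([1], [[1, 2], [3, 4]], [1, 1]))   # [4, 8]
```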

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143792
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-06 02:09:27 +00:00
38b3375a81 [MTIA] Use "ieee" instead of "tf32" for MTIA's default precision in FlexAttention (#148565)
Summary: MTIA supports ieee but not tf32, so we set the default precision of MTIA to ieee similar to how it's done for AMD.

Test Plan: CI

Reviewed By: mortzur

Differential Revision: D70072064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148565
Approved by: https://github.com/mortzur
2025-03-06 02:07:18 +00:00
32715a2311 [inductor][ck] add kBatch_sweep to config.rocm (#148223)
Summary:
# Why

enable testing and users to specify a set of kBatches to try rather than relying on our hand written heuristic

# What

add rocm.kBatch_sweep as a list of kBatches to try out. These will generate a product of CK instances, one per kBatch for each existing op, though they are often filtered out if they are likely to fail at runtime
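
As a rough illustration of the "product of CK instances" described above, the sketch below pairs every op with every kBatch value and then filters the combinations. The function name `expand_instances`, the instance names, and the filter are all hypothetical; the real generation and filtering logic lives in Inductor's CK backend and may look quite different.

```python
from itertools import product


def expand_instances(ops, kbatch_sweep, is_viable=lambda op, k: True):
    """Return (op, kBatch) pairs, dropping combinations the filter rejects."""
    return [(op, k) for op, k in product(ops, kbatch_sweep) if is_viable(op, k)]


ops = ["gemm_instance_a", "gemm_instance_b"]  # placeholder instance names
sweep = [1, 2, 4]                             # values one might put in kBatch_sweep

print(len(expand_instances(ops, sweep)))      # 6

# Hypothetical filter: pretend instance_b cannot split K four ways.
kept = expand_instances(ops, sweep, lambda op, k: not (op.endswith("_b") and k == 4))
print(len(kept))                              # 5
```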

Test Plan: n/a

Reviewed By: chenyang78

Differential Revision: D70226055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148223
Approved by: https://github.com/ColinPeppler
2025-03-06 01:14:33 +00:00
63fbc738dc [Easy/Profiler] Add last entry to truncated values (#148576)
Summary: Since the ranks of a PG are usually in a consecutive range it is useful to print the last values when truncating metadata
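
A minimal sketch of the idea, assuming a simple list-based representation: `truncate_keep_last` is a hypothetical helper, and the `"..."` marker is an illustrative choice rather than the profiler's actual output format.

```python
def truncate_keep_last(values, max_len):
    """Keep the first max_len items, then an ellipsis marker and the final item."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    if len(values) <= max_len:
        return list(values)
    return list(values[:max_len]) + ["...", values[-1]]


ranks = list(range(8))                 # a consecutive range of PG ranks
print(truncate_keep_last(ranks, 2))    # [0, 1, '...', 7]
print(truncate_keep_last(ranks, 16))   # unchanged: nothing to truncate
```

With a consecutive range, seeing both ends is usually enough to reconstruct the whole list.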

Test Plan:
Manually changed truncate length to 2 and ran 4 gpu graph to get the following trace:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu003.rva5.facebook.com/rank-1.Mar_05_09_48_21.1280355.pt.trace.json.gz&bucket=gpu_traces

Differential Revision: D70637461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148576
Approved by: https://github.com/davidberard98
2025-03-06 01:14:15 +00:00
23441492f6 [scan] Refactoring of input checking and dynamo invocation (#142125)
This PR does a refactoring of the way dynamo is invoked and how the input shapes are checked for scan and for associative_scan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142125
Approved by: https://github.com/ydwu4
2025-03-06 01:06:54 +00:00
6cc3e69103 [inductor] use eager stride for custom op if no tags (#148367)
Fix https://github.com/pytorch/pytorch/issues/148356

This is some sort of short term fix to recover the default behavior to apply layout constraint for custom ops when there are no tags.

A longer term attempt to make sure Inductor always gets correct eager strides is here: https://github.com/pytorch/pytorch/pull/148104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148367
Approved by: https://github.com/eellison, https://github.com/zou3519
2025-03-06 00:58:00 +00:00
703176e538 [ROCm] Fix sort for non-standard bool (#147459)
When converting from uint8 to bool using `view` op, we get a bool that has 0 for false and a non-zero value for true. However, these kinds of bool have undefined behavior. We only read the last bit as 0 or 1 to convert to false or true.

In this fix, we convert bools to uint8, which will convert false to 0 and non-zero value to 1. Essentially, converting non-standard bool to a standard bool and fixing the sort op for non-standard bool.
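
The difference between the two interpretations can be sketched on raw byte values. Both helpers are hypothetical and only model the behavior described above; they are not the ROCm kernel code.

```python
def last_bit_as_bool(byte):
    """Model of reading only the lowest bit of a non-standard bool."""
    return bool(byte & 1)


def normalize_bool_byte(byte):
    """Model of the fix: any non-zero byte becomes 1, zero stays 0."""
    return 1 if byte != 0 else 0


raw = [0, 1, 2, 255, 4]  # bytes reinterpreted as bool through a view

print([last_bit_as_bool(b) for b in raw])            # [False, True, False, True, False]
print([normalize_bool_byte(b) for b in raw])         # [0, 1, 1, 1, 1]
print(sorted(normalize_bool_byte(b) for b in raw))   # [0, 1, 1, 1, 1]
```

The even non-zero bytes (2 and 4) are the problematic ones: they are truthy, but their lowest bit is 0.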

Fixes #139972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147459
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-03-06 00:23:02 +00:00
690fc2c876 Add aot_eager_then_compile stance (#148509)
Sometimes `eager_then_compile` stance isn't enough since some models are so close to the memory limit that going to eager will OOM since we don't get the memory reductions from activation checkpointing. This PR introduces `aot_eager_then_compile` which avoids the expensive inductor compile, but still does aot_eager to get the benefits of memory reduction in the first invocation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148509
Approved by: https://github.com/williamwen42
2025-03-05 23:23:45 +00:00
d6d670ab4d [AOTI] build CPU CPP kernels at O3, and all other code at O1 (#148587)
In the future, we may also want to add LTO linking to further optimize the results (while still hopefully netting compile time benefits).

Differential Revision: [D70641543](https://our.internmc.facebook.com/intern/diff/D70641543)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148587
Approved by: https://github.com/desertfire
2025-03-05 22:47:46 +00:00
897fd9b514 Revert "Subprocess compile (#146134)"
This reverts commit 07f876e9602ec6881df2360ab4817e129b563b7c.

Reverted https://github.com/pytorch/pytorch/pull/146134 on behalf of https://github.com/malfet due to looks like it broke slow jobs, see e1dee4ccb3/3 ([comment](https://github.com/pytorch/pytorch/pull/146134#issuecomment-2702239123))
2025-03-05 22:41:19 +00:00
e1dee4ccb3 [ONNX] Assert capture strategy in tests (#148348)
Previously, the strategy used for obtaining the exported program was not asserted. This led to silent errors if torch.export broke something and a fallback strategy was used. This change adds a _capture_strategy field to ONNXProgram and enables unit tests to assert the strategy used, to prevent fallbacks from happening.

Fixes #147674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148348
Approved by: https://github.com/titaiwangms, https://github.com/shubhambhokare1
2025-03-05 22:31:54 +00:00
5ccd659c0e Fix decomp for linspace (#147997)
In python decompositions, we shouldn't do any non-functional operations for functional operators. This should go away once we start decomposing before functionalization.

Differential Revision: [D70265200](https://our.internmc.facebook.com/intern/diff/D70265200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147997
Approved by: https://github.com/zou3519
2025-03-05 22:10:08 +00:00
9e755a1c03 [ROCm] add gfx12 to nightly wheels (#148562)
Adds gfx1200 and gfx1201 to PYTORCH_ROCM_ARCH for wheels and libtorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148562
Approved by: https://github.com/jeffdaily
2025-03-05 21:56:22 +00:00
2a639ce1d7 Add new hf storage class to torch.distributed package (#148361)
Summary:
title - Add new hf storage class to torch.distributed package so that it can be imported by customers.
The HF storage reader/writer was added as DCP storage components so that DCP load and save can directly interact with hugging face format and storage.

Test Plan: ensure signals pass

Differential Revision: D70495399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148361
Approved by: https://github.com/MeetVadakkanchery
2025-03-05 21:52:06 +00:00
10354e146f Re-enable test_torchinductor:test_buffer_batch_norm (#148573)
Summary: Per https://github.com/pytorch/pytorch/issues/128198 seems like this is working now
Fixes #128198

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148573
Approved by: https://github.com/StrongerXi
2025-03-05 21:51:24 +00:00
87bd3471ff [c10d] Move record param for init to the right place (#148571)
The place we do the log of init does not look correct. We move it to the beginning of comm init.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148571
Approved by: https://github.com/kwen2501
2025-03-05 21:43:30 +00:00
ad9a10aff0 [dynamo] Make nonstrict_trace work with some pytree.register_constant-ed instances (#148007)
As title, this enables a `nonstrict_trace`-ed function to take in objects
whose type has been `pytree.register_constant`-ed, as long as the object
existed outside the `torch.compile` region. This also forces Dynamo to
emit an `EQUALS_MATCH` guard on the object.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148007
Approved by: https://github.com/zou3519
ghstack dependencies: #148385
2025-03-05 21:28:26 +00:00
a10f577ee0 [dynamo] Account for function id reuse in relevant Dynamo decorators (#148385)
This fixes a recent series of flaky failures from `nonstrict_trace` unit
tests: #148166, #148056, #148055, #148054, #148034, #148033, #148032, #148031.

For now we don't need to worry about the other decorators because they
are either meant for builtin/numpy functions (which should never
deallocate in practice), or used for polyfills, which keep the function
object in `get_torch_obj_rule_map()`.

Fixes #147777.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148385
Approved by: https://github.com/zou3519
2025-03-05 21:28:26 +00:00
4aeca28137 [cutlass backend] fix assertion that prevent self multiplication (#148233)
# Problem:
In a matmul, sometimes some of the nodes are the same. Say `A @ A`. In that case, when writing the stride of node B, we have to figure out if we want lda or ldb, which points to the same node, and we have no way to differentiate which one.

# Solution
Just use whichever. Since they are the same.

# Question
What if we compile with `A @ A`, and then pass in `A @ B`? Well inductor guards will raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148233
Approved by: https://github.com/ColinPeppler
2025-03-05 21:26:22 +00:00
ed9624ee60 [export] Fix AttrProxy slicing (#148507)
Fixes https://fb.workplace.com/groups/1028545332188949/permalink/1159599265750221/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148507
Approved by: https://github.com/zhxchen17
2025-03-05 21:03:15 +00:00
dd6ec8706e [BE] Relax sympy dependency to 1.13.3 or newer (#148575)
Fixes https://github.com/pytorch/pytorch/issues/145225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148575
Approved by: https://github.com/ZainRizvi, https://github.com/atalman
2025-03-05 20:51:16 +00:00
9efa9c73f6 [Dyamo] Replace unimplemented with unimplemented_v2 for variables/distributed (#148500)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148500
Approved by: https://github.com/williamwen42
2025-03-05 20:41:43 +00:00
98458e5c81 Add a docstring to build.sh (#144566)
Add a little blurb to explain what build.sh is doing.

Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>
2025-03-05 15:26:37 -05:00
c6a05df174 [ONNX] Use onnxscript apis for 2.7 (#148453)
Use onnxscript apis for 2.7.

Remove reference to `torchlib_opset()` and `torchlib_opset_version()` which were removed in the onnxscript 2.7 apis. These apis were removed because torchlib in onnxscript will always stay on opset 18. Future opset version bumps will happen in pytorch core after the migration of torchlib.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148453
Approved by: https://github.com/titaiwangms, https://github.com/shubhambhokare1
2025-03-05 20:10:00 +00:00
c9edd37ffb Revert "[dtensor] add aten._scaled_dot_product_cudnn_attention.default op support (#148377)"
This reverts commit 9eef457c0241f87097a2ca7625f9961e31f3adcd.

Reverted https://github.com/pytorch/pytorch/pull/148377 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/13683650448/job/38261818684) [HUD commit link](9eef457c02) probably landrace ([comment](https://github.com/pytorch/pytorch/pull/148377#issuecomment-2701903810))
2025-03-05 19:45:16 +00:00
c5d92edd5a [dynamo] WeakRefVar reconstruct (#148083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148083
Approved by: https://github.com/anijain2305
2025-03-05 19:34:17 +00:00
50e827b3df [ONNX] Create VerificationInterpreter (#148396)
An fx interpreter for comparing ONNX values with pytorch ones.

```py
import torch
from torch.onnx._internal.exporter._verification import VerificationInterpreter

class Model(torch.nn.Module):
    def forward(self, query, key, value):
        res = torch.nn.functional.scaled_dot_product_attention(
            query, key, value
        )
        rest = res.transpose(0, 1)
        return rest.view(8, 32, 128 * 64)

model = Model()

query = torch.rand(32, 8, 128, 64, dtype=torch.float16)
key = torch.rand(32, 8, 128, 64, dtype=torch.float16)
value = torch.rand(32, 8, 128, 64, dtype=torch.float16)

onnx_program = torch.onnx.export(model, (query, key, value), dynamo=True)
interpreter = VerificationInterpreter(onnx_program)
interpreter.run(query, key, value)
for info in interpreter.verification_infos:
    print(info)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148396
Approved by: https://github.com/titaiwangms
2025-03-05 19:18:52 +00:00
8af79b7ec8 [ROCm] Bump AOTriton to 0.9.1b (#148433)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b:

* Optimize these Non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions do not need padding to power-of-two anymore.
* `is_causal=True` cases are now supported with persistent dynamic algorithm, which requires an atomic tensor but does load balance between different CTAs
* `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs
* The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so`
  + However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten.
* The newly added fused backward kernel will be used for smaller workloads, due to less kernel invocation overhead.
* Support gfx1201 (RX 9070XT). Need to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`
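
To give a sense of what the first bullet saves, the sketch below computes the power-of-two width each listed head dimension would have been padded to. It assumes padding rounds up to the next power of two, as the bullet implies; `next_power_of_two` is a hypothetical helper, not an AOTriton API.

```python
def next_power_of_two(n):
    """Smallest power of two that is greater than or equal to n."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n - 1).bit_length()


for head_dim in (48, 80, 96, 160, 192, 224):
    padded = next_power_of_two(head_dim)
    wasted = (padded - head_dim) / padded
    print(f"head_dim={head_dim:>3} padded={padded:>3} wasted={wasted:.0%}")
```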

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433
Approved by: https://github.com/jeffdaily
2025-03-05 19:11:57 +00:00
9eef457c02 [dtensor] add aten._scaled_dot_product_cudnn_attention.default op support (#148377)
### Summary
This PR adds `_scaled_dot_product_cudnn_attention` to DTensor ops and tests it with unit test. This should allow Context Parallel and Tensor Parallel to use cudnn SDPA.

### Test
`pytest test/distributed/tensor/test_attention.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148377
Approved by: https://github.com/drisspg
2025-03-05 19:09:52 +00:00
9dd46a9233 Deprecate sm70 for cuda 12.8 binary (#147607)
follow up for https://github.com/pytorch/pytorch/pull/146265/files, dropping sm_70 as well, since "Architecture support for Maxwell, Pascal, and Volta is considered feature-complete and will be frozen in an upcoming release."

https://github.com/pytorch/pytorch/issues/145570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147607
Approved by: https://github.com/atalman
2025-03-05 18:54:17 +00:00
3f4311d589 [CD] Upgrade xpu runtime pypi packages version and enable windows kineto again (#148319)
Fixes https://github.com/pytorch/pytorch/issues/145155

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148319
Approved by: https://github.com/xuhancn, https://github.com/atalman
2025-03-05 18:39:55 +00:00
9db9593bba Add some more meta kernels (#147862)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147862
Approved by: https://github.com/zou3519
2025-03-05 18:33:00 +00:00
e555c4d8ae Fix bug in AOTI lowering (#148364)
Fixes: https://github.com/pytorch/pytorch/issues/148370

Differential Revision: [D70514480](https://our.internmc.facebook.com/intern/diff/D70514480)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148364
Approved by: https://github.com/desertfire
2025-03-05 18:27:15 +00:00
38479e495e Add note to get start xpu (#148168)
Installing PyTorch from binaries will automatically install the runtime packages of Intel® Deep Learning Essentials. In this case, if we activate oneAPI in a standalone installation of Intel® Deep Learning Essentials, there will be an environment issue. Therefore, add a note to remind users to avoid this situation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148168
Approved by: https://github.com/janeyx99

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-05 18:11:14 +00:00
c65ee728f0 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.

This change tries very hard not to disrupt the initial design of the allocator, and it uses existing locking mechanism, whenever possible, to gather statistics "for free". Only deviation from that is on the "slow path" where we incur CUDA calls anyway, so taking a short lock is not going to hurt the performance much, especially in the steady state where most allocations will come from cache.

As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry.
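
For readers unfamiliar with this style of counter, here is a minimal Python sketch of a current/peak/allocated/freed statistic. The field and method names are assumptions modeled loosely on the description above; the real structure is C++ and may differ in detail.

```python
from dataclasses import dataclass


@dataclass
class Stat:
    current: int = 0    # bytes (or blocks) live right now
    peak: int = 0       # highest value `current` has reached
    allocated: int = 0  # running total ever allocated
    freed: int = 0      # running total ever freed

    def increase(self, amount):
        self.current += amount
        self.peak = max(self.peak, self.current)
        self.allocated += amount

    def decrease(self, amount):
        self.current -= amount
        self.freed += amount


pinned = Stat()
pinned.increase(4096)
pinned.increase(8192)
pinned.decrease(4096)
print(pinned)
```

Each update is a handful of integer operations, which is why such counters are cheap to maintain under a lock that is already held.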

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-03-05 16:13:19 +00:00
70c5edb697 [ROCm] fix CK compile for gfx1200 (#148496)
gfx1200 causes the CK-based GEMM to fail to compile because CK is choosing an incorrect FP8 interpretation.  CK assumes FP8 interpretation is static and chosen prior to compilation.  This PR is a work-around that makes the selection dynamic during hipclang compilation passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148496
Approved by: https://github.com/jeffdaily
2025-03-05 16:11:03 +00:00
864b75dd50 [MPS] Fix unary_kernel_strided logic (#148512)
Fixes bug introduced by https://github.com/pytorch/pytorch/pull/148350
Before this change
```
% python3 -c "import torch; x, y = torch.arange(128.0, device='mps').reshape(2, 8, 8).unbind(0); print(torch.sqrt(x[::2, ::2], out=y[::2, ::2]))"
tensor([[  0.0000,   1.4142,   2.0000,   2.4495],
        [ 80.0000,  82.0000,  84.0000,  86.0000],
        [ 96.0000,  98.0000, 100.0000, 102.0000],
        [112.0000, 114.0000, 116.0000, 118.0000]], device='mps:0')
```
After this change
```
% python3 -c "import torch; x, y = torch.arange(128.0, device='mps').reshape(2, 8, 8).unbind(0); print(torch.sqrt(x[::2, ::2], out=y[::2, ::2]))"
tensor([[0.0000, 1.4142, 2.0000, 2.4495],
        [4.0000, 4.2426, 4.4721, 4.6904],
        [5.6569, 5.8310, 6.0000, 6.1644],
        [6.9282, 7.0711, 7.2111, 7.3485]], device='mps:0')
```
Matching strides on the input and output tensors are not enough to avoid copies: one also needs to make sure that both are dense-in-storage (a transposed tensor would be dense, but, say, a view selecting only the odd or only the even columns would not be).
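
The dense-in-storage condition can be sketched in plain Python over sizes and strides. `is_dense_in_storage` is a hypothetical helper for illustration, not the check used in the MPS kernel: it sorts the dimensions by stride and verifies that each stride equals the number of elements spanned by the faster-moving dimensions.

```python
def is_dense_in_storage(sizes, strides):
    """True if the elements occupy a gap-free block, in any dimension order."""
    if any(size == 0 for size in sizes):
        return True  # an empty tensor has no gaps
    # Size-1 dimensions never affect addressing, so ignore them.
    dims = sorted((stride, size) for size, stride in zip(sizes, strides) if size != 1)
    expected = 1
    for stride, size in dims:
        if stride != expected:
            return False
        expected *= size
    return True


print(is_dense_in_storage((8, 8), (8, 1)))    # contiguous: True
print(is_dense_in_storage((8, 8), (1, 8)))    # transposed: True
print(is_dense_in_storage((4, 4), (16, 2)))   # x[::2, ::2] of an 8x8 tensor: False
```

The last case matches the `x[::2, ::2]` views in the repro: input and output share the strides `(16, 2)`, yet neither is dense.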

Add regression test to prevent those from happening again

Also, no need to check that sizes match, luckily it is checked by the structured op (and `out` for unary ops does not support broadcasting, I just checked)

Revived needs_copy_logic, though it will become irrelevant after https://github.com/pytorch/pytorch/pull/148468 is landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148512
Approved by: https://github.com/janeyx99
2025-03-05 15:57:54 +00:00
8274da9312 [c10d][PGNCCL] Fix capturability of isend and irecv (#148462)
This PR fixes an issue of inability to capture `isend`/`irecv` ops in `async` mode.

<details>
<summary>The repro code</summary>

```Python
import os
import torch
import torch.distributed as dist

USE_ASYNC = True

def test_func(x, rank):
    if rank == 0:
        x += 1
        # Send the tensor to process 1
        if USE_ASYNC:
            a = dist.isend(tensor=x, dst=1)
        else:
            dist.send(tensor=x, dst=1)
    else:
        # Receive tensor from process 0
        if USE_ASYNC:
            a = dist.irecv(tensor=x, src=0)
        else:
            dist.recv(tensor=x, src=0)
    if USE_ASYNC:
        a.wait()
    return x + 2

def run(rank):
    torch.cuda.set_device(rank)
    x = torch.ones(1, device='cuda')
    with torch.cuda.stream(torch.cuda.Stream()):
        for i in range(11):
            x.copy_(torch.ones(1, device='cuda'))
            y = test_func(x, rank)
            print(f"Rank{rank} has data {y} in warmup")
    torch.cuda.synchronize()
    graph = torch.cuda.CUDAGraph()

    x.copy_(torch.ones(1, device='cuda'))
    with torch.cuda.graph(graph):
        y = test_func(x, rank)

    for i in range(1):
        x.copy_(torch.ones(1, device='cuda'))
        graph.replay()
    print(f"Rank{rank} has data {y} after graph replay")

def main():
    rank = int(os.environ['RANK'])
    local_rank = int(os.environ['LOCAL_RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    run(local_rank)

if __name__ == "__main__":
    main()
```
</details>

Fails with an error stating that work handle is of a NoneType:
```
[rank1]: Traceback (most recent call last):
[rank1]:   File "/workspace/repro.py", line 54, in <module>
[rank1]:     main()
[rank1]:   File "/workspace/repro.py", line 51, in main
[rank1]:     run(local_rank)
[rank1]:   File "/workspace/repro.py", line 38, in run
[rank1]:     y = test_func(x, rank)
[rank1]:         ^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/repro.py", line 22, in test_func
[rank1]:     a.wait()
[rank1]:     ^^^^^^
[rank1]: AttributeError: 'NoneType' object has no attribute 'wait'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148462
Approved by: https://github.com/kwen2501
2025-03-05 15:49:53 +00:00
19a6cf35f6 add input shape check for _local_scalar_dense (#145717)
Fix https://github.com/pytorch/pytorch/issues/145066.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145717
Approved by: https://github.com/malfet
2025-03-05 15:24:08 +00:00
96afa8a2bb [TEST][SPARSE] Simplify branching in test_cusparselt_backend (#148318)
Due to introduction of CUDA versions, the branching becomes more complicated. This PR is proposed to simplify branching in `test_cusparselt_backend` in order to avoid checking each and every CUDA version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148318
Approved by: https://github.com/jcaip
2025-03-05 10:17:00 +00:00
0ef2e938d0 [ROCm] [TunableOp] Track top solutions during tuning process (#147243)
For each set of GEMM parameters that are evaluated by TunableOp, keep track of the top 5 solutions. Print the top 5 solutions when `PYTORCH_TUNABLEOP_VERBOSE=2`.
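
Keeping a running top-k during tuning can be sketched with a bounded heap. `TopSolutions`, the solution names, and the timings are hypothetical; the actual TunableOp bookkeeping is C++ and may be organized differently.

```python
import heapq


class TopSolutions:
    """Track the k fastest (name, time) pairs seen so far."""

    def __init__(self, k=5):
        self.k = k
        self._heap = []  # min-heap of (-time_ms, name); the root is the slowest kept entry

    def record(self, name, time_ms):
        item = (-time_ms, name)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, item)
        elif item > self._heap[0]:
            # Faster than the slowest kept solution, so replace it.
            heapq.heapreplace(self._heap, item)

    def best(self):
        """Return the kept solutions as (time_ms, name), fastest first."""
        return sorted((-neg_time, name) for neg_time, name in self._heap)


timings = {"sol_a": 0.9, "sol_b": 0.3, "sol_c": 0.5, "sol_d": 0.7,
           "sol_e": 0.2, "sol_f": 1.1, "sol_g": 0.4}  # made-up milliseconds

top = TopSolutions(k=5)
for name, ms in timings.items():
    top.record(name, ms)

for ms, name in top.best():
    print(f"{name}: {ms} ms")
```

The heap never holds more than k entries, so memory stays constant regardless of how many candidate solutions are evaluated.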

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147243
Approved by: https://github.com/jeffdaily
2025-03-05 09:35:02 +00:00
6c3492b491 [ROCm] Enable mi300-specific workflows to be triggered on PRs (#147904)
This change will be needed to be able to trigger the MI300-specific CI workflows on PRs by using a PR label.

* inductor-rocm-mi300.yml uses the existing `ciflow/inductor-rocm` label so that any PR manually labeled as such will trigger `inductor` config runs on both MI200 and MI300.
* rocm-mi300.yml uses a separate `ciflow/rocm-mi300` label, since we don't want to over-trigger `default` config runs on MI300 runners due to limited capacity, and [`ciflow/rocm` label is automatically applied](79438512a0/torchci/lib/bot/autoLabelBot.ts (L24)) on many PRs.
* inductor-perf-test-nightly-rocm.yml uses a separate `ciflow/inductor-perf-test-nightly-rocm` label, so that we can manually trigger a round of perf testing on MI300 runners to test the perf impact of a major inductor-related change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147904
Approved by: https://github.com/huydhn
2025-03-05 06:00:37 +00:00
2295efa1b3 Fix only logging ir_post_fusion with torch_compile_debug enabled (#148499)
Because we were invoking the logs through `V.debug`, it was not running if TORCH_COMPILE_DEBUG was not set. This is because there is some magic in the debug [getattr](d789c22712/torch/_inductor/debug.py (L468-L480)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148499
Approved by: https://github.com/shunting314
2025-03-05 05:35:09 +00:00
fb1b7ec173 Remove deprecate method and attirbute in LRScheduler (#147301)
Following the [#99270 suggestion](https://github.com/pytorch/pytorch/issues/99270#issuecomment-1511656408), remove the deprecated method `LRScheduler.print_lr`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147301
Approved by: https://github.com/janeyx99
2025-03-05 05:30:19 +00:00
df7e43e5d4 [AOTI] Fix aot_inductor_package test errors (#148279)
Summary: Fix fbcode test failures introduced by https://github.com/pytorch/pytorch/pull/147975. Make sure script.ld is copied to the build-time directory.

Differential Revision: D70454149

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148279
Approved by: https://github.com/zoranzhao
2025-03-05 05:22:48 +00:00
b020d166f2 stage 1 of depreate silent fallback of tuning gemm (#147798)
Differential Revision: [D70045778](https://our.internmc.facebook.com/intern/diff/D70045778/)

context:
https://github.com/pytorch/pytorch/issues/147479

For the most part, this should not change the behavior.

For int_mm, I also removed
```
    # TODO: Re-enable eager mode implementation once cuBLAS is fixed
    if use_cutlass or use_triton_template(layout, enable_int32=True):
        choices = []
```
because I think it is unwanted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147798
Approved by: https://github.com/eellison
2025-03-05 05:15:59 +00:00
913356fb41 Fix recent regression in evaluate_expr that effect cache lookups (#147836)
PR https://github.com/pytorch/pytorch/pull/146939/ added an argument to evaluate_expr for the purpose of logging.
This caused a regression that we thought was due to calling id on symnode.

I dug deeper and found that adding that argument, although it does not affect the results of evaluate_expr, messes up the cache lookups.
I refactored the code to avoid using expr_sym_node_id in the cache lookup. I also introduced evaluate_sym_node and simplified the calls to evaluate_expr.
#suppress-bc-linter
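
The effect on cache lookups can be reproduced with `functools.lru_cache`: an extra argument that never changes the result still becomes part of the cache key. The functions below are stand-ins written for illustration, not the actual `evaluate_expr` implementation or its caching decorator.

```python
from functools import lru_cache

calls = {"with_id": 0, "without_id": 0}


@lru_cache(maxsize=None)
def evaluate_with_id(expr, sym_node_id):
    # sym_node_id is only meant for logging, yet it is part of the cache key.
    calls["with_id"] += 1
    return expr == "s0 > 1"  # stand-in for the real evaluation


@lru_cache(maxsize=None)
def _evaluate_cached(expr):
    calls["without_id"] += 1
    return expr == "s0 > 1"


def evaluate_without_id(expr, sym_node_id=None):
    # The id can still be logged here, but it never reaches the cached function.
    return _evaluate_cached(expr)


for node_id in (101, 102, 103):
    evaluate_with_id("s0 > 1", node_id)
    evaluate_without_id("s0 > 1", node_id)

print(calls)  # {'with_id': 3, 'without_id': 1}
```

Three calls with the same expression miss the cache every time in the first variant and hit it twice in the second.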

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147836
Approved by: https://github.com/oulgen
2025-03-05 04:11:41 +00:00
ed8ec0cb98 [cutlass backend][BE] Fix two small things in cutlass backend standalone debugger (#148493)
Differential Revision: [D70583777](https://our.internmc.facebook.com/intern/diff/D70583777/)

Two really small things:
* The bits in BlockFillRandomUniform would round float to ints
* When bias exists, the order of args is C, A, B, D

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148493
Approved by: https://github.com/chenyang78
2025-03-05 04:01:36 +00:00
e0ea593974 [CD] Upgrade Windows xpu support package to 2025.0.1 for binary compression (#148313)
The binary compression feature can reduce the size of the Torch XPU Windows wheel packages

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148313
Approved by: https://github.com/atalman
2025-03-05 03:00:27 +00:00
1673bc7610 [mm_logs][ez] dump tuned mm info at lowering stage (#148363)
Summary:
As title. It would be beneficial for judging e2e perf improvement

Easy first step to dump mm info at lowering stage.

e.g.

```
fbsource/fbcode/caffe2/torch/_inductor/kernel/mm.py:525] [0/0] Tuned aten.addmm: m=16, n=6, k=16, layout=FixedLayout('cuda:0', torch.float32, size=[16, 6], stride=[6, 1])
```

Next step:

Dump overview info at `post_grad_graph` stage such as
overall count of `aten.mm` in the graph & visualize to a table structure.

Test Plan: by looking very hard in aot inductor bmm and mm UTs.

Differential Revision: D70507880

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148363
Approved by: https://github.com/henrylhtsang
2025-03-05 02:21:27 +00:00
edc3ca577e [Profiler] Add profiler activity for HPU devices (#148182)
Fixes #148181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148182
Approved by: https://github.com/sraikund16
2025-03-05 01:37:48 +00:00
3985ce0b88 [dynamo] rename test_graph_break_messages -> test_error_messages (#148220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148220
Approved by: https://github.com/zou3519, https://github.com/jansel
ghstack dependencies: #148205
2025-03-05 01:16:53 +00:00
b28cbe5db3 [dynamo] remove internal stack trace for fullgraph=True graph breaks (#148205)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148205
Approved by: https://github.com/zou3519
2025-03-05 01:16:53 +00:00
2927a64357 [inductor][cpu] Fix error with FlexibleLayout weights in BMM (#148188)
Fixes #148074

When node A is reshaped (is a `ReinterpretView`) and node B has a `FlexibleLayout`, then the layout of node B *may* be changed during the `kernel.select(options["W"], 0, self.b_index)` call, which could cause the assertion in `kernel.select` to fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148188
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2025-03-05 01:05:05 +00:00
713a504a82 [dynamo][guards] Fix mem leak caused be refcount increment (#148480)
Should help [internalfb.com/sevmanager/view/491701](https://www.internalfb.com/sevmanager/view/491701)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148480
Approved by: https://github.com/xmfan, https://github.com/StrongerXi, https://github.com/williamwen42, https://github.com/zou3519
2025-03-05 01:04:08 +00:00
b5873292c6 Add overload names to profiler trace (#143114)
Currently, recorded profiler events for aten ops do not store overload names. It would be useful to know which overloads are actually called to analyse performance.
For example, consider the following dispatch trace which occurs if there is a fallthrough kernel registered for aten::add:
```
             [call] op=[aten::add.Tensor], key=[AutogradCPU]
               [redispatch] op=[aten::add.Tensor], key=[Undefined]
                 [call] op=[aten::empty.memory_format], key=[BackendSelect]
                   [redispatch] op=[aten::empty.memory_format], key=[CPU]
                 [call] op=[aten::add.out], key=[CPU]
```

In this case, aten::add.out is a child of aten::add.Tensor, however the current profiler trace provides no way to differentiate aten op calls.
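
Once overload names are recorded, a consumer of the trace can separate the base op from its overload. The sketch below uses the op names from the dispatch trace above; `split_overload` is a hypothetical helper, not part of the profiler API.

```python
def split_overload(qualified_name):
    """Split 'aten::add.Tensor' into ('aten::add', 'Tensor')."""
    base, _, overload = qualified_name.partition(".")
    return base, overload  # overload is '' when no overload name is present


trace = ["aten::add.Tensor", "aten::empty.memory_format", "aten::add.out"]

by_name = {}
for op in trace:
    base, overload = split_overload(op)
    by_name.setdefault(base, []).append(overload)

print(by_name)  # {'aten::add': ['Tensor', 'out'], 'aten::empty': ['memory_format']}
```

Without the overload, both `aten::add` entries would collapse into one indistinguishable name.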

See the added unit test for a more detailed example.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143114
Approved by: https://github.com/sraikund16
2025-03-05 01:00:29 +00:00
cf5e3f3cea Add cutlass kernel for rowwise scaled mm on sm100 (#148421)
### Important
- Previous PR in stack https://github.com/pytorch/pytorch/pull/148274
- Despite the changes between sm90 vs sm100 being fairly minimal, I created a separate kernel since we'll be making various arch specific perf optimizations to the sm100 kernel next.
- This kernel has not been optimized yet. However, initial perf testing shows numbers which indicate the tensorcores are being utilized as expected (not just CUDA cores).

### Summary of changes
- This PR adds a new cutlass kernel for rowwise GEMM on sm100.
- sm100 kernel is based on sm90 kernel, with the following changes:
  - Use new arch tag `cutlass::arch::Sm100`
  - Do not use [large tile](4eb0c45297/aten/src/ATen/native/cuda/RowwiseScaledMM.cu (L203)) schedule in CollectiveMainLoop or CollectiveEpilogue (causes build errors)
- SM90 vs SM100 kernel diff: https://www.diffchecker.com/ZCAPaFAg/

### Next steps
- Arch specific performance optimization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148421
Approved by: https://github.com/drisspg
2025-03-05 00:46:01 +00:00
a907b6abae [compiled_autograd] workaround windows compilation issue (#148454)
torch.compile doesn't work on windows so we can ifdef-away the problem.
I do not know what the root cause actually is. Most notably, the pytorch
windows build is fine, but some third-party projects that use pytorch headers
on windows (e.g. torchaudio) have issues.

Test Plan:
- wait for CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148454
Approved by: https://github.com/atalman, https://github.com/xmfan
2025-03-05 00:18:20 +00:00
e02a2ca07a Fix dist.init_process_group on windows (#148266)
Fix https://github.com/pytorch/pytorch/issues/139990

We don't build libuv on windows so anything that creates `TCPStore` which includes `init_process_group()` will fail, which is a bad experience. We should just default to `USE_LIBUV=0` for windows. There were a decent amount of hits for this [error on google ](https://www.google.com/search?q=use_libuv+was+requested+but+PyTorch+was+build+without+libuv+support&sca_esv=921f59ac5f8bd98a&sxsrf=AHTn8zpG3PxdKoomFHkclOc451rBhoc3jw%3A1740854890873&source=hp&ei=albDZ5GHM-uIptQP4NTikQw&iflsig=ACkRmUkAAAAAZ8Nkei9H-aB2IBCk3pUOK3yFl5xBLZUt&ved=0ahUKEwiR5P7qxemLAxVrhIkEHWCqOMIQ4dUDCBg&uact=5&oq=use_libuv+was+requested+but+PyTorch+was+build+without+libuv+support&gs_lp=Egdnd3Mtd2l6IkN1c2VfbGlidXYgd2FzIHJlcXVlc3RlZCBidXQgUHlUb3JjaCB3YXMgYnVpbGQgd2l0aG91dCBsaWJ1diBzdXBwb3J0SABQAFgAcAB4AJABAJgBAKABAKoBALgBA8gBAPgBAvgBAZgCAKACAJgDAJIHAKAHAA&sclient=gws-wiz) and https://github.com/pytorch/pytorch/issues/139579, so I figured we should add a more helpful message as well.

We don't have CI for windows and our support is just best effort, so I just tested these changes on my windows machine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148266
Approved by: https://github.com/d4l3k
2025-03-05 00:07:56 +00:00
84b58bd63e Enable FSDP tests on XPU device (#147518)
**Motivation:**

Enable FSDP tests on XPU device

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147518
Approved by: https://github.com/weifengpy
2025-03-04 23:49:37 +00:00
c98c3af421 Add a couple config options to compiler bisector (#148450)
These are common sources of bugs/divergence (through bad interactions etc.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148450
Approved by: https://github.com/shunting314
2025-03-04 23:23:21 +00:00
0c0a4baddd [MPS] unary kernels - avoid copying tensors if they have same stride (#148350)
I was a bit concerned when I saw in #148272 that the Metal unary kernel ran at 0.02x the performance of what we had with MPS Graphs for sqrt (on non-contiguous tensors). This change makes it so that copying is only done if the input and output tensors don't have the same strides. So if an out tensor is not provided, we don't copy (don't call contiguous) at all and dispatch the kernel as is. After making this change, the script that I listed at the end of the above PR has the same execution time as the non-transposed one.

Times for reference (on a transposed tensor, where the matrix is NxN):

| N     | time_old           | time_new           |
|-------|--------------------|--------------------|
| 100   | 0.0002241021       | 0.0001548659       |
| 1000  | 0.0005934822       | 0.0002150342       |
| 10000 | 0.3242016407       | 0.0045755033       |
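
The copy decision described above can be sketched in plain Python. This is only an illustration of the rule "copy only when input and output strides differ"; `contiguous_strides` and `needs_copy` are hypothetical helpers, not functions from the MPS backend.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a dense tensor of the given shape."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def needs_copy(input_strides, output_strides):
    """Copy only when input and output are laid out differently."""
    return tuple(input_strides) != tuple(output_strides)


n = 4
dense = contiguous_strides((n, n))  # (4, 1)
transposed = dense[::-1]            # (1, 4): strides of the transposed view

# Assumption: with no `out` given, the output shares the input's strides, so no copy.
print(needs_copy(transposed, transposed))  # False
# A dense `out` with a transposed input: layouts differ, so a copy is needed.
print(needs_copy(transposed, dense))       # True
```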

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148350
Approved by: https://github.com/janeyx99
2025-03-04 23:20:26 +00:00
ade4af8c95 [MPS][BE] Fix c10::metal::sinc implementation (#148471)
Restrict scalar implementation to `is_scalar_floating_point_v` types, but perform all internal computations in full 32-bit floats. Make complex implementation a template for `is_complex_v` types
This makes its eager kernel implementation for both real and complex type a trivial call to the template
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148471
Approved by: https://github.com/dcci
ghstack dependencies: #148398, #148399, #148448, #148449
2025-03-04 23:14:03 +00:00
93e9daed54 [cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178)
Disabled by default for now behind `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1`

Just wanted to get this out before starting a series of SDPA cleanup PRs---the biggest thing is we don't need the boilerplate around all of the `build_graph_and_tensors*` functions anymore as we can now use the `UID`-style referencing of tensor nodes as was done for the Conv-V8 API backend.

CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141178
Approved by: https://github.com/jbschlosser
2025-03-04 23:09:09 +00:00
d789c22712 Upgrade github ubuntu-20.04 runners to ubuntu-24.04 (#148469)
The github provided ubuntu-20.04 gha runners are being deprecated (https://togithub.com/actions/runner-images/issues/11101) so upgrade workflows using them to the latest runner 24.04

They are currently doing a brownout, resulting in failures like: https://github.com/pytorch/pytorch/actions/runs/13660782115
```
[do_update_viablestrict](https://github.com/pytorch/pytorch/actions/runs/13660782115/job/38192777885)
This is a scheduled Ubuntu 20.04 brownout. Ubuntu 20.04 LTS runner will be removed on 2025-04-01. For more details, see https://github.com/actions/runner-images/issues/11101
```

Should we be using ubuntu-latest instead?

I attempted to upgrade actionlint to 1.7.7 but on my local in test-infra it seems to add a lot of new checks, and on test-infra's CI, I seem to have uploaded the wrong executable or something so it failed.  I'll try again later

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148469
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-03-04 22:29:04 +00:00
5f47b7e268 [ROCm][TunableOp] Unit test for offline tuning of GEMM with bias (#148371)
One more unit test for the offline version of TunableOp.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148371
Approved by: https://github.com/jeffdaily
2025-03-04 22:24:27 +00:00
842ffea445 [MPS][BE] Towards strided unary ops support (#148449)
Add generic functors kernels and rewrite all existing implementations into functors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148449
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #148398, #148399, #148448
2025-03-04 22:22:39 +00:00
70d0e1b96a Bump onnxscript to 0.2.2 in CI (#148388)
Unblock https://github.com/pytorch/pytorch/pull/148140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148388
Approved by: https://github.com/malfet
2025-03-04 22:09:50 +00:00
c677f3251f [export] don't use unbacked_renamings in export (#147574)
Plan: avoid the use of unbacked renamings, and introduce a pass run in `_produce_aten_artifact` that recomputes unbacked bindings. Decided to do this because we don't serialize unbacked renamings (or any ShapeEnv state), so this used to compose poorly with de/serialization. This hopefully establishes the invariant that the unbacked binding keys are always in sync with the example values (i.e. same indices, and removed if the symbol is replaced / specialized).

For de/serialization, we don't store unbacked bindings, and just rerun the pass.

Involved a refactor of compute_unbacked_bindings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147574
Approved by: https://github.com/avikchaudhuri
2025-03-04 21:43:49 +00:00
84961a0c17 ci: Add workflow dispatch for commit hash update (#148486)
Maybe this should also be split into its own workflow instead of
piggybacking off of nightly?

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148486
Approved by: https://github.com/clee2000
ghstack dependencies: #148466, #148472
2025-03-04 21:26:23 +00:00
d290186ed3 ci: Add triton to update hash workflow (#148472)
Adds triton to our auto-update workflows so that PRs can be
automatically made and the triton team can follow up to fix any issues
that may arise.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148472
Approved by: https://github.com/Camyll, https://github.com/atalman
ghstack dependencies: #148466
2025-03-04 21:26:23 +00:00
9be8f74156 ci: Consolidate commit hash updates into a matrix (#148466)
Consolidates all of our commit hash update jobs into a single matrix to
make it easier to add more jobs later on.

Side note: How do I even test if this works?

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148466
Approved by: https://github.com/Camyll, https://github.com/clee2000, https://github.com/atalman
2025-03-04 21:26:13 +00:00
d1abde11ec [dynamo] Support passing arguments to DeviceMesh.get_group (#147741)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147741
Approved by: https://github.com/StrongerXi
2025-03-04 21:19:47 +00:00
f30776c37a [BE] Upgrade to mypy 1.14 (#145966)
Upgrade mypy version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145966
Approved by: https://github.com/Skylion007
2025-03-04 20:58:26 +00:00
60205b0eb2 [export] Fix logging so that it doesn't result in max recursion error (#148231)
Test Plan:
buck2 run mode/dev-nosan sigmoid/inference/ts_migration:pt2i_readiness_main -- --model_id=487493491 --test_suite ads_all --mode test_full_model

Produces https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp2wsjQH/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

Differential Revision: D70416613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148231
Approved by: https://github.com/yiming0416
2025-03-04 20:47:25 +00:00
e4c558be1d [scan] Corrections for scan (#146110)
This PR resolves some minor issues with the scan HOP and unifies the handling of the additional_inputs in the same way as for associative_scan.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146110
Approved by: https://github.com/ydwu4
2025-03-04 20:29:08 +00:00
439395c0ae [MPS] add slogdet and logdet implementations to mps (#148287)
Low-hanging fruit: all ops needed for these are already implemented, so just adding them to native functions adds the functionality on MPS. The next op I should add is probably lu solve, seeing how many ops need it for the grad calculation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148287
Approved by: https://github.com/malfet
2025-03-04 19:49:23 +00:00
92beda54c8 Revert "[fx] Move map_aggregate to C++ (#148243)"
This reverts commit edaff88f69f069d517b72ea23fd5eb04702eb0b5.

Reverted https://github.com/pytorch/pytorch/pull/148243 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))
2025-03-04 19:40:21 +00:00
17d003fe75 Revert "[fx] Move Node._update_args_kwargs to C++ (#148260)"
This reverts commit 0135f57f4aaeaba8d720f551eab6dca6fcede8cd.

Reverted https://github.com/pytorch/pytorch/pull/148260 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))
2025-03-04 19:40:21 +00:00
97b9e68bc6 Revert "[fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)"
This reverts commit 29c2de9ae16f1673f3f44363243294d403e53d37.

Reverted https://github.com/pytorch/pytorch/pull/148261 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))
2025-03-04 19:40:21 +00:00
6fb18ff685 Revert "Better log message to update pr_time_benchmarks/expected_results.csv (#148303)"
This reverts commit a3d69e6e1a530ae2b91cd549ea26aac51ffc7566.

Reverted https://github.com/pytorch/pytorch/pull/148303 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))
2025-03-04 19:40:21 +00:00
63778cb8a0 Revert "[Inductor] Record Triton’s Base32 Cache Key in .best_config for Debugging (#147019)"
This reverts commit e3e45d90d8578083da8b51a3b1d911e9a4523e5b.

Reverted https://github.com/pytorch/pytorch/pull/147019 on behalf of https://github.com/clee2000 due to broke inductor test inductor/test_max_autotune.py::TestMaxAutotune::test_cat_max_autotune_extern [GH job link](https://github.com/pytorch/pytorch/actions/runs/13653495421/job/38171259603) [HUD commit link](e3e45d90d8) on inductor workflow and rocm workflow ([comment](https://github.com/pytorch/pytorch/pull/147019#issuecomment-2698677222))
2025-03-04 19:20:15 +00:00
9d196edb7d Revert "Bump onnxscript to 0.2.2 in CI (#148388)"
This reverts commit 7ab6749ec7db32e0b3cdfd19db087f15dd0bebe2.

Reverted https://github.com/pytorch/pytorch/pull/148388 on behalf of https://github.com/clee2000 due to broke libtorch debug build? [GH job link](https://github.com/pytorch/pytorch/actions/runs/13646179239/job/38152039312) [HUD commit link](7ab6749ec7) ([comment](https://github.com/pytorch/pytorch/pull/148388#issuecomment-2698665495))
2025-03-04 19:16:34 +00:00
c219c5ca38 Fix code descriptions in the test package. (#148145)
Some parameter and function descriptions were wrong; this corrects them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148145
Approved by: https://github.com/janeyx99
2025-03-04 19:14:41 +00:00
e8900fbe4f [MPS] Add some useful utils (#148448)
Like `is_complex_v`, `is_scalar_integral_v`, `result_of`, etc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148448
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #148398, #148399
2025-03-04 19:09:17 +00:00
f859722f70 [dtensor] refactor sharding prop to handle cross mesh computation (#147869)
As titled, this PR moves the same-mesh check from the sharding propagation level to each individual operator level.

This gives each individual operator more flexibility to check whether it can run on the same mesh or not. For example, before this PR, if a user had two DTensor params that live on different DeviceMeshes and wanted to run a `for_each` operator on them individually, it would error out with a cross-mesh error. But for foreach computation there could be DTensors that live on different meshes, as long as the meshes are the same in a "zipped way".
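
One way to read "the same in a zipped way" is as a pairwise check: the i-th operand of each argument list must share a mesh, while different positions may use different meshes. The sketch below assumes that reading; `same_mesh_zipped` is a hypothetical helper and meshes are represented by plain strings.

```python
def same_mesh_zipped(lhs_meshes, rhs_meshes):
    """True if the i-th operands of both lists live on the same mesh, for every i."""
    if len(lhs_meshes) != len(rhs_meshes):
        return False
    return all(a == b for a, b in zip(lhs_meshes, rhs_meshes))


# Two params on different meshes, each paired with an operand on its own mesh.
print(same_mesh_zipped(["mesh_a", "mesh_b"], ["mesh_a", "mesh_b"]))  # True
# The same meshes, but paired across positions: a genuine cross-mesh operation.
print(same_mesh_zipped(["mesh_a", "mesh_b"], ["mesh_b", "mesh_a"]))  # False
```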

This should also fix https://github.com/pytorch/pytorch/issues/134212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147869
Approved by: https://github.com/tianyu-l
2025-03-04 18:30:44 +00:00
eea54a55f6 ci: Switch manywheel build.sh to just use dev (#148310)
To avoid annoying error messages like:

> fatal: no tag exactly matches 'a6520c85bd85875b09f2c68e51622699d7d07595'

These were popping up when GITHUB_REF is not set so let's just assume
that if someone is building without directly setting GITHUB_REF then
they're probably doing a dev build.

Signed-off-by: Eli Uriegas <github@terriblecode.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148310
Approved by: https://github.com/Camyll, https://github.com/atalman
2025-03-04 18:27:44 +00:00
611b0e9bc4 Revert "[fx] Optimizations for node name generation (#148288)"
This reverts commit 5eb0337cfd5e7c2cdf4a2d4829609e391467270f.

Reverted https://github.com/pytorch/pytorch/pull/148288 on behalf of https://github.com/clee2000 due to something in this stack broke some dynamo and higher order ops tests like higher_order_ops/test_invoke_subgraph.py::TestInvokeSubgraphCompile::test_dedupe [GH job link](https://github.com/pytorch/pytorch/actions/runs/13645082540/job/38149882002) [HUD commit link](8531d247ba).   dynamo/test_graph_deduplication did run on the PR but the higher_order_ops one didn't, probably combo of landrace and bad TD ([comment](https://github.com/pytorch/pytorch/pull/148288#issuecomment-2698365172))
2025-03-04 17:10:12 +00:00
ed9055c303 Revert "[fx] Optimize TracerBase.create_arg and Graph._gen_python_code (#148292)"
This reverts commit 8531d247ba411993f9a10686d70514f6945f9960.

Reverted https://github.com/pytorch/pytorch/pull/148292 on behalf of https://github.com/clee2000 due to something in this stack broke some dynamo and higher order ops tests like higher_order_ops/test_invoke_subgraph.py::TestInvokeSubgraphCompile::test_dedupe [GH job link](https://github.com/pytorch/pytorch/actions/runs/13645082540/job/38149882002) [HUD commit link](8531d247ba).   dynamo/test_graph_deduplication did run on the PR but the higher_order_ops one didn't, probably combo of landrace and bad TD ([comment](https://github.com/pytorch/pytorch/pull/148288#issuecomment-2698365172))
2025-03-04 17:10:12 +00:00
67937be673 [BE] Move sinc kernels to the same OP family (#148399)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148399
Approved by: https://github.com/dcci
ghstack dependencies: #148398
2025-03-04 15:49:20 +00:00
7fcbaff206 [BE] Remove stale arg for complex ops (#148398)
No need to pass DTYPE0 and DTYPE1 if only one DTYPE is used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148398
Approved by: https://github.com/dcci
2025-03-04 14:35:43 +00:00
f2f25a5444 Upgrade submodule oneDNN to v3.7.1 (#148293)
This PR is to upgrade submodule oneDNN to v3.7.1.

## Improvements

- Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Improved performance of int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support.
- Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
- Added bfloat16 support for SDPA, implemented fp16 and bf16 gemm kernel in SDPA.
- Fixed f16 matmul accuracy, SDPA failing to dispatch to ukernel, bf16/fp16/fp32 conv performance, an INT8 kernel triggering a page fault, a deconvolution precision issue on complex128 and fp64, and a gemm correctness issue in float16.
- Improved bf16 matmul performance with fp32 destination with Arm Compute Library (ACL).
- Improved bf16 to fp32 reorder performance.
- Improved bf16 reorder performance.
- Improved bf16 convolution with ACL.

Fixes https://github.com/pytorch/pytorch/issues/136348.

## Validation results on CPU

1. NLP models accuracy/inference/training
![image](https://github.com/user-attachments/assets/859279b8-1631-4268-b226-7de9ac5870d8)

![image](https://github.com/user-attachments/assets/30ec7151-41ca-482a-9d2d-0c4850e75bab)

2. Torchbench cpu userbenchmark inference & training

![image](https://github.com/user-attachments/assets/71c9807c-caf9-4385-9990-d2ab637031cd)

3. Inductor quantization

![image](https://github.com/user-attachments/assets/3d2a3bd3-82fa-4566-8050-7ea5d6b61675)

4. Dynamo benchmarks
![image](https://github.com/user-attachments/assets/554ecce3-c85c-4a0e-88f1-2e73983c5dcd)
![image](https://github.com/user-attachments/assets/148c88f8-4367-4428-bb54-ce8a4deefd1b)
![image](https://github.com/user-attachments/assets/f2e744f4-d710-4699-acf4-1f130ecfadf1)
![image](https://github.com/user-attachments/assets/97128b80-4d0e-495a-aeda-dde3e70c96fd)
![image](https://github.com/user-attachments/assets/a9afce37-684c-45c0-b938-6dd7e0383805)
![image](https://github.com/user-attachments/assets/b8714236-9681-4fbe-8d98-be93deedab88)
![image](https://github.com/user-attachments/assets/4423061f-d133-45ba-98bd-d2f739e50431)
![image](https://github.com/user-attachments/assets/7955da10-3d23-493e-99fa-658f7f40035b)

## Validation results on XPU
Accuracy is same as baseline. Performance is shown below.
![image](https://github.com/user-attachments/assets/7645304d-5b1d-43f9-b840-9f846ed380a0)

## Validation results on ARM
![image](https://github.com/user-attachments/assets/080f7c02-0238-436f-ad20-5a9e3f6aafbb)
![image](https://github.com/user-attachments/assets/443742aa-ca61-41de-ae80-5d4c65cd0c87)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148293
Approved by: https://github.com/mingfeima, https://github.com/atalman
2025-03-04 13:56:45 +00:00
f339e41a38 [inductor][triton] Fix average pool nd for int64 dtype (#146061)
The eager mode implementation of average pool nd returns an integer tensor if the input is also an integer tensor. This should also be preserved in inductor.

Fixes pytest -k test_comprehensive_nn_functional_avg_pool2d_cpu_int64 error: Triton compilation failed: triton_poi_fused_avg_pool2d_0

See WIP https://github.com/pytorch/pytorch/pull/145865#issuecomment-26200289890 to potentially enable such tests as they aren't enabled yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146061
Approved by: https://github.com/eellison
2025-03-04 13:53:50 +00:00
fdee60769a [DCP] Introduce process based async checkpointing (#147039)
Summary:
### Context
Background checkpoint upload thread interfering with trainer thread:

In [async save API](https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/state_dict_saver.py#L239-L248), the background thread spends a considerable amount of time on CPU-bound tasks (pickling/unpickling several metadata objects a.k.a. SavePlans) on rank0 during the collective operation; this kind of asymmetric computation heavily contends for the GIL with the trainer thread, causing GPU util to suffer significantly for the E2E checkpoint duration.

### Solution:
Introduce async save via a checkpoint daemon process. This daemon process will be created once (during the first save attempt) and can serve async checkpoint requests for the remainder of training lifetime.

Test Plan: Added E2E UTs for process based async save.

Differential Revision: D69272583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147039
Approved by: https://github.com/saumishr
2025-03-04 13:33:28 +00:00
16d07988fc add supports_coalescing property in c10d::Backend to determine whether backend supports coalescing (#135338)
1. My company is using privateuseone to connect a new hardware device and needs the `batch_isend_irecv` function. However, `batch_isend_irecv` is currently only open to CUDA, so I added a `supports_coalescing` property to `c10d::Backend` to determine whether the backend supports coalescing.
2. If `pg._has_hooks` returns True, we don't need to determine whether the current device is CUDA, so privateuseone can also support `pg._wait_for_pending_works`.
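
The idea in item 1, asking the backend for a capability instead of checking the device type, can be sketched as follows. Everything here is a toy: the `Backend` class, `CustomDeviceBackend`, and `submit_batch` are hypothetical stand-ins, not the real `c10d::Backend` or `batch_isend_irecv`.

```python
class Backend:
    """Toy stand-in for a process-group backend (not the real c10d class)."""

    @property
    def supports_coalescing(self) -> bool:
        return False

    def send(self, op, log):
        log.append(("single", op))

    def send_coalesced(self, ops, log):
        log.append(("coalesced", tuple(ops)))


class CustomDeviceBackend(Backend):
    """Hypothetical out-of-tree backend that opts in to coalescing."""

    @property
    def supports_coalescing(self) -> bool:
        return True


def submit_batch(backend, ops):
    """Ask the backend what it supports instead of checking for a CUDA device."""
    log = []
    if backend.supports_coalescing:
        backend.send_coalesced(ops, log)
    else:
        for op in ops:
            backend.send(op, log)
    return log


print(submit_batch(Backend(), ["isend", "irecv"]))
print(submit_batch(CustomDeviceBackend(), ["isend", "irecv"]))
```

With a property on the backend, a new device type only has to override one attribute rather than be special-cased at every call site.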

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135338
Approved by: https://github.com/kwen2501, https://github.com/albanD
2025-03-04 12:37:06 +00:00
e3e45d90d8 [Inductor] Record Triton’s Base32 Cache Key in .best_config for Debugging (#147019)
Modified TorchInductor’s autotuning flow so that each `best_config` JSON file also includes the Triton “base32” (or base64) cache key.

**Motivation**

Debugging & Analysis: With this change, we can quickly identify which compiled binary and IRs belong to a given best config.
The impact is minimal since it is only an extra field in .best_config. It can help advanced performance tuning or kernel-level debugging.

Also, since Triton already stores cubin/hsaco in its cache, developers/researchers can avoid setting `store_cubin = True`: they can get the cubin/hsaco from the Triton cache, and with the code provided in this PR they can easily match the best_config with the right Triton cache directory for the "best" kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147019
Approved by: https://github.com/davidberard98
2025-03-04 12:16:38 +00:00
f1cce0951b Create unique test report files for distributed tests (#148325)
The distributed tests are executed once for each backend and for each init method.
`$TEST_REPORT_SOURCE_OVERRIDE` is used such that test results from different backends are stored in different files.
The same needs to be done for the init method.

Move the setting of the variable into `test_distributed` and incorporate the init method into the name.
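
A minimal sketch of the naming idea, assuming a made-up format string: the commit only says the init method is incorporated into the name, so `report_source` and the `dist-...-init-...` pattern below are illustrative, not what `test_distributed` actually produces.

```python
import os


def report_source(backend: str, init_method: str) -> str:
    """Hypothetical naming scheme: one report source per (backend, init method) pair."""
    return f"dist-{backend}-init-{init_method}"


def env_for_run(backend, init_method, base_env=None):
    """Copy the environment and set the override for a single test run."""
    env = dict(os.environ if base_env is None else base_env)
    env["TEST_REPORT_SOURCE_OVERRIDE"] = report_source(backend, init_method)
    return env


for backend in ("gloo", "nccl"):
    for init_method in ("env", "file"):
        env = env_for_run(backend, init_method, base_env={})
        print(env["TEST_REPORT_SOURCE_OVERRIDE"])
```

Because both the backend and the init method appear in the value, no two runs in the loop share a report file name.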

Useful for e.g. https://github.com/pytorch/pytorch/issues/126523

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148325
Approved by: https://github.com/clee2000
2025-03-04 10:45:33 +00:00
0b0d28accd Optimize param prepend class reference torch.nn.Module (#148304)
Fixes #147696

## Changes

Change the `prepend` description from `torch.nn.modules.Module` to `torch.nn.Module`

## Test Result

### Before

![image](https://github.com/user-attachments/assets/054f54b7-9487-4505-a926-3e17a84bd2f9)

### After

![image](https://github.com/user-attachments/assets/1d2a5708-62d1-428e-b136-bcaa35e5e6da)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148304
Approved by: https://github.com/Skylion007
2025-03-04 08:46:14 +00:00
da2688f624 Introduce delayed compile via eager_then_compile stance (#147983)
Recently I've been experimenting with introducing new APIs to delay compile as a way to reduce compile times while improving the ergonomics of using dynamic shapes. The high level idea is to run the first invocation of compile in eager, save the example inputs, and on the second invocation we can derive the dynamism in the inputs so that we don't need to waste our time doing a compile with static shapes (which is the status quo today with automatic dynamic).

Another benefit of this is that most users no longer need to annotate their inputs with mark_dynamic and mark_unbacked calls, since we can derive the dynamism on the very first call. Additionally, we get dynamic ints out of the box in this new regime.

This PR implements this idea through the set_stance APIs. In particular it introduces a new `eager_then_compile` stance.
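
The control flow described above can be modelled without PyTorch. The sketch below is a toy: `EagerThenCompile`, `dynamic_dims`, and `fake_compile` are hypothetical names, "inputs" are just shape tuples, and the real stance is implemented inside the compiler rather than as a wrapper class.

```python
def dynamic_dims(first_shape, second_shape):
    """Indices of dimensions whose size differs between the two example inputs."""
    return tuple(i for i, (a, b) in enumerate(zip(first_shape, second_shape)) if a != b)


class EagerThenCompile:
    """Run call 1 eagerly; on call 2, compile once knowing which dims vary."""

    def __init__(self, fn, compile_fn):
        self.fn = fn
        self.compile_fn = compile_fn  # hypothetical stand-in for a real compiler
        self.first_shape = None
        self.compiled = None

    def __call__(self, shape):
        if self.first_shape is None:
            # First invocation: run eagerly and remember the example input.
            self.first_shape = shape
            return self.fn(shape)
        if self.compiled is None:
            # Second invocation: derive dynamism from the two examples, compile once.
            self.compiled = self.compile_fn(self.fn, dynamic_dims(self.first_shape, shape))
        return self.compiled(shape)


compiled_with = []


def fake_compile(fn, dims):
    compiled_with.append(dims)  # record what the "compiler" was told
    return fn


def numel(shape):
    result = 1
    for size in shape:
        result *= size
    return result


f = EagerThenCompile(numel, fake_compile)
f((8, 16))   # eager, nothing compiled yet
f((32, 16))  # compiled with dimension 0 marked dynamic
print(compiled_with)
```

The point of the sketch is that no compile with static shapes ever happens: the single compile already knows dimension 0 varies.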

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147983
Approved by: https://github.com/williamwen42
2025-03-04 07:46:31 +00:00
e0f0db0105 updates to benchmarks (#144831)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144831
Approved by: https://github.com/danielvegamyhre
2025-03-04 06:21:12 +00:00
ac99fc7e57 Updates to build rowwise scaled mm kernel on SM10.0a (#148274)
## Summary
Update cmake files and RowwiseScaledMM.cu to build on SM10.0a arch.

**NOTE**: performance optimization will be done in separate follow up PRs

## Steps to verify build
1. Access devgpu/machine with B200 GPUs, verify B200s are visible w/ `nvidia-smi`
2. Install CUDA toolkit 12.8
    - e.g. see [Nvidia docs](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Rocky&target_version=9&target_type=rpm_local)
3. Verify CUDA toolkit installation
    - e.g. `nvcc --version` should have `... Cuda compilation tools, release 12.8 ... ` in output
4. Set env var `TORCH_CUDA_ARCH_LIST=10.0a`
5. Build pytorch from source with this PR ([steps](https://github.com/pytorch/pytorch#from-source))
6. Uninstall `pytorch-triton` with `pip uninstall pytorch-triton`
7. Build and install triton from source: https://github.com/triton-lang/triton?tab=readme-ov-file#install-from-source
8. Run tests shown in test plan below

**NOTE**: performance optimization will be done in a separate PR. The goal of this PR is just to ensure it builds correctly.

## Test plan
- `python test/distributed/tensor/test_matrix_ops.py  -k scaled_mm`: OK
- `python test/test_matmul_cuda.py -k rowwise`: OK
- `python test/test_flop_counter.py -k scaled_mm`: OK
- `python test/inductor/test_aot_inductor.py -k fp8`: OK
- `python test/inductor/test_fp8.py`: OK

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148274
Approved by: https://github.com/drisspg
2025-03-04 05:23:41 +00:00
7ab6749ec7 Bump onnxscript to 0.2.2 in CI (#148388)
Unblock https://github.com/pytorch/pytorch/pull/148140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148388
Approved by: https://github.com/malfet
2025-03-04 04:21:58 +00:00
d54cab78e1 [codemod] Fix missing field initializer in caffe2/torch/lib/libshm/manager.cpp +1 (#148393)
Summary:
The LLVM warning `-Wmissing-field-initializers` has found one or more structs in this diff's files which were missing field initializers.

This can be unintended such as:
```
my_struct s1 = {0}; // Initializes *only* the first field to zero; others to default values
my_struct s2 = {}; // Initializes *all* fields to default values (often zero)
```
or it may be because only some of the members of a struct are initialized, perhaps because the items were added to the struct but not every instance of it was updated.

To fix the problem, I've either used `{}` to initialize all fields to default or added appropriate default initializations to the missing fields.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: dtolnay

Differential Revision: D70472663

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148393
Approved by: https://github.com/Skylion007
2025-03-04 04:20:04 +00:00
70410f93f2 doc/xpu: align description of SyclExtension with CPP/CUDA (#147988)
This commit just aligns the description of the `py_limited_api` feature in SyclExtension with CPP/CUDA. We missed this change when adding SyclExtension because the two changes were developed in parallel. For CPP/CUDA, the change was done in 515e55e6927ad5f57ec222d7779712630341acf3.

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147988
Approved by: https://github.com/janeyx99, https://github.com/guangyey
2025-03-04 04:17:36 +00:00
cyy
ec2805ada8 Remove outdated CUDA version check (#148142)
Since Torch requires CUDA>=11, some checks can be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148142
Approved by: https://github.com/janeyx99, https://github.com/eqy
2025-03-04 03:33:44 +00:00
cyy
98bf2f1170 Use Python 3.9 typing (#148157)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148157
Approved by: https://github.com/janeyx99
2025-03-04 03:09:55 +00:00
cyy
b7832f0339 Enable ASAN in CUDA tests (#147812)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147812
Approved by: https://github.com/janeyx99
2025-03-04 02:50:39 +00:00
8531d247ba [fx] Optimize TracerBase.create_arg and Graph._gen_python_code (#148292)
Before: 19502951 function calls (18702776 primitive calls) in 8.533 seconds
After: 16402551 function calls (15602452 primitive calls) in 7.701 seconds
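
For a sense of scale, the relative savings implied by the two lines above can be computed directly from the quoted totals (no other data is assumed):

```python
before_calls, after_calls = 19_502_951, 16_402_551
before_secs, after_secs = 8.533, 7.701

# Relative reduction, as a percentage of the "before" figure.
calls_saved_pct = 100 * (before_calls - after_calls) / before_calls
time_saved_pct = 100 * (before_secs - after_secs) / before_secs

print(f"function calls: -{calls_saved_pct:.2f}%")
print(f"wall time:      -{time_saved_pct:.2f}%")
```

That is roughly a 16% reduction in function calls and a bit under 10% in total time.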

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148292
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261, #148303, #148288
2025-03-04 02:42:23 +00:00
5eb0337cfd [fx] Optimizations for node name generation (#148288)
Before:
![image](https://github.com/user-attachments/assets/3a9ed22b-ae33-41ec-a0db-01f4f3ca2ffe)

After:
![image](https://github.com/user-attachments/assets/44c6e578-c63e-4a43-b3e0-d11d4bdbb6db)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148288
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261, #148303
2025-03-04 02:42:23 +00:00
a3d69e6e1a Better log message to update pr_time_benchmarks/expected_results.csv (#148303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148303
Approved by: https://github.com/Skylion007
ghstack dependencies: #148243, #148260, #148261
2025-03-04 02:42:23 +00:00
17518007b2 [cutlass backend] Benchmark compared to aten and triton (#148347)
Benchmark for cutlass backend.

```
python benchmarks/inductor_backends/cutlass.py
```

Test Plan:
```
Experiment group: mm (1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 12.759539298713207 |  2.7271360370796174  |         NA          |
|        triton         | 10.573655366897583 |  1.8661278090439737  | -17.131370346859384 |
| triton_persistent_tma | 10.884030722081661 |  0.5315794269554317  | -14.698873781600327 |
|  cutlass_lvl_default  | 13.09632882475853  |  0.5520401500398293  | 2.6395116481931873  |
|   cutlass_lvl_1111    | 11.05172373354435  |  0.569593315012753   | -13.384617776451302 |
|   cutlass_lvl_2222    | 11.371277272701263 |  133.58984916994814  | -10.880189272601317 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (1024x1024, 1024x1024) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 14.472318813204765 |  1.5445372510002926  |         NA          |
|        triton         | 10.568295605480671 |  16.583424195996486  | -26.975796056689987 |
| triton_persistent_tma | 10.45411266386509  |  5.830657540936954   | -27.764770809729562 |
|  cutlass_lvl_default  | 12.742593884468079 |  28.994930602959357  | -11.951954286402668 |
|   cutlass_lvl_1111    | 11.522261425852776 |  79.85037935699802   | -20.38413764531163  |
|   cutlass_lvl_2222    | 10.993581265211105 |  132.86601971101481  | -24.037181552548486 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 30.700622126460075 |  2.225986961973831   |         NA          |
|        triton         | 29.17378954589367  |  38.571991189033724  |  -4.97329524553989  |
| triton_persistent_tma | 29.642896726727486 |   7.2848734309664    | -3.4452897904663744 |
|  cutlass_lvl_default  | 29.514770954847336 |  29.819900761009194  | -3.8626291243482167 |
|   cutlass_lvl_1111    | 29.411429539322853 |  23.82907024596352   |  -4.19923929172139  |
|   cutlass_lvl_2222    | 29.57325428724289  |  134.31008586101234  | -3.672133530628152  |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 30.858177691698074 |  1.181898436974734   |         NA         |
|        triton         | 28.630023822188377 |  39.24473957403097   | -7.220626868414034 |
| triton_persistent_tma | 28.641965240240097 |  5.275042273919098   | -7.181929126210897 |
|  cutlass_lvl_default  | 29.16003204882145  |  29.934022572939284  | -5.503065216107967 |
|   cutlass_lvl_1111    | 28.79570797085762  |  23.948012012057006  | -6.683705504085324 |
|   cutlass_lvl_2222    | 29.02756631374359  |  136.25560767308343  | -5.932337924306467 |
+-----------------------+--------------------+----------------------+--------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.float16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 1456.143856048584  |  1.020197194069624   |         NA         |
|        triton         | 1708.2737684249878 |  5.766509635956027   | 17.31490410985819  |
| triton_persistent_tma | 1476.485013961792  |  7.455113030038774   | 1.3969195302177155 |
|  cutlass_lvl_default  | 1583.3594799041748 |  50.408804678940214  | 8.736473620182366  |
|   cutlass_lvl_1111    | 1636.4418268203735 |  82.82403108896688   | 12.381879030898025 |
|   cutlass_lvl_2222    | 1507.5665712356567 |  260.03901409788523  | 3.531430975962381  |
+-----------------------+--------------------+----------------------+--------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 1382.230520248413  |  1.2586536260787398  |         NA         |
|        triton         | 1646.9683647155762 |  5.442052865982987   | 19.15294450447995  |
| triton_persistent_tma | 1423.9195585250854 |  6.515797697938979   | 3.016069871556595  |
|  cutlass_lvl_default  | 1500.9030103683472 |  51.36402789200656   |  8.58557877152115  |
|   cutlass_lvl_1111    | 1446.9740390777588 |  30.65435610699933   | 4.683988515729638  |
|   cutlass_lvl_2222    | 1419.661521911621  |  205.1948991640238   | 2.7080144096717635 |
+-----------------------+--------------------+----------------------+--------------------+
```
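
The `perf_over_aten (%)` column appears to be the relative change in forward time against the `aten` row, so negative values mean faster than aten. Below is a minimal sketch under that assumption. The helper name `perf_over_aten` is hypothetical and not taken from the benchmark script; the inputs are copied from the first table.

```python
def perf_over_aten(forward_time_us: float, aten_time_us: float) -> float:
    """Relative change in forward time versus aten, in percent.

    Assumed formula: negative means faster than aten, positive means slower.
    """
    return (forward_time_us - aten_time_us) / aten_time_us * 100.0


# Values copied from the mm (1024x1024, 1024x1024) torch.float16 table.
aten_us = 12.759539298713207
triton_us = 10.573655366897583
cutlass_default_us = 13.09632882475853

print(f"triton:              {perf_over_aten(triton_us, aten_us):.4f}")
print(f"cutlass_lvl_default: {perf_over_aten(cutlass_default_us, aten_us):.4f}")
```

Read this way, the 8192x8192 tables show every non-aten backend as slower than aten, while the smaller shapes mostly show the opposite.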

Differential Revision: D70147589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148347
Approved by: https://github.com/drisspg, https://github.com/chenyang78
2025-03-04 01:45:36 +00:00
c21dc11a17 [Intel GPU] Enable SDPA on XPU (#147614)
Motivation
===

This PR is part of the OneDNN upstreaming plan, as stated in #114848 [(comment)](https://github.com/pytorch/pytorch/issues/114848#issuecomment-2451553203). SDPA is supported via the overridable variant on the XPU backend. Besides the added `Attention.cpp` file, `Graph.h` is added to hold utilities for OneDNN graph, including those for kernel/compile graph caching. In addition, a selection of test cases in `test/test_transformers.py` are copied into the new `test/xpu/test_transformers.py` and modified accordingly to provide additional tests beyond `./third_party/torch-xpu-ops/test/xpu/test_ops_xpu.py`.

Depends on OneDNN version v3.7 upgrade in #147498
Depends on BUILD_GRAPH switch in #147608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147614
Approved by: https://github.com/jansel, https://github.com/EikanWang
2025-03-04 01:40:45 +00:00
b17f5223a4 Generate AOTI input check by default (#148005)
Summary:
Generate AOTI size and stride input check by default. But the checks are only run if `AOT_INDUCTOR_DEBUG_COMPILE` env variable is set (to avoid slowing down the performance).

Example output:

```cpp
            bool _check_aoti_runtime_check_inputs_env() {
                const static char* env_var_value = getenv("AOTI_RUNTIME_CHECK_INPUTS");
                const static bool result = env_var_value != nullptr && env_var_value[0] != '\0';
                return result;
            }

            AOTI_NOINLINE static void __check_inputs_outputs(
                AtenTensorHandle* input_handles,
                AtenTensorHandle* output_handles) {
                if (!_check_aoti_runtime_check_inputs_env()){
                    return;
                }
//rest of the check
}

```
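
The gating pattern in the generated C++ (read an environment variable once, cache the answer, return early when it is unset) can be sketched in Python as follows. This is an illustration only, not AOTI code. It follows the variable name shown in the example output, `AOTI_RUNTIME_CHECK_INPUTS`, which differs from the name in the summary text. The function names and the `(size, stride)` tuple representation are assumptions.

```python
import functools
import os


@functools.lru_cache(maxsize=None)
def runtime_check_inputs_enabled() -> bool:
    """Read the env var once and cache the answer, like the C++ static local."""
    value = os.environ.get("AOTI_RUNTIME_CHECK_INPUTS")
    return value is not None and value != ""


def check_inputs(actual, expected) -> None:
    """Compare (size, stride) pairs; a no-op unless the env var is set.

    `actual` and `expected` are hypothetical lists of (size, stride) tuples.
    """
    if not runtime_check_inputs_enabled():
        return  # checks are generated but skipped, so the hot path stays cheap
    for i, (got, want) in enumerate(zip(actual, expected)):
        if got != want:
            raise ValueError(f"input {i}: expected {want}, got {got}")
```

Because the lookup is cached, changing the variable after the first call has no effect in this sketch unless `runtime_check_inputs_enabled.cache_clear()` is called.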

Test Plan: CI

Differential Revision: D70260490

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148005
Approved by: https://github.com/hl475, https://github.com/desertfire, https://github.com/jingsh
2025-03-04 00:55:14 +00:00
0bd2caac55 Docker release - pin buildkit to v0.19.0 (#148372)
Fix nightly build failure during arm64 docker build (since 02.21.2025): https://github.com/pytorch/pytorch/actions/runs/13452177170/job/37588508155#step:12:851

Error:
```
#10 73.62 Segmentation fault (core dumped)
#10 73.67 qemu: uncaught target signal 11 (Segmentation fault) - core dumped
#10 73.85 Segmentation fault (core dumped)
#10 73.85 dpkg: error processing package libc-bin (--configure):
#10 73.85  installed libc-bin package post-installation script subprocess returned error exit status 139
```
Looks like we are hitting: https://github.com/moby/buildkit/issues/5783

Update setup-qemu and buildkit actions to v3 and buildkit to v0.19.0

Please note: the CUDA 12.8 error is not related to this failure in nightly cpu arm64. It looks like we are trying to install release torch when running on a PR. The CUDA 12.8 build is not released yet, hence the failure. Will send a follow-up to make sure we are using nightly torch when running on a PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148372
Approved by: https://github.com/seemethere
2025-03-03 23:55:30 +00:00
d43c6f0033 [invoke_subgraph] Run joint passes on the hop graphs (#139325)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139325
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
ghstack dependencies: #147559
2025-03-03 23:38:14 +00:00
216a108aaf [ROCm] Add rocm-mi300 and inductor-rocm-mi300 to upload-test-stats.yml (#148365)
We currently run MI300X machines on rocm-mi300 and inductor-rocm-mi300 but we don't have artifacts for the results:
e.g.
6e10471966 (rocm-mi300)
![image](https://github.com/user-attachments/assets/f5588072-b818-4f54-a348-0e6ac7e96829)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148365
Approved by: https://github.com/jeffdaily
2025-03-03 23:22:56 +00:00
586d8df651 Fix condition for CONVERT_NON_VECTORIZED_INIT invocation (#148362)
Yet another regression caused by https://github.com/pytorch/pytorch/pull/146596 that breaks builds if PyTorch is compiled for Android or on NVIDIA GraceHopper systems

Not sure why the author was trying to change the condition to begin with

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148362
Approved by: https://github.com/izaitsevfb
ghstack dependencies: #148354
2025-03-03 23:13:37 +00:00
5887a2d8de [BE] Use C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED (#148354)
Instead of `#pragma GCC diagnostic ignored "-Wignored-qualifiers"`
Also limit the scope to just `Vectorized::map` that has to be declared that way due to sleef function signature definitions that return `const __m256` for AVX2 methods

Also delete `#pragma GCC diagnostic pop` from vec256_half and vec256_bfloat16 as it results in an unbalanced pop warning, for push that is defined in vec256_16bit_float, which will be included only once
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec.h:7:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256.h:15:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256_half.h:232:27: warning: pragma diagnostic pop could not pop, no matching push [-Wunknown-pragmas]
  232 | #pragma GCC diagnostic pop
      |                           ^
1 warning generated.

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148354
Approved by: https://github.com/izaitsevfb
2025-03-03 23:00:47 +00:00
d0b23e661d [cutlass backend] Add main tests for mm, addmm and bmm - step 1 (#148229)
This adds very good coverage for normal mm tests {aoti x torch.compile} x {default, dynamic}.

There are some parts that are less tested. For example:
* different layout combo
* shapes that are less aligned

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148229
Approved by: https://github.com/chenyang78
2025-03-03 22:31:46 +00:00
a41413829c Use release notes label for module: distributed_checkpoint (#148352)
module: distributed_checkpoint is redundant with oncall: distributed checkpointing.

@fduwjj let us know that module: distributed_checkpoint is just used for release notes, so let's use the release notes label for the release notes, which the bot will pick up better.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148352
Approved by: https://github.com/fegin
2025-03-03 21:33:28 +00:00
e45040b1d3 [c10d] Add hccl distributed backend to c10d data structures (#146478)
# MOTIVATION
Intel Gaudi is an out-of-tree PyTorch accelerator having its own device/dispatch key ```hpu```.
With this change we add entries for Gaudi's distributed backend ```hccl``` to the c10d Backend data structures.
This is to ensure that there is no naming conflict in case a new in-tree accelerator is introduced with the same backend name.

The Out-of-tree backends are registered calling fd0cd6a08f/torch/distributed/distributed_c10d.py (L302)

Successful registration adds the backend name to the list :
fd0cd6a08f/torch/distributed/distributed_c10d.py (L265)

We are binding the process group creator constructs at run-time, so if there are other distributed backends with the same device name they can safely add the device type to the dictionary

fd0cd6a08f/torch/distributed/distributed_c10d.py (L274)

And add another entry to the dictionary with the same backend name ( but different device name )
fd0cd6a08f/torch/distributed/distributed_c10d.py (L268)

In addition the out-of-tree devices can utilize the ```backend_list``` to check for successful backend registration  eg: APIs like ```is_hccl_available```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146478
Approved by: https://github.com/H-Huang
2025-03-03 21:32:21 +00:00
52078154f2 Add support for no-op concat with padded output (#146866)
Add support for no-op concat with padded output

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146866
Approved by: https://github.com/shunting314
2025-03-03 21:10:46 +00:00
07f876e960 Subprocess compile (#146134)
Add a mode to `fx_codegen_and_compile()` to compile in a separate process. This is to prepare for async compile where we'll compile and run eager in parallel (and also be able to move the compile phase to a remote computer).

Added a test which runs the test_torchinductor tests with subprocess compiling turned on.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146134
Approved by: https://github.com/jamesjwu
2025-03-03 21:10:12 +00:00
8f361c808b [dynamo] run-only recursively on recompile limit exceeded (#148021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148021
Approved by: https://github.com/anijain2305
2025-03-03 21:01:08 +00:00
1bbe57336b Replace unimplemented with unimplemented_v2 for dynamo (#148158)
torch/_dynamo/variables/constant.py

https://github.com/pytorch/pytorch/issues/147913

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148158
Approved by: https://github.com/williamwen42, https://github.com/Skylion007
2025-03-03 21:00:17 +00:00
b162b1600b [Inductor] Hot fix after #148011 (#148270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148270
Approved by: https://github.com/davidberard98
2025-03-03 20:18:21 +00:00
d260d4fc55 HSDP custom hook UTs are multi-threaded - can't set device rank (#148099)
HSDP custom hook UTs are multi-threaded and using single physical GPU. If we set rank in each thread, then we are referencing the same GPU with multiple ranks, which isn't right. Therefore, removing the rank setting from these UTs. Now, they are passing with 1, 2, 4 GPUs.

Fixes #147767 and #147769

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148099
Approved by: https://github.com/jeffdaily
2025-03-03 19:48:49 +00:00
302c660298 Consistently use load_torchbind_test_lib in tests (#148082)
The same code is repeated multiple times with slightly different implementations.
Use the existing function for brevity and consistency.

In the function the code from `test_export` is used which does a single `load_library` with cleaner conditions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148082
Approved by: https://github.com/angelayi
2025-03-03 19:37:28 +00:00
40c2505f16 [logging] Log individual Triton kernel compilation times to dynamo_compile (#147022)
Summary: Gather the compilation time of individual triton kernels and log them to dynamo_compile:
* Time compilation in `_worker_compile_triton` and pass back to the main process and logged from `get_result()`.
* Added a way to track the "top N" (or N most-expensive compiles) in the metrics_context. I did this because I doubt we really care to capture potentially thousands of kernel compile times. That would be problematic for scuba logging anyway, so let's limit the number we track from the beginning. Arbitrarily chose 25 for now.
* Format the list of compile times as a json string before logging.
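
The "top N" tracking described in the list above can be sketched with a bounded min-heap, so that memory stays fixed no matter how many kernels are compiled. This is a minimal illustration, not the `metrics_context` implementation. The class name `TopN` and the JSON field names are hypothetical; the default limit of 25 is the value quoted above.

```python
import heapq
import json


class TopN:
    """Keep only the N largest (compile_time, name) entries seen so far."""

    def __init__(self, limit=25):
        self.limit = limit
        self._heap = []  # min-heap: the cheapest tracked entry sits at index 0

    def add(self, name, compile_time):
        entry = (compile_time, name)
        if len(self._heap) < self.limit:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            # Replace the cheapest tracked compile with the more expensive one.
            heapq.heapreplace(self._heap, entry)

    def to_json(self):
        # Most expensive first, serialized as one JSON string for logging.
        ordered = sorted(self._heap, reverse=True)
        return json.dumps(
            [{"kernel": name, "compile_time": t} for t, name in ordered]
        )


tracker = TopN(limit=2)
for kernel_name, seconds in [("k0", 0.5), ("k1", 2.0), ("k2", 1.25)]:
    tracker.add(kernel_name, seconds)
print(tracker.to_json())
```

Each `add` costs O(log N), and the logged payload is bounded by N entries regardless of the number of kernels.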

Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
Scuba: https://fburl.com/scuba/dynamo_compile/sandbox/nc4dzm3r

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147022
Approved by: https://github.com/jamesjwu
2025-03-03 19:32:17 +00:00
aade4fbd55 Expose the rendezvous keepalive arguments (#145228)
Enables support for this:

```python
from torch.distributed.launcher.api import LaunchConfig

config = LaunchConfig(
    ...,
    rdzv_configs={"keep_alive_interval": 1122, "heartbeat_timeout": 321, "keep_alive_max_attempt": 5},
)
```

These arguments are currently hard-coded inside torchrun. The default values are not suitable for jobs with thousands of ranks.

Today, `rdzv_configs` only allows the keys `join_timeout`, `last_call_timeout`, `close_timeout`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145228
Approved by: https://github.com/wconstab
2025-03-03 19:11:56 +00:00
a929e11e4f [dynamic shapes][export] ignore when real-tensor fallback fails (#147779)
Summary: uninspired solution to https://github.com/pytorch/pytorch/issues/147402

Test Plan: test_draft_export

Differential Revision: D70132269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147779
Approved by: https://github.com/bobrenjc93
2025-03-03 19:09:56 +00:00
cyy
09291817b2 Fix extra semicolon warning (#148291)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148291
Approved by: https://github.com/Skylion007
2025-03-03 18:51:44 +00:00
1c544a9ddd [Inductor-CPP] If all of the activation scale dims are 1, make it a 0D tensor (#147033)
For int8 dynamically quantized activations & int8 quantized weights, add a workaround for an indexing issue in the epilogue creator, which expected an empty index (and therefore a 0D tensor) when the activation scale was sized [1, 1]. The workaround converts the scale into a 0D tensor.

The issue was discovered while running LLaMA2 quantized with torchao's `int8_dynamic_activation_int8_weight` quantization on CPU with max-autotune enabled (although this error would've occurred regardless).

The final hidden states tensor, which is the activation fed to the LM head, is of shape `[batch_size, sequence_length, hidden_dim]` during decoding. For decoding one token at a time with batch size 1, the sequence length is 1. The activation scale is shaped `[1, 1]` (reshaped from `[1, 1, 1]`). However, the Inductor epilogue creator expects a 0D tensor in this case (my guess is that the corresponding logic in Inductor expects a 0D tensor if a tensor has only one element, even if it's 1D?).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147033
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel
2025-03-03 18:32:27 +00:00
57addfcd58 Significantly speed up save_cache_artifacts (#148227)
While using save_cache_artifacts on internal workloads, we have noticed that repeatedly calling this function after every batch is incredibly expensive. This PR significantly speeds up this function call by opting out of pickle and redesigning serialization algorithm.

Essentially what we want is to be able to call serialize many times without incurring costs from scratch.
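
One way to make repeated `serialize` calls cheap is to encode each artifact once, when it is recorded, and have `serialize` only join the cached bytes. The sketch below shows that idea with length-prefixed records. It is an assumption-laden illustration, not the format or algorithm used by `save_cache_artifacts`; `IncrementalSerializer`, `record`, and `deserialize` are hypothetical names.

```python
import struct


class IncrementalSerializer:
    """Encode each artifact once when recorded; serialize() only joins bytes."""

    def __init__(self):
        self._chunks = []  # already-encoded records, in insertion order

    def record(self, key, payload):
        key_bytes = key.encode("utf-8")
        # Length-prefixed record: key length, payload length, key, payload.
        header = struct.pack("<II", len(key_bytes), len(payload))
        self._chunks.append(header + key_bytes + payload)

    def serialize(self):
        # No per-call re-encoding: the cost is a single concatenation.
        return b"".join(self._chunks)


def deserialize(data):
    """Inverse of IncrementalSerializer.serialize()."""
    records, offset = [], 0
    while offset < len(data):
        key_len, payload_len = struct.unpack_from("<II", data, offset)
        offset += 8
        key = data[offset:offset + key_len].decode("utf-8")
        offset += key_len
        records.append((key, data[offset:offset + payload_len]))
        offset += payload_len
    return records
```

Calling `serialize()` after every batch then costs one concatenation of existing chunks, rather than a fresh pickle of every artifact.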

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148227
Approved by: https://github.com/jamesjwu
ghstack dependencies: #148226
2025-03-03 17:28:41 +00:00
3ca1a2564d [BE][MPS] Use copysign for imaginary part of sqrt (#148286)
Also, it's tempting to replace `a*a + b*b` with `dot(input[index])`, but for some reason it results in a slightly different output
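
For reference, the textbook formula for the principal square root of `a + bi` uses `copysign` to give the imaginary part the sign of `b`. The sketch below writes that formula in Python and is checked against `cmath.sqrt` in the usage lines. It illustrates the math only and is not a transcription of the Metal kernel; `complex_sqrt` is a hypothetical name.

```python
import cmath
import math


def complex_sqrt(a: float, b: float) -> complex:
    """Principal square root of a + bi via the textbook half-angle formula."""
    modulus = math.hypot(a, b)  # sqrt(a*a + b*b)
    real = math.sqrt((modulus + a) / 2.0)
    # copysign gives the imaginary part the sign of b, including b == -0.0.
    imag = math.copysign(math.sqrt((modulus - a) / 2.0), b)
    return complex(real, imag)


for a, b in [(3.0, 4.0), (-4.0, 0.0), (-4.0, -0.0), (0.0, 2.0)]:
    print((a, b), complex_sqrt(a, b), cmath.sqrt(complex(a, b)))
```

Using `copysign` rather than a comparison such as `b < 0` keeps the branch cut consistent for signed zeros, which is the case a plain comparison misses.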
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148286
Approved by: https://github.com/dcci
ghstack dependencies: #148285
2025-03-03 16:03:54 +00:00
84502baaff [MPS] Fix sqrt and other for torch.chalf (#148285)
Those kernels, instead of being instantiated for half2 (which corresponds to ComplexHalf), were instantiated for short2, which resulted in the following test
```
% python3 -c "import torch; print(torch.rand(6, device='mps', dtype=torch.chalf).sqrt())"
```
failing with
```
RuntimeError: Failed to create function state object for: sqrt_complex_half_half
```

As sqrt is not implemented for CPU, add explicit test to `test_sqrt`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148285
Approved by: https://github.com/dcci
2025-03-03 16:03:54 +00:00
d57f617844 [Inductor][CPP] Avoid transpose with cpp micro-gemm for FlexAttention (#147069)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147069
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/drisspg
ghstack dependencies: #147068
2025-03-03 15:22:11 +00:00
6c089f5da3 ci: move xpu triton build to manylinux 2.28 (#148195)
Follow PR #148129 to remove manylinux builds for triton xpu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148195
Approved by: https://github.com/seemethere
2025-03-03 12:31:08 +00:00
165e33531c [Inductor][CPP] Fix the vec codegen for tanh (#148254)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/148241. The previous vectorized code generation for `tanh` used a decomposed implementation, leading to numerical differences that were further amplified by `atan2`. For example, in the given test case after `tanh`, the eager output at `[0,0,11,47]` was `-5.820766091346741e-10`, while the compiled output was `1.4319084584712982e-08`, resulting in different `atan2` outputs of `-2.3561` and `0.7853`. This issue is fixed by switching to the Sleef implementation.
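
The amplification can be reproduced with the two reported values alone, because `atan2` depends on the signs of its arguments and not only on their magnitudes. The sketch below assumes the same `tanh` output is passed as both arguments of `atan2`. That assumption is consistent with the reported results (close to -3π/4 and π/4) but is not stated in the description above.

```python
import math

eager = -5.820766091346741e-10     # eager tanh output reported above
compiled = 1.4319084584712982e-08  # compiled tanh output reported above

# Assumption: atan2 receives the same tanh output for both arguments.
eager_angle = math.atan2(eager, eager)
compiled_angle = math.atan2(compiled, compiled)

print(f"absolute difference after tanh: {abs(eager - compiled):.3e}")
print(f"atan2 (eager):    {eager_angle:.4f}")
print(f"atan2 (compiled): {compiled_angle:.4f}")
```

A difference of roughly 1.5e-08 near zero flips the sign of both arguments, which moves the angle by π.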

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_tanh_atan2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148254
Approved by: https://github.com/malfet, https://github.com/jgong5
2025-03-03 11:46:57 +00:00
118a165ac5 [Inductor][CPP] Add transposed B matrix support for CppMicroGemmFP32Vec (#147068)
* Add transposed B support for CppMicroGemmFP32Vec.
* Add support for cases where N is not divisible by `block_n`.
Expand CppMicroGemmFP32Vec to generate gemm kernel that supports transposed B and N of arbitrary size.
This is the basis for https://github.com/pytorch/pytorch/pull/147069 to get better performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147068
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2025-03-03 11:08:23 +00:00
6a3a1f96ce Enable XPU for Inductor MM Triton Kernel Benchmark (#148237)
#147620 enabled `force_shape_pad` for triton kernel benchmark. Intel GPU supports this scenario. Hence, we need to enable the case in this PR. Otherwise, there would be a test case regression for Intel GPU as #147620 has been landed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148237
Approved by: https://github.com/jansel
2025-03-03 10:09:06 +00:00
b3bb73e11c Separate transpose from memory load/store and add load size support for convert_to_int32 (#147067)
Separate transpose from memory load/store and add load size support for convert_to_int32 to facilitate the expansion for CppMicroGemmFP32Vec.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147067
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-03 02:56:16 +00:00
ab81ca5053 [Inductor][CPU] Add GEMM templates for _weight_int4pack_mm_for_cpu with AVX512 (#146756)
**Summary**
It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU.

This PR adds GEMM templates for `torch.ops.aten._weight_int4pack_mm_for_cpu`. The micro kernel used for the templates is based on AVX512 and is a copy of the ATen implementation of `torch.ops.aten._weight_int4pack_mm_for_cpu` with minor changes.

Due to better blocking and loop schedule, the GEMM template based implementation outperforms the ATen implementation in all cases we tested.

**Test plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_avx512
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146756
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-03 00:56:29 +00:00
608377d341 Revert "[import][inductor] Simplify grid handling (#147583)"
This reverts commit b59776d8572a56e2d2366174eac11015b1776f1e.

Reverted https://github.com/pytorch/pytorch/pull/147583 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147583#issuecomment-2693016036))
2025-03-03 00:49:32 +00:00
29c2de9ae1 [fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```
after:
```
20003454 function calls (19203257 primitive calls) in 8.936 seconds
```
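
The "function calls ... (primitive calls) in ... seconds" lines have the shape of a `cProfile`/`pstats` summary. Below is a minimal sketch of collecting such a line, assuming that is how the numbers were gathered. The workload is a stand-in so the snippet runs without PyTorch, and `profile_summary` is a hypothetical helper name.

```python
import cProfile
import functools
import io
import operator
import pstats


def profile_summary(fn) -> str:
    """Run fn under cProfile and return the pstats summary line."""
    profiler = cProfile.Profile()
    profiler.runcall(fn)
    buffer = io.StringIO()
    # print_stats(0) emits the header and summary without per-function rows.
    pstats.Stats(profiler, stream=buffer).print_stats(0)
    return next(
        line.strip()
        for line in buffer.getvalue().splitlines()
        if "function calls" in line
    )


# Stand-in workload; the commit message profiles an fx.symbolic_trace call instead.
print(profile_summary(lambda: functools.reduce(operator.add, range(100000))))
```

The "primitive calls" figure only appears when it differs from the total, that is, when the profiled code recurses.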

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148261
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260
2025-03-02 22:42:31 +00:00
0135f57f4a [fx] Move Node._update_args_kwargs to C++ (#148260)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```
after:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148260
Approved by: https://github.com/oulgen
ghstack dependencies: #148243
2025-03-02 22:42:31 +00:00
edaff88f69 [fx] Move map_aggregate to C++ (#148243)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
30603618 function calls (29403419 primitive calls) in 13.744 seconds
```
after:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148243
Approved by: https://github.com/oulgen
2025-03-02 22:42:31 +00:00
94afb165d9 Revert "[c10d] Add hccl distributed backend to c10d data structures (#146478)"
This reverts commit dae3fbfe9720e83e7e81d41430fb5067221bbed7.

Reverted https://github.com/pytorch/pytorch/pull/146478 on behalf of https://github.com/malfet due to This seems to break ROCM tests, see dae3fbfe97 ([comment](https://github.com/pytorch/pytorch/pull/146478#issuecomment-2692913573))
2025-03-02 21:22:04 +00:00
1106eb0212 [BE] Fix extra semicolon warning (#148284)
Introduced by https://github.com/pytorch/pytorch/pull/146596

I.e. while building locally my log was littered with
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/LossNLL2d.cpp:5:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/cpu/utils.h:5:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec.h:7:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256.h:15:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256_half.h:228:42: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
  228 | LOAD_FP32_NON_VECTORIZED_INIT(Half, fp16);
      |                                          ^
2 warnings generated.
[230/1017] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/LossNLL.cpp.o
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/LossNLL.cpp:9:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/cpu/utils.h:5:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec.h:7:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256.h:14:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256_bfloat16.h:228:46: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
  228 | LOAD_FP32_NON_VECTORIZED_INIT(BFloat16, bf16);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148284
Approved by: https://github.com/Skylion007
2025-03-02 19:06:46 +00:00
6d70b42810 [BE][Ez]: Update fmt submodule to 11.1.4 (#148264)
This minor release is mostly bugfixes, ABI fixes, and compiler support fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148264
Approved by: https://github.com/jansel, https://github.com/cyyever
2025-03-02 19:00:00 +00:00
95d81d21a6 [MPS] Speedup interpolation (#148277)
First of all, the perf claims made in https://github.com/pytorch/pytorch/pull/145581 and https://github.com/pytorch/pytorch/pull/148154 are too good to be true (due to a bug in the benchmark script, which did not call `torch.mps.synchronize` at the end), but still slightly better than MPS, probably due to the launch overhead.

And while measuring performance correctly, I've noticed that a lot of time is spent on 64-bit integer division of thread_index to get spatial coordinates. Simply downcasting the divisor to a 32-bit integer (which is also the thread index) speeds it up almost 2x for bilinear and bicubic, as can be demonstrated by running the following script
```python
import torch
import time
import subprocess
import itertools

def benchmark(device, dtype, mode="bilinear", antialias=False, sf=.5):
    # Create example inputs
    x = torch.testing.make_tensor(1, 1, 2048, 2048, device=device, dtype=dtype)

    # define kwargs
    kwargs = {"antialias": antialias, "mode": mode, "scale_factor": sf}

    # Skip for unimplemented flavors
    if antialias and mode == "bicubic" and device == "mps":
       return None, "Skip"
    elif antialias and dtype != torch.float32:
       if device == "cpu":
           return None, "Skip"
       outputs_match = None
    else:
        # Check output
        y = torch.nn.functional.interpolate(x, **kwargs)
        z = torch.nn.functional.interpolate(x.cpu(), **kwargs)
        outputs_match = torch.allclose(y.cpu(), z)
        if not outputs_match:
           atol = (y.cpu() - z).abs().max()
           rtol = ((y.cpu() - z)[z!=0]/z[z!=0]).abs().max()
           print(f"atol={atol} rtol={rtol}")

    # Measure time manually
    start_time = time.time() * 1000
    for _ in range(1000):
        y = torch.nn.functional.interpolate(x, **kwargs)
    torch.mps.synchronize()
    end_time = time.time() * 1000
    manual_delta = (end_time - start_time)
    average_time = f"{manual_delta:6.1f}"

    return "True " if outputs_match else "False", average_time

brand_string = subprocess.check_output(['sysctl', '-n', 'machdep.cpu.brand_string']).decode("utf-8").strip()
for mode,antialias in itertools.product(["bilinear", "bicubic"], [False, True]):
    outputs_match_list = []
    average_time_list = []
    for device in ["mps", "cpu"]:
      for dtype in [torch.float32, torch.float16, torch.bfloat16]:
          outputs_match, average_time = benchmark(device, dtype, mode=mode, antialias=antialias)
          outputs_match_list.append(str(outputs_match))
          average_time_list.append(average_time)

    print(f"\nBenchmarking Results (collected on {brand_string}) for {mode} interpolation {'with antialias' if antialias else ''}:")
    print("-"*40)
    print("Device            :                MPS        |               CPU")
    print("Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16")
    print(f"Outputs Match     :  ", " |  ".join(outputs_match_list))
    print(f"Average Time (us) :", "  |".join(average_time_list))
```

Before
```
Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation :
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :  292.0  | 264.7  | 267.9  | 289.1  | 230.9  | 309.1
atol=1.430511474609375e-06 rtol=0.11363636702299118

Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation with antialias:
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   False |  False |  False |  True  |  None |  None
Average Time (us) :  698.3  | 684.2  | 683.8  | 851.0  |Skip  |Skip
atol=2.086162567138672e-06 rtol=0.019750799983739853

Benchmarking Results (collected on Apple M4 Pro) for bicubic interpolation :
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   False |  True  |  True  |  True  |  True  |  True
Average Time (us) :  314.3  | 301.0  | 298.8  | 681.5  | 616.7  | 833.7
```

After
```
Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation :
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :  119.9  |  98.9  |  98.6  | 289.8  | 231.9  | 308.5
atol=1.430511474609375e-06 rtol=0.05681818351149559

Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation with antialias:
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   False |  False |  False |  True  |  None |  None
Average Time (us) :  541.9  | 531.1  | 531.0  | 846.8  |Skip  |Skip
atol=2.0265579223632812e-06 rtol=0.008604463189840317

Benchmarking Results (collected on Apple M4 Pro) for bicubic interpolation :
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   False |  True  |  True  |  True  |  True  |  True
Average Time (us) :  314.3  | 301.0  | 298.8  | 681.5  | 616.7  | 833.7
```

TODO:
 - Figure out whether these ops make more sense as 3D jobs, with the n and c channels dispatched as one more dimension
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148277
Approved by: https://github.com/Skylion007
2025-03-02 17:13:52 +00:00
cyy
9aa897b992 Remove unnecessary tensor clone (#148159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148159
Approved by: https://github.com/Skylion007
2025-03-02 16:21:39 +00:00
1d7397a2d0 [Inductor] Avoid tensor slice overflow for large step (#147433)
Fixes #147071

Currently, if step is a value very close to INT64_MAX, the calculation of slice output length will overflow. This PR tries to fix this problem and thus fix #147071.
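
To illustrate the kind of overflow described above, here is a minimal sketch in plain Python. It assumes the slice length is computed with the common `(end - start + step - 1) / step` formula in 64-bit arithmetic; that formula, the `wrap_int64` helper, and the rewritten `safe_slice_len` are illustrative assumptions, not the actual Inductor code or the fix in this PR.

```python
INT64_MAX = 2**63 - 1


def wrap_int64(value):
    """Simulate two's-complement wraparound of a signed 64-bit integer."""
    return (value + 2**63) % 2**64 - 2**63


def naive_slice_len(start, end, step):
    # The intermediate sum exceeds INT64_MAX when step is huge, so it wraps.
    numerator = wrap_int64(end - start + step - 1)
    return numerator // step


def safe_slice_len(start, end, step):
    # Never adds step to anything, so no intermediate value can overflow
    # as long as 0 <= start, end <= INT64_MAX and step >= 1.
    if end <= start:
        return 0
    return 1 + (end - start - 1) // step


naive = naive_slice_len(0, 10, INT64_MAX)
safe = safe_slice_len(0, 10, INT64_MAX)
print(naive, safe)
```

A slice `[0:10:INT64_MAX]` selects exactly one element, which the second form reports while the wrapped computation does not.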

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147433
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-02 16:07:15 +00:00
9c506aa8a6 [aotinductor] add option to disable runtime assertions (#146462)
A recent user experience is like this:
* User runs AOTI lowering, it's successful.
* They take AOTI model and run it with some sample inputs. Everything runs well
* Then they boot up a serving test that loads the AOTI model and runs it with a set of sample requests.
* They see that some of the requests fail. The logs show them this:
* AOTInductorModel run failed with input spec: [1, 32]:c10::BFloat16, [2]:long ...
  * Error: u45 >= 2
* To the untrained eye, "AOTInductorModel run failed" is all they see. But, the true reason is Error: u45 >= 2

However, the assertion isn't always correct.
* In fact, u45 can actually be 0.
* So, why did AOTI say u45 ≥ 2? It's a two-piece combo:
* With 0/1 Specialization, the ShapeEnv creates symbolic shapes (e.g. s0) with a default value-range of [2, inf]
* In the graph, Dynamo traces torch.mul(A, B) where A is [s0, ...] and B is [u45, ...]. So, Dynamo learns Eq(s0, u45).
* Therefore, u45 also has a range of [2, inf]. Hence, the incorrect runtime assertion.

So, the motivation for this PR is to add an option to disable the runtime assertions, for when you run into a situation like this. Another way to avoid the problem is to call `mark_unbacked()` on all the dynamic dims.
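
The range reasoning above can be sketched with a toy model. The `ValueRange` class and `runtime_assert_passes` helper below are hypothetical stand-ins written for this illustration; they are not the ShapeEnv data structures, and the sketch only mirrors the argument that unifying `s0` with `u45` narrows `u45` to `[2, inf]`.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueRange:
    lower: int
    upper: float

    def intersect(self, other):
        """Range that holds when two symbols are known to be equal."""
        return ValueRange(max(self.lower, other.lower), min(self.upper, other.upper))


def runtime_assert_passes(value, value_range):
    return value_range.lower <= value <= value_range.upper


# Backed symbol under 0/1 specialization: default range starts at 2.
s0 = ValueRange(2, math.inf)
# Unbacked symbol: at runtime it can legitimately be 0.
u45 = ValueRange(0, math.inf)

# Tracing torch.mul(A, B) yields Eq(s0, u45), so both symbols share one range.
unified = s0.intersect(u45)

print(unified)                             # the range now enforced for u45
print(runtime_assert_passes(0, u45))       # True: 0 is fine for u45 alone
print(runtime_assert_passes(0, unified))   # False: the "u45 >= 2" failure
```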

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146462
Approved by: https://github.com/desertfire, https://github.com/22quinn
2025-03-02 09:14:58 +00:00
26358fa2d8 Add AppendingByteSerializer class (#148226)
This PR adds a new util class that enables efficient appending of sequential byte data with custom serialization and deserialization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148226
Approved by: https://github.com/aorenste
2025-03-02 08:20:58 +00:00
b59776d857 [import][inductor] Simplify grid handling (#147583)
Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg.  This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
Instead, the grid computation is now included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```

This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.
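
For readers unfamiliar with the `(xnumel + 1023) >> 10` expression in the launcher above: it reads as a ceiling division by 1024, presumably the block size of that particular config. The sketch below checks that equivalence in plain Python; the `grid_size` helper and its `xblock` parameter are names chosen for this illustration only.

```python
def grid_size(xnumel, xblock=1024):
    """Number of blocks needed to cover xnumel elements (ceiling division)."""
    return (xnumel + xblock - 1) // xblock


# The shift form used in the generated launcher, specialized to a block of 1024.
def grid_size_shift(xnumel):
    return (xnumel + 1023) >> 10


for n in (0, 1, 1023, 1024, 1025, 1_000_000):
    print(n, grid_size(n), grid_size_shift(n))
```

Because the block size is known per config, the launcher can bake in the constant instead of calling a generic grid function at launch time.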

It also allows us to unify the handling of grids between the Python and C++ wrapper code.  Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.

This unification allows this PR to be a net deletion of code.

Note the attached diff contains some minor fbcode-only changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147583
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-03-02 07:31:07 +00:00
dae3fbfe97 [c10d] Add hccl distributed backend to c10d data structures (#146478)
# MOTIVATION
Intel Gaudi is an out-of-tree PyTorch accelerator with its own device/dispatch key, ```hpu```.
With this change we add entries for Gaudi's distributed backend ```hccl``` to the c10d Backend data structures.
This is to ensure that there is no naming conflict in case a new in-tree accelerator is introduced with the same backend name.

Out-of-tree backends are registered by calling fd0cd6a08f/torch/distributed/distributed_c10d.py (L302)

Successful registration adds the backend name to the list:
fd0cd6a08f/torch/distributed/distributed_c10d.py (L265)

We bind the process group creator constructs at run time, so if another distributed backend uses the same device name, it can safely add the device type to the dictionary:

fd0cd6a08f/torch/distributed/distributed_c10d.py (L274)

It can also add another entry to the dictionary with the same backend name (but a different device name):
fd0cd6a08f/torch/distributed/distributed_c10d.py (L268)

In addition, out-of-tree devices can use the ```backend_list``` to check for successful backend registration, e.g. in APIs like ```is_hccl_available```.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146478
Approved by: https://github.com/H-Huang, https://github.com/guangyey
2025-03-02 05:13:48 +00:00
6e10471966 [ci] disable cudagraph for tts_angular on dashboard (#148221)
tts_angular with cudagraph is flaky. Its speedup varies from .05 to 1.01. This PR disables cudagraph for tts_angular to avoid the noise. Since tts_angular shows ~1x speedup while other torchbench models show ~2x speedup, skipping tts_angular would wrongly bump the cudagraph speedup. So this PR only disables cudagraph for tts_angular instead of skipping tts_angular.

[Dashboard](https://github.com/pytorch/pytorch/actions/runs/13597394087)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148221
Approved by: https://github.com/eellison
2025-03-02 03:31:19 +00:00
de7af81f18 [async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001)
Fixes https://github.com/pytorch/torchtitan/issues/864

## Summary
While testing torchtitan with float8 training with rowwise scaling + async TP, a [bug](https://github.com/pytorch/torchtitan/issues/864) was discovered. The symptom was the scaling factor dims did not match the dims of the tensor the scales were to be applied to.

My [root cause analysis](https://github.com/pytorch/torchtitan/issues/864#issuecomment-2672465060) determined the reason is that when async TP graph manipulation constructs the `fused_scaled_matmul_reduce_scatter` op, it does not yet handle the "reshape -> scaled mm -> reshape" pattern used in torchao [here](ed361ff5c7/torchao/float8/float8_linear.py (L122-L124)) - specifically when row-wise scales are being used.

## TL;DR of root cause
- When a Float8Tensor is reshaped, the scale is reshaped along with it so the dimensions are aligned.
- In the graph manipulation logic of the micropipeline TP post grad pass, the scaled_mm `A tensor` node is referencing the tensor _before_ the reshape op, but referencing the `A_scale` node _after_ the reshape op.

## Example
- Concrete example:
    - `A tensor` is a Float8Tensor with shape (1,8192,2048) and scale of shape (1,8192,1) when a matmul op is called in torchao [here](8706d3f3b0/torchao/float8/float8_linear.py (L70)). Torchao does a reshape -> scaled mm -> reshape [here](ed361ff5c7/torchao/float8/float8_linear.py (L122)). When a Float8Tensor is reshaped, its scale is reshaped along with it [here](8706d3f3b0/torchao/float8/float8_ops.py (L152)). So the first reshape makes the "A tensor" (1,8192,2048) => (8192,2048) and the scale (1,8192,1) => (8192,1).
    - During post grad pass in async TP:
        - `A_node` has shape (1,8192,2048) (tensor from before this [reshape](ed361ff5c7/torchao/float8/float8_linear.py (L122)))
        - `A_scale` has shape (8192,1) (due to reshape op above, which caused the scale to be reshaped from (1,8192,1) => (8192,1)).

## Solution

**Note:** the compiler inserts a `reciprocal` op after the reshape, so we can't simply use the node before the reshape as the `A_scale_node`, otherwise it will affect the numerics.

- Short-term solution: if the specific pattern shown below is detected, insert a reshape node after the reciprocal to reshape the reciprocal output back to the original shape from before the reshape.
    - reshape is just a view, so there should be no impact on performance
```
Before:
    reshape (a,b,c) to (a*b,c) -> reciprocal

After:
    reshape (a,b,c) to (a*b,c) -> reciprocal -> reshape (a*b,c) to (a,b,c)
```

- Long-term solution: implement a `torch._scaled_matmul` which can support 3D+ `A tensor`
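
As a shape-only illustration of the short-term fix, the sketch below computes the shape the rowwise scale has to be restored to so that it lines up with the un-reshaped `A tensor`. It works on plain tuples and uses a hypothetical helper, `restore_scale_shape`; it is not the graph pass from this PR, just the bookkeeping from the example above.

```python
from math import prod


def restore_scale_shape(a_shape, scale_shape):
    """Shape a rowwise scale needs so it matches a_shape's leading dims.

    a_shape:     shape of the A tensor before the reshape, e.g. (a, b, c)
    scale_shape: shape of the scale after the reshape, e.g. (a*b, 1)
    """
    *leading, _ = a_shape
    if prod(scale_shape) != prod(leading):
        raise ValueError(
            f"scale with shape {scale_shape} cannot hold one value per row of {a_shape}"
        )
    # One scale per row, with the leading dims of A restored.
    return (*leading, 1)


a_shape = (1, 8192, 2048)   # A_node, taken from before the reshape
scale_shape = (8192, 1)     # A_scale, taken from after the reshape
restored = restore_scale_shape(a_shape, scale_shape)
print(restored)
```

Since only the shape metadata changes, this matches the note that the extra reshape is just a view.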

## Test plan
- Added unit test which exercises this new path
- Manually tested with torchtitan with float8 rowwise + async TP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148001
Approved by: https://github.com/yifuwang
2025-03-02 03:25:28 +00:00
ce2f680e00 [fr] Added protection against missing stack frames in fr (#148203)
Summary: We have had failures for quite a while due to this unprotected access. https://fburl.com/scuba/ai_rca_debug_tracing/qtnb63qf

Test Plan:

Reviewed By: fduwjj

Differential Revision: D70358287

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148203
Approved by: https://github.com/fduwjj
2025-03-02 01:03:49 +00:00
19de523de6 [MPS] metal unary kernel for sqrt (#148272)
Issue #148219 highlighted the high dispatch times of ops that run with MPS Graph on smaller tensors. This PR rewrites sqrt as a Metal kernel to mitigate that issue.

## Speedups:

Matrix size means NxN matrix here.
![speedup_sqrt](https://github.com/user-attachments/assets/db0a705b-1a0e-42b4-bd42-4e7960415c81)

Code to generate the timings (requires building torch both before and after this change to collect the old and new times):
```python
import torch
import numpy as np
import time
import csv

matrix_sizes = [1, 100, 1000, 10_000]
num_runs = 1000
warmup_runs = 3

def run_sqrt(A):
    torch.mps.synchronize()
    start = time.perf_counter()
    c = torch.sqrt(A)
    torch.mps.synchronize()
    end = time.perf_counter()
    return c, end - start

results = {
    'N': [],
    'mean_time': [],
    'std_time': []
}

for n in matrix_sizes:
    print(f"\nBenchmarking N={n}")

    try:
        A_mps = torch.rand((n, n), dtype=torch.float32, device="mps")

        for _ in range(warmup_runs):
            _, _ = run_sqrt(A_mps)

        times = []
        for _ in range(num_runs):
            _, t = run_sqrt(A_mps)
            times.append(t)

        mean_time = np.mean(times)
        std_time = np.std(times)

        results['N'].append(n)
        results['mean_time'].append(mean_time)
        results['std_time'].append(std_time)

        print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

    except RuntimeError as e:
        print(f"Error for N={n}: {e}")
        continue

with open('sqrt_benchmark_times_new.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148272
Approved by: https://github.com/malfet
2025-03-02 00:45:45 +00:00
1a6883759d Fix macro for bit_cast in c10/util/bit_cast.h - one line change (#148265)
Fixes #148263.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148265
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-03-01 20:55:31 +00:00
1919e0de9a Revert "stage 1 of depreate silent fallback of tuning gemm (#147798)"
This reverts commit 297c00264e54cfb192f289e23a41775b81cb9cb8.

Reverted https://github.com/pytorch/pytorch/pull/147798 on behalf of https://github.com/wdvr due to failing internal builds, discussed with author ([comment](https://github.com/pytorch/pytorch/pull/147798#issuecomment-2692390551))
2025-03-01 20:04:23 +00:00
82603fd7d2 introduce dynamism library (#147981)
This is the first step in supporting delayed compile. This library takes in example inputs and outputs a dict of dynamism across the inputs. We will use this to detect dynamism across multiple inputs in delayed compile. We will also use this to make shape collections more ergonomic by providing an affordance to generate a shape collection using example inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147981
Approved by: https://github.com/pianpwk, https://github.com/wdvr
2025-03-01 19:57:54 +00:00
5301710b15 [codemod] Fix unused-value issue in caffe2/aten/src/ATen/cuda/detail/CUDAHooks.cpp +4 (#147555)
Summary:
LLVM has a warning `-Wunused-value` which we treat as an error because it's so often diagnostic of a code issue. Unused values often indicate a programming mistake, but can also just be unnecessary cruft that harms readability and performance.

For questions/comments, contact r-barnes.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Differential Revision: D69945678

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147555
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-03-01 19:46:13 +00:00
0ff2e6a85a Fix None and equal_to_1 arguments issue in Triton kernel generated by AOTI (#148102)
Summary:
When a Triton kernel has arguments with None values followed by arguments with value 1, AOTI attempts to remove the None arguments and update the indices of the equal_to_1 arguments in triton_meta["configs"]. However, if the same kernel is called multiple times, this optimization process is repeated. Prior to this diff, the indices of equal_to_1 arguments from subsequent calls (second and later) were based on the updated indices from the previous call, resulting in incorrect behavior.
This diff aims to localize the updated indices for equal_to_1 arguments within the optimization process of the current call, ensuring accurate and consistent results.
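
The failure mode described above can be sketched in plain Python. The `remap_equal_to_1` helper below is hypothetical and only models the idea of dropping `None` arguments and shifting the `equal_to_1` indices locally, without writing the shifted indices back into shared state; it is not the AOTI implementation.

```python
def remap_equal_to_1(args, equal_to_1):
    """Drop None args and shift equal_to_1 indices, leaving the inputs untouched."""
    none_positions = [i for i, arg in enumerate(args) if arg is None]
    kept_args = [arg for arg in args if arg is not None]
    # Each index moves left by the number of None args that preceded it.
    shifted = tuple(
        i - sum(1 for pos in none_positions if pos < i) for i in equal_to_1
    )
    return kept_args, shifted


# Shared metadata for one kernel: the argument at position 2 equals 1.
kernel_equal_to_1 = (2,)
call_args = ["in_ptr", None, 1, "out_ptr"]

# Two calls to the same kernel both start from the original indices.
first = remap_equal_to_1(call_args, kernel_equal_to_1)
second = remap_equal_to_1(call_args, kernel_equal_to_1)
print(first, second, kernel_equal_to_1)
```

If the first call had instead overwritten `kernel_equal_to_1` with its shifted result, the second call would shift an already-shifted index and point at the wrong argument.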

Test Plan:
Unit Test:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r test_triton_kernel_with_none_inputs_and_equal_to_1_arg
```

Differential Revision: D69998314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148102
Approved by: https://github.com/davidberard98, https://github.com/chenyang78
2025-03-01 18:38:33 +00:00
2b86309da3 separate f16 vectorized class from bf16 (#146596)
Separating the f16 vectorized class into a different file from the bf16 vectorized class in order to be able to add a new bf16 SVE vectorized class in https://github.com/pytorch/pytorch/pull/143666. This is required as we would need to exclude the current bf16 class in order to use the sve bf16 class but still include the current f16 vectorized class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146596
Approved by: https://github.com/malfet
2025-03-01 18:22:32 +00:00
8e004865dd Revert "introduce dynamism library (#147981)"
This reverts commit 1c1bf410ecdeac8d240e15bf8c33c0f00fab0673.

Reverted https://github.com/pytorch/pytorch/pull/147981 on behalf of https://github.com/malfet due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/147981#issuecomment-2692351017))
2025-03-01 18:16:52 +00:00
a983b2b11a Revert "Initial implementation of host memory stats (#147660)"
This reverts commit 945e359fc1afe6c0bb6129ed9607b237fa19cd98.

Reverted https://github.com/pytorch/pytorch/pull/147660 on behalf of https://github.com/mradmila due to There is an issue with ambiguous definition of Stat structure when different C++ tools are used. Backing out for now. ([comment](https://github.com/pytorch/pytorch/pull/147660#issuecomment-2692346379))
2025-03-01 18:05:45 +00:00
d23051f29b [Inductor] Support parallel reduction for GroupNorm (#144020)
Summary:
Support parallel reduction for GroupNorm by optimizing the parallelization heuristics: When the range of the first inner loop is much larger than the range of all outer loops, change the starting depth of parallelization to the first inner loop.
I tested the Inductor benchmark with this PR on CPU. One torchbench model (pytorch_CycleGAN_and_pix2pix) achieved ~45% performance improvement, and two diffusion models (Stable Diffusion and Latent Consistency Model (LCM)) achieved ~2% performance improvement.
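
The idea of parallelizing the reduction loop itself can be sketched in plain Python: each worker accumulates a private Welford state over its share of the inner loop, and the partial states are combined afterwards. The `Welford` class and helpers below are a simplified model written for this note (the worker loop runs sequentially and `NUM_THREADS` is an arbitrary choice); they are not the Inductor C++ helpers.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Welford:
    mean: float = 0.0
    m2: float = 0.0
    count: int = 0


def welford_add(acc, x):
    """Fold one sample into a running (mean, m2, count) state."""
    count = acc.count + 1
    delta = x - acc.mean
    mean = acc.mean + delta / count
    return Welford(mean, acc.m2 + delta * (x - mean), count)


def welford_combine(a, b):
    """Merge two partial states computed over disjoint samples."""
    if a.count == 0:
        return b
    if b.count == 0:
        return a
    count = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / count
    m2 = a.m2 + b.m2 + delta * delta * a.count * b.count / count
    return Welford(mean, m2, count)


NUM_THREADS = 4  # arbitrary for this sketch
data = [float((7 * i) % 13) for i in range(1000)]

# Each "thread" reduces its own slice into a private accumulator.
partials = [Welford() for _ in range(NUM_THREADS)]
for tid in range(NUM_THREADS):
    for x in data[tid::NUM_THREADS]:
        partials[tid] = welford_add(partials[tid], x)

# The partial accumulators are merged after the parallel region.
total = Welford()
for partial in partials:
    total = welford_combine(total, partial)

variance = total.m2 / total.count
print(total.mean, variance)
```

When the outer loops are tiny (2 x 2 in the example that follows) and the inner loop is long, splitting the inner loop like this keeps every worker busy.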

Example:
```
import torch
import torch.nn as nn

class GN(nn.Module):
    def __init__(self, num_groups, num_channels):
        super(GN, self).__init__()
        self.gn = nn.GroupNorm(num_groups, num_channels)

    def forward(self, x):
        return self.gn(x)

x = torch.randn(2, 64, 168, 168).to(memory_format=torch.channels_last)
m = GN(2, 64).eval()
compiled_m = torch.compile(m)

with torch.no_grad():
    out = compiled_m(x)
```

Generated code:

- Before:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2,
                       float* out_ptr3,
                       float* out_ptr4)
{
    #pragma omp parallel num_threads(56)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(56448L));
                        for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(28224L); x2+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(32L); x3+=static_cast<int64_t>(16L))
                            {
                                {
                                    if(C10_LIKELY(x3 >= static_cast<int64_t>(0) && x3 < static_cast<int64_t>(32L)))
                                    {
                                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + 32L*x1 + 64L*x2 + 1806336L*x0), static_cast<int64_t>(16));
                                        tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                                    }
                                }
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<int64_t>(x1 + 2L*x0)] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<int64_t>(x1 + 2L*x0)] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        #pragma omp single
        {
            {
                #pragma GCC ivdep
                for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
                {
                    #pragma GCC ivdep
                    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2L); x1+=static_cast<int64_t>(1L))
                    {
                        for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(32L); x2+=static_cast<int64_t>(16L))
                        {
                            {
                                if(C10_LIKELY(x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(32L)))
                                {
                                    auto tmp0 = out_ptr1[static_cast<int64_t>(x1 + 2L*x0)];
                                    auto tmp6 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x2 + 32L*x1), static_cast<int64_t>(16));
                                    auto tmp9 = out_ptr0[static_cast<int64_t>(x1 + 2L*x0)];
                                    auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x2 + 32L*x1), static_cast<int64_t>(16));
                                    auto tmp1 = static_cast<float>(903168.0);
                                    auto tmp2 = tmp0 / tmp1;
                                    auto tmp3 = static_cast<float>(1e-05);
                                    auto tmp4 = decltype(tmp2)(tmp2 + tmp3);
                                    auto tmp5 = 1 / std::sqrt(tmp4);
                                    auto tmp7 = at::vec::Vectorized<float>(tmp5);
                                    auto tmp8 = tmp7 * tmp6;
                                    auto tmp10 = decltype(tmp9)(-tmp9);
                                    auto tmp11 = at::vec::Vectorized<float>(tmp10);
                                    auto tmp12 = tmp11 * tmp8;
                                    auto tmp14 = tmp12 + tmp13;
                                    tmp8.store(out_ptr2 + static_cast<int64_t>(x2 + 32L*x1 + 64L*x0));
                                    tmp14.store(out_ptr3 + static_cast<int64_t>(x2 + 32L*x1 + 64L*x0));
                                }
                            }
                        }
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(28224L); x1+=static_cast<int64_t>(1L))
                {
                    for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(64L); x2+=static_cast<int64_t>(16L))
                    {
                        {
                            if(C10_LIKELY(x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(64L)))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x2 + 64L*x1 + 1806336L*x0), static_cast<int64_t>(16));
                                auto tmp1 = at::vec::Vectorized<float>::loadu(out_ptr2 + static_cast<int64_t>(x2 + 64L*x0), static_cast<int64_t>(16));
                                auto tmp3 = at::vec::Vectorized<float>::loadu(out_ptr3 + static_cast<int64_t>(x2 + 64L*x0), static_cast<int64_t>(16));
                                auto tmp2 = tmp0 * tmp1;
                                auto tmp4 = tmp2 + tmp3;
                                tmp4.store(out_ptr4 + static_cast<int64_t>(x2 + 64L*x1 + 1806336L*x0));
                            }
                        }
                    }
                }
            }
        }
    }
}
''')
```

- After:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2,
                       float* out_ptr3,
                       float* out_ptr4)
{
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
        {
            #pragma GCC ivdep
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2L); x1+=static_cast<int64_t>(1L))
            {
                {
                    Welford<float> tmp_acc0 = Welford<float>();
                    Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                    Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                    Welford<at::vec::Vectorized<float>> tmp_acc0_vec_arr[56];
                    for (int i = 0; i < 56; i++)
                    {
                        tmp_acc0_vec_arr[i] = Welford<at::vec::Vectorized<float>>();
                    }
                    Welford<float> tmp_acc0_arr[56];
                    for (int i = 0; i < 56; i++)
                    {
                        tmp_acc0_arr[i] = Welford<float>();
                    }
                    Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec_arr[56];
                    for (int i = 0; i < 56; i++)
                    {
                        masked_tmp_acc0_vec_arr[i] = Welford<at::vec::Vectorized<float>>();
                    }
                    #pragma omp parallel num_threads(56)
                    {
                        int tid = omp_get_thread_num();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(1008L));
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec_local = Welford<at::vec::Vectorized<float>>();
                        Welford<float> tmp_acc0_local = Welford<float>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec_local = Welford<at::vec::Vectorized<float>>();
                        #pragma omp for
                        for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(28224L); x2+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(32L); x3+=static_cast<int64_t>(16L))
                            {
                                {
                                    if(C10_LIKELY(x3 >= static_cast<int64_t>(0) && x3 < static_cast<int64_t>(32L)))
                                    {
                                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + 32L*x1 + 64L*x2 + 1806336L*x0), static_cast<int64_t>(16));
                                        tmp_acc0_vec_local = welford_combine(tmp_acc0_vec_local, tmp0, &wrecps0);
                                    }
                                }
                            }
                        }
                        tmp_acc0_vec_arr[tid] = tmp_acc0_vec_local;
                        tmp_acc0_arr[tid] = tmp_acc0_local;
                        masked_tmp_acc0_vec_arr[tid] = masked_tmp_acc0_vec_local;
                    }
                    for (int tid = 0; tid < 56; tid++)
                    {
                        tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp_acc0_vec_arr[tid]);
                    }
                    for (int tid = 0; tid < 56; tid++)
                    {
                        tmp_acc0 = welford_combine(tmp_acc0, tmp_acc0_arr[tid]);
                    }
                    for (int tid = 0; tid < 56; tid++)
                    {
                        masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, masked_tmp_acc0_vec_arr[tid]);
                    }
                    tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                    tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                    out_ptr0[static_cast<int64_t>(x1 + 2L*x0)] = static_cast<float>(tmp_acc0.mean);
                    out_ptr1[static_cast<int64_t>(x1 + 2L*x0)] = static_cast<float>(tmp_acc0.m2);
                }
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
        {
            #pragma GCC ivdep
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2L); x1+=static_cast<int64_t>(1L))
            {
                for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(32L); x2+=static_cast<int64_t>(16L))
                {
                    {
                        if(C10_LIKELY(x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(32L)))
                        {
                            auto tmp0 = out_ptr1[static_cast<int64_t>(x1 + 2L*x0)];
                            auto tmp6 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x2 + 32L*x1), static_cast<int64_t>(16));
                            auto tmp9 = out_ptr0[static_cast<int64_t>(x1 + 2L*x0)];
                            auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x2 + 32L*x1), static_cast<int64_t>(16));
                            auto tmp1 = static_cast<float>(903168.0);
                            auto tmp2 = tmp0 / tmp1;
                            auto tmp3 = static_cast<float>(1e-05);
                            auto tmp4 = decltype(tmp2)(tmp2 + tmp3);
                            auto tmp5 = 1 / std::sqrt(tmp4);
                            auto tmp7 = at::vec::Vectorized<float>(tmp5);
                            auto tmp8 = tmp7 * tmp6;
                            auto tmp10 = decltype(tmp9)(-tmp9);
                            auto tmp11 = at::vec::Vectorized<float>(tmp10);
                            auto tmp12 = tmp11 * tmp8;
                            auto tmp14 = tmp12 + tmp13;
                            tmp8.store(out_ptr2 + static_cast<int64_t>(x2 + 32L*x1 + 64L*x0));
                            tmp14.store(out_ptr3 + static_cast<int64_t>(x2 + 32L*x1 + 64L*x0));
                        }
                    }
                }
            }
        }
    }
    #pragma omp parallel num_threads(56)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(28224L); x1+=static_cast<int64_t>(1L))
                {
                    for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(64L); x2+=static_cast<int64_t>(16L))
                    {
                        {
                            if(C10_LIKELY(x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(64L)))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x2 + 64L*x1 + 1806336L*x0), static_cast<int64_t>(16));
                                auto tmp1 = at::vec::Vectorized<float>::loadu(out_ptr2 + static_cast<int64_t>(x2 + 64L*x0), static_cast<int64_t>(16));
                                auto tmp3 = at::vec::Vectorized<float>::loadu(out_ptr3 + static_cast<int64_t>(x2 + 64L*x0), static_cast<int64_t>(16));
                                auto tmp2 = tmp0 * tmp1;
                                auto tmp4 = tmp2 + tmp3;
                                tmp4.store(out_ptr4 + static_cast<int64_t>(x2 + 64L*x1 + 1806336L*x0));
                            }
                        }
                    }
                }
            }
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144020
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2025-03-01 17:11:50 +00:00
cyy
8bf3920279 Remove unneeded Clang-tidy suppression (#148246)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148246
Approved by: https://github.com/Skylion007
2025-03-01 16:51:54 +00:00
1c1bf410ec introduce dynamism library (#147981)
This is the first step in supporting delayed compile. This library takes in example inputs and outputs a dict of dynamism across the inputs. We will use this to detect dynamism across multiple inputs in delayed compile. We will also use this to make shape collections more ergonomic by providing an affordance to generate a shape collection using example inputs.
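
As a rough illustration of the idea, the sketch below compares the shapes of several example inputs and reports which dimensions vary between them. The function name `detect_dynamism` and the layout of the returned dict are hypothetical and are not the API added by this PR.

```python
def detect_dynamism(examples):
    """Return {input_name: (bool, ...)} marking dims that differ across examples.

    `examples` is a list of dicts mapping input names to shape tuples.
    Hypothetical helper for illustration only.
    """
    result = {}
    for name in examples[0]:
        shapes = [example[name] for example in examples]
        # A dim counts as dynamic if it takes more than one value across examples.
        result[name] = tuple(len(set(values)) > 1 for values in zip(*shapes))
    return result


examples = [
    {"x": (8, 128), "y": (128, 64)},
    {"x": (16, 128), "y": (128, 64)},
]
dynamism = detect_dynamism(examples)
print(dynamism)
```

Only the first dimension of `x` changes between the two examples, so it is the only one flagged as dynamic.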

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147981
Approved by: https://github.com/pianpwk
2025-03-01 14:57:06 +00:00
3a0c9f7f9d [MPS] Fix SDPA crash (#148239)
If the operation is invoked with a mask twice, it crashes, because the mask expansion logic was implemented inside the cache creation block, which is executed only once for all shapes.

Fixes https://github.com/pytorch/pytorch/issues/148194 which is a regression introduced by https://github.com/pytorch/pytorch/pull/147545
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148239
Approved by: https://github.com/dcci
2025-03-01 13:06:51 +00:00
735d7b1af6 [EZ][BE] Increase tolerances for interpolate op (#148224)
Not sure why tolerances were set like that, this logic was added in https://github.com/pytorch/pytorch/pull/104181 without much explanation
But if I'm to make a guess, it's likely due to the inaccuracy of bilinear op, that has since been replaced by shader
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148224
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #148154, #148187, #148211
2025-03-01 13:03:59 +00:00
762724f3d0 [Break XPU][Inductor] Generalize device-bias code and fix test_graph_partition for XPU (#148178)
This PR generalizes the device-bias code introduced by #147038 and aligns the behavior of XPU and CUDA on the add + mm + pointwise pattern (for XPU, from addmm + pointwise to mm + fused_add_pointwise), which fixes the failing test case `test_graph_partition` on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148178
Approved by: https://github.com/benjaminglass1, https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #148155
2025-03-01 10:59:55 +00:00
ab78bf5c66 [Break XPU][Inductor UT] Avoid custom op registration conflicts in test_auto_functionalize.py. (#148155)
Fix #148148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148155
Approved by: https://github.com/jansel, https://github.com/EikanWang
2025-03-01 10:59:55 +00:00
2f1b8e0fe2 [DTensor][Test] Add a test to demonstrate current dtensor view behavior if redistribution happens (#148015)
This does not fix the view op issue when redistribution happens. We want to add a test to demonstrate/record the issue, in which the distributed behavior does not match up with single device behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148015
Approved by: https://github.com/XilunWu
2025-03-01 10:24:40 +00:00
191c9bd013 Revert "[async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001)"
This reverts commit b8efebe57d05a87be5b0f304218d2af7bb2bf6c6.

Reverted https://github.com/pytorch/pytorch/pull/148001 on behalf of https://github.com/davidberard98 due to looks like another lint error ([comment](https://github.com/pytorch/pytorch/pull/148001#issuecomment-2692042859))
2025-03-01 07:43:58 +00:00
fe3b9e3764 [Inductor] optimize the heuristics of outer loop fusion (#147523)
Summary:
Optimize the heuristics of outer loop fusion: When the range of the first inner loop is much larger than the range of all outer loops, do not fuse the outer loops and fallback to standard codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147523
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2025-03-01 06:50:04 +00:00
b8efebe57d [async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001)
Fixes https://github.com/pytorch/torchtitan/issues/864

## Summary
While testing torchtitan with float8 training with rowwise scaling + async TP, a [bug](https://github.com/pytorch/torchtitan/issues/864) was discovered. The symptom was the scaling factor dims did not match the dims of the tensor the scales were to be applied to.

My [root cause analysis](https://github.com/pytorch/torchtitan/issues/864#issuecomment-2672465060) determined the reason is that when async TP graph manipulation constructs the `fused_scaled_matmul_reduce_scatter` op, it does not yet handle the "reshape -> scaled mm -> reshape" pattern used in torchao [here](ed361ff5c7/torchao/float8/float8_linear.py (L122-L124)) - specifically when row-wise scales are being used.

## TL;DR of root cause
- When a Float8Tensor is reshaped, the scale is reshaped along with it so the dimensions are aligned.
- In the graph manipulation logic of the micropipeline TP post grad pass, the scaled_mm `A tensor` node is referencing the tensor from _before_ the reshape op, but referencing the `A_scale` node from _after_ the reshape op.

## Example
- Concrete example:
    - `A tensor` is a Float8Tensor with shape (1,8192,2048) and scale of shape (1,8192,1) when a matmul op is called in torchao [here](8706d3f3b0/torchao/float8/float8_linear.py (L70)). Torchao does a reshape -> scaled mm -> reshape [here](ed361ff5c7/torchao/float8/float8_linear.py (L122)). When a Float8Tensor is reshaped, its scale is reshaped along with it [here](8706d3f3b0/torchao/float8/float8_ops.py (L152)). So the first reshape makes the "A tensor" (1,8192,2048) => (8192,2048) and the scale (1,8192,1) => (8192,1).
    - During post grad pass in async TP:
        - `A_node` has shape (1,8192,2048) (tensor from before this [reshape](ed361ff5c7/torchao/float8/float8_linear.py (L122)))
        - `A_scale` has shape (8192,1) (due to reshape op above, which caused the scale to be reshaped from (1,8192,1) => (8192,1)).

## Solution

**Note:** the compiler inserts a `reciprocal` op after the reshape, so we can't simply use the node before the reshape as the `A_scale_node`, otherwise it will affect the numerics.

- Short-term solution: if the specific pattern shown below is detected, insert a reshape node after the reciprocal, to reshape the reciprocal output back to the original shape from before the reshape.
    - reshape is just a view, so there should be no impact on performance
```
Before:
    reshape (a,b,c) to (a*b,c) -> reciprocal

After:
    reshape (a,b,c) to (a*b,c) -> reciprocal -> reshape (a*b,c) to (a,b,c)
```

- Long-term solution: implement a `torch._scaled_matmul` which can support 3D+ `A tensor`
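
The shape mismatch and the short-term fix can be sketched with plain shape arithmetic. The helper names below are hypothetical, and `rowwise_scale_matches` is a simplified stand-in for whatever check the real op performs; only the shapes come from the example above.

```python
from math import prod


def flatten_leading(shape):
    """(a, b, c) -> (a*b, c): the first reshape applied before the scaled mm."""
    return (prod(shape[:-1]), shape[-1])


def rowwise_scale_matches(tensor_shape, scale_shape):
    """Assumed rule: same rank, same leading dims, and a trailing dim of 1."""
    return (
        len(tensor_shape) == len(scale_shape)
        and tensor_shape[:-1] == scale_shape[:-1]
        and scale_shape[-1] == 1
    )


a_node_shape = (1, 8192, 2048)               # tensor from before the reshape
scale_before = (1, 8192, 1)                  # scale from before the reshape
scale_after = flatten_leading(scale_before)  # shape that feeds the reciprocal

# The post grad pass pairs the 3D tensor with the 2D scale: shapes disagree.
mismatch = not rowwise_scale_matches(a_node_shape, scale_after)

# Short-term fix: reshape the reciprocal output back to the original scale shape.
restored = scale_before
fixed = rowwise_scale_matches(a_node_shape, restored)
print(scale_after, mismatch, fixed)
```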

## Test plan
- Added unit test which exercises this new path
- Manually tested with torchtitan with float8 rowwise + async TP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148001
Approved by: https://github.com/yifuwang
2025-03-01 06:38:39 +00:00
fd16311e7f [inductor][subgraph] Plumbing to get ShapeAsConstantBuffer from subgraph to main graph output (#147559)
I am unable to create a test case that fails without the next PR. The idea is to have a symint which is returned by the inner subgraph and then returned by the forward graph after partitioning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147559
Approved by: https://github.com/eellison
2025-03-01 06:17:11 +00:00
c87097e74a [triton 3.3] Fix inductor/test_profiler.py test (#148230)
test_inductor_profiling_kernel_names_pointwise is checking that the profiler correctly records the input shapes to the kernel. After triton 3.3, we get a different number of args (because the constexpr args are passed in, from the python perspective). This just patches the test to pass in either case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148230
Approved by: https://github.com/drisspg, https://github.com/YUNQIUGUO
2025-03-01 04:27:49 +00:00
9377a32cd1 [Inductor][NFC] Remove unused functions from compile_tasks.py (#147564)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147564
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
2025-03-01 03:44:43 +00:00
baf1c8fcdc Revert "introduce dynamism library (#147981)"
This reverts commit 6eff6b28e4d09cbf632f79502a8e317bf5b53c34.

Reverted https://github.com/pytorch/pytorch/pull/147981 on behalf of https://github.com/wdvr due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/147981#issuecomment-2691906065))
2025-03-01 03:43:01 +00:00
493cd97af5 add skips to test_notifies_oom and test_set_per_process_memory_fraction (#148134)
Tests fail in NVIDIA internal CI since we do not support nvml on Jetson, but nvml is required for OOM reporting to work properly, so we are skipping the failing tests for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148134
Approved by: https://github.com/eqy
2025-03-01 02:59:48 +00:00
6eff6b28e4 introduce dynamism library (#147981)
This is the first step in supporting delayed compile. This library takes in example inputs and outputs a dict of dynamism across the inputs. We will use this to detect dynamism across multiple inputs in delayed compile. We will also use this to make shape collections more ergonomic by providing an affordance to generate a shape collection using example inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147981
Approved by: https://github.com/pianpwk
2025-03-01 02:49:16 +00:00
08434df1f2 [MPS] fix empty place holder error for smooth l1 loss (#148133)
Fixes #123171

And parametrizes the tests for it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148133
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-01 02:32:45 +00:00
02c5f21541 [Inductor] fix AOTInductorTestABICompatibleGpu.test_triton_kernel_weird_param_order with new Triton (#148011)
In this case, the parameters have already been filtered [here](201666d77d/torch/_inductor/codegen/cpp_wrapper_gpu.py (L335)) and subsequent filtering is not only unnecessary, it breaks the code, since the positions of the parameters change after filtering. For this test, for example, the second filtering discarded `buf0`.

For example:
```python
(Pdb) triton_meta["signature"]
{'in_ptr0': '*fp32', 'in_ptr1': '*fp32', 'n_elements': 'i32', 'BLOCK_SIZE': 'constexpr', 'out_ptr': '*fp32'}
(Pdb) call_args
['arg0_1', 'arg0_1', '256L', 'buf0']
```
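
To see why a second filtering pass drops `buf0`, consider the sketch below. The function `drop_constexpr_by_position` is a hypothetical stand-in for the redundant filter, not the actual code in `cpp_wrapper_gpu.py`; the signature and call arguments are copied from the debugger output above.

```python
signature = {
    "in_ptr0": "*fp32",
    "in_ptr1": "*fp32",
    "n_elements": "i32",
    "BLOCK_SIZE": "constexpr",
    "out_ptr": "*fp32",
}
# Already filtered once: the constexpr BLOCK_SIZE argument is no longer present.
call_args = ["arg0_1", "arg0_1", "256L", "buf0"]


def drop_constexpr_by_position(args, signature):
    """Hypothetical second filter that pairs args with the unfiltered signature."""
    kinds = list(signature.values())
    return [arg for arg, kind in zip(args, kinds) if kind != "constexpr"]


refiltered = drop_constexpr_by_position(call_args, signature)
print(refiltered)
```

Because the argument list has already lost one entry, position 3 now holds `buf0` while the signature still lists `BLOCK_SIZE` there, so the output buffer is discarded.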

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148011
Approved by: https://github.com/davidberard98
2025-03-01 01:21:20 +00:00
338ed67a1e [inductor] Implement max_pool2d_with_indices as a reduction for large window sizes (#147876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147876
Approved by: https://github.com/eellison
2025-03-01 01:07:01 +00:00
230a3b0f83 Add cuda 11.8 guard for cufile preload (#148184)
Follow up after https://github.com/pytorch/pytorch/pull/148137
Make sure we don't try to load cufile on CUDA 11.8

Test:
```
>>> import torch
/usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
>>> torch.__version__
'2.7.0.dev20250227+cu118'
>>>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148184
Approved by: https://github.com/mikaylagawarecki
2025-03-01 01:01:04 +00:00
2544afaa1a [DeviceMesh] Add some documentation for from_group API and add a 2D test (#146364)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146364
Approved by: https://github.com/fduwjj
2025-03-01 00:57:37 +00:00
5d297f7a34 [MPS][BE] Combine two upsample_kernel_out_template into one (#148211)
- First, stop inverting sizes and strides, i.e. pass them as is, but read them in inverse order in the shader, as the 1st stride of a 4D tensor is the one used for batches, the 2nd for channels, and the 3rd and 4th for spatial coordinates
- Pass `scales` as float2 even in linear tensor
The above allows one to collapse the two flavors of `upsample_kernel_out_template` into one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148211
Approved by: https://github.com/dcci
ghstack dependencies: #148154, #148187
2025-03-01 00:39:26 +00:00
clr
83fb974b5d scriptfunction: Make sure we have valid __name__ and __qualname__ (#147906)
It's not fully clear why these are not being created, but you can definitely
    reproduce this in code. `__name__` is fun, since there appears to be no way to
    explicitly set it on the pybind11 layer or c++ layer. I've set this in the
    python wrapper code (which works correctly). But let me know if people feel
    strongly and want us to go explicitly cast to python within the cpp functions
    and set it there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147906
Approved by: https://github.com/jansel
ghstack dependencies: #147894
2025-02-28 23:25:47 +00:00
1ae7cc41ca Define __all__ for torch.utils.tensorboard (#147550)
Fixes the issue:

```python
import torch.utils.tensorboard
torch.utils.tensorboard.FileWriter  # pyright: "FileWriter" is not exported from module "torch.utils.tensorboard"
torch.utils.tensorboard.RecordWriter  # pyright: "RecordWriter" is not exported from module "torch.utils.tensorboard"
torch.utils.tensorboard.SummaryWriter  # pyright: "SummaryWriter" is not exported from module "torch.utils.tensorboard"
```

The [docs page for `torch.utils.tensorboard`](https://pytorch.org/docs/stable/tensorboard.html)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147550
Approved by: https://github.com/albanD
2025-02-28 23:06:11 +00:00
3a69dee955 [Submodule][FlashAttention] Bump to 2.7.4 (#148147)
# Summary
This makes me happy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148147
Approved by: https://github.com/Skylion007
2025-02-28 22:40:02 +00:00
83ec7cdcd4 Fix recompile reason logging (#148200)
for the following test case

```
        @torch.compile(dynamic=False, backend=cnts)
        def fn(x, y, z):
            return x * y * z[0]

        fn(1, torch.randn(1), {0: torch.randn(1)})
        fn(2, torch.randn(2), {0: torch.randn(2)})
        fn(3, torch.randn(3), {0: torch.randn(3)})
        fn(4, torch.randn(4), {0: torch.randn(4)})
        fn(5, torch.randn(5), {0: torch.randn(5)})
```

previously we would log

```
0/0: L['x'] == 1
0/0: L['x'] == 1
0/0: L['x'] == 1
0/0: L['x'] == 1
```

but after this change we now log

```
0/0: L['x'] == 1
0/1: L['x'] == 2
0/2: L['x'] == 3
0/3: L['x'] == 4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148200
Approved by: https://github.com/xmfan
2025-02-28 22:33:37 +00:00
40b3e4a358 [dynamo] expose code execution strategy to python (#148020)
@anijain2305 this can be used to mark a code object to be skipped/run-only (recursively) while tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148020
Approved by: https://github.com/jansel
2025-02-28 21:59:12 +00:00
e74fdbe6d0 [inductor] ignore block ptr advancements for removed buffers (#148087)
Follow up to https://github.com/pytorch/pytorch/pull/147193. Some buffers are removed only when the kernel context is exited so defer the lines instead.

Added `use_block_ptr` as a parameter to test case that fails if run with block ptrs enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148087
Approved by: https://github.com/jansel, https://github.com/eellison
2025-02-28 21:31:15 +00:00
d174562487 [MPS][BE][EZ] Aggregate macros (#148187)
Refactor `INSTANTIATE_UPSAMPLE_BILINEAR2D(DTYPE)`, `INSTANTIATE_UPSAMPLE_BICUBIC2D(DTYPE)` and `INSTANTIATE_UPSAMPLE_BILINEAR2DAA(DTYPE)` to use the common `INSTANTIATE_UPSAMPLE2D`
Then combine multiple invocations into `INSTANTIATE_UPSAMPLE_ALL`

I.e. functionally it's a no-op, but achieves the same with fewer lines of code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148187
Approved by: https://github.com/Skylion007
ghstack dependencies: #148154
2025-02-28 21:30:00 +00:00
4995e058bf [user-triton] handle inline_asm_case (#148043)
Summary: We currently fail the mutation analysis for all inline_asm ops. In this diff, we handle the case when "is_pure" is set to True, since it indicates the operation doesn't mutate the input value.
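
A minimal sketch of the decision, assuming a conservative analysis that treats an opaque op as possibly mutating every operand. The function `mutated_operands` and its handling of other ops are hypothetical and are not the code in `triton_kernel_wrap.py`.

```python
def mutated_operands(op_name, operands, is_pure=False):
    """Return the operands an op may mutate (hypothetical, conservative)."""
    if op_name == "tt.elementwise_inline_asm":
        # Pure inline asm has no side effects, so it mutates nothing.
        # Impure inline asm is opaque: assume any operand may be mutated.
        return [] if is_pure else list(operands)
    if op_name == "tt.store":
        # A store writes through its pointer operand.
        return [operands[0]]
    return []


print(mutated_operands("tt.elementwise_inline_asm", ["x", "y"], is_pure=True))
print(mutated_operands("tt.elementwise_inline_asm", ["x", "y"], is_pure=False))
```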

Test Plan:
../buck-out/v2/gen/fbcode/854b9ed00d28c5c5/caffe2/test/inductor/__triton_kernels__/triton_kernels.par --r test_mutations_inline_asm_kernel

```
test_mutations_inline_asm_kernel_is_pure_true (caffe2.test.inductor.test_triton_kernels.MutationTests) ... W0226 18:10:34.261000 1906801 /data/users/sijiac/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:656] TTIR mutation analysis: Skipping pure tt.elementwise_inline_asm op (is_pure=True)
ok

----------------------------------------------------------------------
Ran 2 tests in 0.706s

OK
```

Differential Revision: D69878591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148043
Approved by: https://github.com/zou3519
2025-02-28 20:52:51 +00:00
6f91720e1c [inductor][ck] manual kBatch heuristic (#148118)
Summary:
# Why

Leverage kBatch parameter for large splitK examples for CK for better than ATEN performance

# What

replace default kBatch = 1 with a manual heuristic

- if K > 16 * max (M,N)
- leverage k_per_block, and K and number of SMs on the chip
- upper bound to 128, lower bound to 1
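
The bullets above can be sketched as follows. Only the `K > 16 * max(M, N)` gate and the clamp to `[1, 128]` come from the description; the formula in the middle (K blocks divided by the SM count) and the example values for `k_per_block` and `num_sms` are assumptions made for illustration, not the heuristic in the PR.

```python
def pick_k_batch(m, n, k, k_per_block, num_sms):
    """Hypothetical kBatch heuristic for split-K GEMMs."""
    # Only split K when it dominates the other two dimensions.
    if k <= 16 * max(m, n):
        return 1
    # Assumed formula: how many K blocks each SM would otherwise process.
    k_blocks = k // k_per_block
    candidate = k_blocks // num_sms
    # Clamp to the documented bounds.
    return max(1, min(128, candidate))


# Shape from the test plan; k_per_block and num_sms are made-up example values.
k_batch = pick_k_batch(2048, 2048, 524288, k_per_block=64, num_sms=304)
print(k_batch)
```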

This is better than defaulting to 1, cheap to calculate, and shows performance beyond ATEN

This is of course subject to change and improvement

Test Plan:
with minor modifications to run torch.mm on the shape `M, N, K = 2048, 2048, 524288`

```
buck2 run -c fbcode.re_gpu_tests=False mode/opt-amd-gpu  fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0
```

```
AUTOTUNE mm(2048x524288, 524288x2048)
  rocm_ck_gemm_template_49 10.4972 ms 100.0%
  rocm_ck_gemm_template_8 10.6132 ms 98.9%
  rocm_ck_gemm_template_9 10.6907 ms 98.2%
[...]
  mm 18.9880 ms 55.3%
```

Reviewed By: ColinPeppler

Differential Revision: D70224591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148118
Approved by: https://github.com/ColinPeppler
2025-02-28 20:36:16 +00:00
48c55a66ec [ROCm] Move ROCm unstable MI300 jobs back to stable (#146675)
Fixes #145790
Needs #145504 to be merged first to resolve an artifact uploading issue with MI300 runners.

This PR moves rocm unstable MI300 back to stable. The change to unstable was introduced through this [PR](https://github.com/pytorch/pytorch/pull/145790). This was because the MI300s were failing with a [docker daemon](https://github.com/pytorch/pytorch/actions/runs/13015957622/job/36306779536) issue which has been resolved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146675
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-02-28 20:34:27 +00:00
6778084531 [inductor][cutlass] Environment variables for allow/denylist (#148161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148161
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-02-28 20:33:10 +00:00
5a1954eb93 [Inductor-CPU] Fix broken int8 WoQ GEMM AMX implementation in main (#147895)
#146843 broke int8 WoQ GEMM's (for BF16 activation) AMX ISA implementation in the main branch.
UT: `python test/inductor/test_cpu_select_algorithm.py -v -k woq`

The issue remained undetected because in case of templated kernel compilation failure, the auto-tuning infra marks its runtime as `inf`, and the op against which it was being benchmarked is used, so UTs didn't fail even on machines that support AMX ISA.

`test/inductor/test_cpu_select_algorithm.py` UTs checked the value of the `select_algorithm_autotune` counter, which only counts how many ops were selected for autotuning against their templated codegened counterparts.

@leslie-fang-intel advised using a new counter. I added `counters["inductor"]["cpp_templated_kernel_counter"]`, which is incremented after a codegened kernel's compilation, so it'd help catch breakage scenarios in which a templated kernel could not be codegened due to a compilation failure.
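
The following sketch shows, under simplified assumptions, how a failed template compilation can hide behind autotuning and how a separate counter exposes it. The `autotune` function and the tuple layout of `choices` are hypothetical; only the two counter names come from the description above.

```python
import math
from collections import Counter

counters = Counter()


def autotune(choices):
    """choices: list of (name, is_template, compile_ok, runtime_ms) tuples."""
    counters["select_algorithm_autotune"] += 1
    timings = {}
    for name, is_template, compile_ok, runtime_ms in choices:
        if is_template and not compile_ok:
            # The failure is recorded as an infinite runtime instead of raised.
            timings[name] = math.inf
            continue
        if is_template:
            # Incremented only after a templated kernel compiled successfully.
            counters["cpp_templated_kernel_counter"] += 1
        timings[name] = runtime_ms
    return min(timings, key=timings.get)


winner = autotune([
    ("aten_mm", False, True, 1.0),
    ("cpp_template", True, False, 0.5),  # compilation fails
])
print(winner, dict(counters))
```

A test that asserts only on `select_algorithm_autotune` passes here even though the template never compiled, while an assertion on `cpp_templated_kernel_counter` would fail.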

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147895
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2025-02-28 20:20:45 +00:00
clr
e0e516c554 Don't crash when we call __qualname__ on torch._C.ScriptFunction (#147894)
We've root-caused this to correctly throwing an attribute error on ScriptFunction
when missing attributes are accessed. This PR will fix the crashes that are showing
up. I'm going to stack a second PR to fix torch._C.ScriptFunction just being a
very badly behaving Python object (which should also fix this).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147894
Approved by: https://github.com/jansel
2025-02-28 20:15:38 +00:00
297c00264e stage 1 of deprecate silent fallback of tuning gemm (#147798)
Differential Revision: [D70045778](https://our.internmc.facebook.com/intern/diff/D70045778/)

context:
https://github.com/pytorch/pytorch/issues/147479

For the most part, this should not change the behavior.

For int_mm, I also removed
```
    # TODO: Re-enable eager mode implementation once cuBLAS is fixed
    if use_cutlass or use_triton_template(layout, enable_int32=True):
        choices = []
```
because I think it is unwanted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147798
Approved by: https://github.com/eellison
2025-02-28 19:51:55 +00:00
ebc3f27bf4 Revert "[async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001)"
This reverts commit 6e037ac41c095dfdb37fdd4b36bf8ec2ebf84bf1.

Reverted https://github.com/pytorch/pytorch/pull/148001 on behalf of https://github.com/wdvr due to lint error ([comment](https://github.com/pytorch/pytorch/pull/148001#issuecomment-2691421540))
2025-02-28 19:44:54 +00:00
42aeb5d259 Resolve zip file permission issue when uploading artifacts on ROCm MI300 CI runners (#145504)
E.g.: https://github.com/pytorch/pytorch/actions/runs/13500418791/job/37719437613#step:19:120
```
Beginning upload of artifact content to blob storage
Error: An error has occurred while creating the zip file for upload
Error: EACCES: permission denied, open '/home/runner/_work/pytorch/pytorch/test/test-reports/backends.xeon.test_launch_1.1_22ba1133f3fcd140_.log'
/home/runner/_work/_actions/actions/upload-artifact/v4/dist/upload/index.js:3459
    throw new Error('An error has occurred during zip creation for the artifact');
    ^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145504
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-02-28 19:16:28 +00:00
6e037ac41c [async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001)
Fixes https://github.com/pytorch/torchtitan/issues/864

## Summary
While testing torchtitan with float8 training with rowwise scaling + async TP, a [bug](https://github.com/pytorch/torchtitan/issues/864) was discovered. The symptom was the scaling factor dims did not match the dims of the tensor the scales were to be applied to.

My [root cause analysis](https://github.com/pytorch/torchtitan/issues/864#issuecomment-2672465060) determined the reason is that when async TP graph manipulation constructs the `fused_scaled_matmul_reduce_scatter` op, it does not yet handle the "reshape -> scaled mm -> reshape" pattern used in torchao [here](ed361ff5c7/torchao/float8/float8_linear.py (L122-L124)) - specifically when row-wise scales are being used.

## TL;DR of root cause
- When a Float8Tensor is reshaped, the scale is reshaped along with it so the dimensions are aligned.
- In the graph manipulation logic of the micropipeline TP post grad pass, the scaled_mm `A tensor` node is referencing the tensor from _before_ the reshape op, but referencing the `A_scale` node from _after_ the reshape op.

## Example
- Concrete example:
    - `A tensor` is a Float8Tensor with shape (1,8192,2048) and scale of shape (1,8192,1) when a matmul op is called in torchao [here](8706d3f3b0/torchao/float8/float8_linear.py (L70)). Torchao does a reshape -> scaled mm -> reshape [here](ed361ff5c7/torchao/float8/float8_linear.py (L122)). When a Float8Tensor is reshaped, its scale is reshaped along with it [here](8706d3f3b0/torchao/float8/float8_ops.py (L152)). So the first reshape makes the "A tensor" (1,8192,2048) => (8192,2048) and the scale (1,8192,1) => (8192,1).
    - During post grad pass in async TP:
        - `A_node` has shape (1,8192,2048) (tensor from before this [reshape](ed361ff5c7/torchao/float8/float8_linear.py (L122)))
        - `A_scale` has shape (8192,1) (due to reshape op above, which caused the scale to be reshaped from (1,8192,1) => (8192,1)).

## Solution

**Note:** the compiler inserts a `reciprocal` op after the reshape, so we can't simply use the node before the reshape as the `A_scale_node`, otherwise it will affect the numerics.

- Short-term solution: if the specific pattern shown below is detected, insert a reshape node after the reciprocal, to reshape the reciprocal output back to the original shape from before the reshape.
    - reshape is just a view, so there should be no impact on performance
```
Before:
    reshape (a,b,c) to (a*b,c) -> reciprocal

After:
    reshape (a,b,c) to (a*b,c) -> reciprocal -> reshape (a*b,c) to (a,b,c)
```

- Long-term solution: implement a `torch._scaled_matmul` which can support 3D+ `A tensor`

## Test plan
- Added unit test which exercises this new path
- Manually tested with torchtitan with float8 rowwise + async TP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148001
Approved by: https://github.com/yifuwang
2025-02-28 18:51:42 +00:00
945e359fc1 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.

This change tries very hard not to disrupt the initial design of the allocator, and it uses existing locking mechanism, whenever possible, to gather statistics "for free". Only deviation from that is on the "slow path" where we incur CUDA calls anyway, so taking a short lock is not going to hurt the performance much, especially in the steady state where most allocations will come from cache.

As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-02-28 18:36:44 +00:00
982d7ba3ef [while_loop][inductor] relax the constraint that all inputs must be on the same device (#148019)
Previously, we required all inputs of while_loop to be on the same device. However, there are use cases where we want to keep some of the inputs on cpu and others on gpu, e.g. a loop_idx on cpu will save the gpu to device copies. This PR relaxes the constraint and only checks that the carry and input at the same position have the same device.
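
A minimal sketch of the relaxed check, using device strings in place of tensors. The function name `check_while_loop_devices` and the error messages are hypothetical.

```python
def check_while_loop_devices(input_devices, carry_devices):
    """Require matching devices per position, not one device for everything."""
    if len(input_devices) != len(carry_devices):
        raise ValueError("inputs and carries must have the same length")
    for idx, (inp, carry) in enumerate(zip(input_devices, carry_devices)):
        if inp != carry:
            raise ValueError(f"position {idx}: input on {inp}, carry on {carry}")


# Mixed devices are accepted as long as each position agrees with itself.
check_while_loop_devices(["cpu", "cuda:0"], ["cpu", "cuda:0"])
```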

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148019
Approved by: https://github.com/eellison, https://github.com/jansel
2025-02-28 18:27:03 +00:00
2d2f60bdda [cond] support mismatched output in inductor (#147567)
In this PR, we extract `codegen_unbacked_symbol_defs` of FallbackKernel out as a `codegen_unbacked_symbol_defs_for_outputs` method in wrapper. With it, HOPs can support the case where the subgraph returns a tensor with unbacked symints. This PR only does it for cond; we'll have follow-up PRs for others (e.g. while_loop) as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147567
Approved by: https://github.com/jansel
2025-02-28 18:26:48 +00:00
d765077004 [cutlass backend] Sort the list of ops for better repro (#148047)
Differential Revision: [D70298051](https://our.internmc.facebook.com/intern/diff/D70298051/)

This only affects anything if `cutlass_max_profiling_configs` is used. I believe cutlass_max_profiling_configs is more of a testing config.

The problem is that when we get the configs from cutlass_library, the ops can come in different orders.

The motivation is to make small issues easier to reproduce.
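
To illustrate why sorting helps, here is a rough sketch under assumptions: ops are modeled as dicts with a hypothetical `name` key, and `select_ops_for_profiling` is an illustrative helper, not the function changed in this PR. Sorting before truncating to the profiling limit makes the selected subset independent of the order in which the library yields ops:

```python
def select_ops_for_profiling(ops, max_profiling_configs=None):
    """Sort by a stable key before truncating so every run profiles the same ops."""
    ordered = sorted(ops, key=lambda op: op["name"])
    if max_profiling_configs is None:
        return ordered
    return ordered[:max_profiling_configs]


# Two runs that receive the same ops in different orders (names are made up).
run_a = [{"name": "gemm_128x64"}, {"name": "gemm_64x64"}, {"name": "gemm_256x128"}]
run_b = list(reversed(run_a))

picked_a = select_ops_for_profiling(run_a, max_profiling_configs=2)
picked_b = select_ops_for_profiling(run_b, max_profiling_configs=2)
print(picked_a == picked_b)
```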

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148047
Approved by: https://github.com/chenyang78, https://github.com/coconutruben
2025-02-28 18:04:10 +00:00
790ec756ee [cutlass backend] Check if len(timings) == len(choices) before skipping precompile (#148050)
Differential Revision: [D70298908](https://our.internmc.facebook.com/intern/diff/D70298908/)

Mostly from @coconutruben's observation. Right now, we skip precompilation if we find **some** timings. That sounds like a bug. Most of the time it is fine, since we don't change the number of configs and triton compilation doesn't take too long. But it is devastating for the cutlass backend.
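
The difference between the two conditions can be sketched as follows. Both helper names and the example data are hypothetical and only illustrate the check named in the title:

```python
def can_skip_precompile_old(timings, choices):
    """Old behavior as described above: any cached timing skips precompilation."""
    return len(timings) > 0


def can_skip_precompile(timings, choices):
    """Skip only when every choice already has a cached timing."""
    return len(timings) == len(choices)


choices = ["triton_0", "triton_1", "cutlass_0"]
partial_timings = {"triton_0": 0.12}  # only one of three choices was timed
full_timings = {"triton_0": 0.12, "triton_1": 0.10, "cutlass_0": 0.08}

print(can_skip_precompile_old(partial_timings, choices))
print(can_skip_precompile(partial_timings, choices))
print(can_skip_precompile(full_timings, choices))
```

With a partial cache, the old condition would skip precompilation even though two choices still need compiling, which is expensive when those choices are cutlass kernels.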

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148050
Approved by: https://github.com/coconutruben
2025-02-28 17:58:58 +00:00
e5e31050d3 [MPS] Implement linear1d as shader (#148154)
And get rid of the MPS call, as for some reason the implementation via an MPSGraph
API call is 100x+ slower than the Metal shader, at least according to
the following benchmark
```python
import torch
import time
import subprocess

def benchmark(device, dtype):
    # Create example inputs
    x = torch.testing.make_tensor(3, 5, 65536, device=device, dtype=dtype)
    sf = .5

    # Check output
    y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="linear")
    z = torch.nn.functional.interpolate(x.cpu(), scale_factor=sf, mode="linear")
    outputs_match = torch.allclose(y.cpu(), z)
    if not outputs_match:
       atol = (y.cpu() - z).abs().max()
       rtol = ((y.cpu() - z)[z!=0]/z[z!=0]).abs().max()
       print(f"atol={atol} rtol={rtol}")

    # Measure time manually
    start_time = time.time() * 1000
    for _ in range(1000):
        y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="linear")
    torch.mps.synchronize()
    end_time = time.time() * 1000
    manual_delta = (end_time - start_time)
    average_time = f"{manual_delta:6.1f}"

    return "True " if outputs_match else "False", average_time

outputs_match_list = []
average_time_list = []
for device in ["mps", "cpu"]:
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        outputs_match, average_time = benchmark(device, dtype)
        outputs_match_list.append(str(outputs_match))
        average_time_list.append(average_time)

brand_string = subprocess.check_output(['sysctl', '-n', 'machdep.cpu.brand_string']).decode("utf-8").strip()
print(f"\nBenchmarking Results (collected on {brand_string}):")
print("-"*40)
print("Device            :                MPS        |               CPU")
print("Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16  ")
print(f"Outputs Match     :  ", " |  ".join(outputs_match_list))
print(f"Average Time (us) :", "  |".join(average_time_list))
```

Benchmark results after the change
```
Benchmarking Results (collected on Apple M2 Pro):
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :    2.5  |   2.1  |   2.2  | 161.4  | 115.0  | 161.1
```
And before the change
```
Benchmarking Results (collected on Apple M2 Pro):
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :  354.0  | 336.0  | 332.4  | 145.5  | 114.7  | 148.3
```

Fixes https://github.com/pytorch/pytorch/issues/144245
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148154
Approved by: https://github.com/dcci
2025-02-28 16:47:42 +00:00
b5cd4ac950 [torchgen] Add support for schema with namespace (#148038)
Fixes https://github.com/pytorch/executorch/issues/8711

In ExecuTorch when we try to parse the following schema:

```
aten::__lshift__.Scalar(Tensor self, Scalar other) -> Tensor
```
Repro:

```python
from torchgen.model import FunctionSchema
native_schema = FunctionSchema.parse("aten::__lshift__.Scalar(Tensor self, Scalar other) -> Tensor")
```
It's failing because `BaseOperatorName` categorizes it as an
in-place operator.

I understand we are not supposed to pass in namespace "aten::" into
`FunctionSchema.parse()` but unfortunately ExecuTorch requires this
feature to work.

This PR adds a new `namespace` attribute to `BaseOperatorName` and makes
sure the rest of the stack works as before if a schema without a
namespace is passed in.
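
A toy sketch of the idea, not torchgen's actual parser: `split_namespace` and `classify` are hypothetical helpers, and the classification rule (dunder names are not in-place, any other trailing `_` means in-place) is a simplifying assumption. It shows how a leading `aten::` can defeat a dunder check, and how splitting the namespace off first avoids that:

```python
import re


def split_namespace(op_name):
    """Split 'ns::name' into (ns, name); ns is None when no namespace is given."""
    if "::" in op_name:
        ns, _, base = op_name.rpartition("::")
        return ns, base
    return None, op_name


def classify(base):
    """Toy rule: dunder names are not in-place; any other trailing '_' is in-place."""
    if re.fullmatch(r"__[^_]+__", base):
        return "dunder"
    return "inplace" if base.endswith("_") else "functional"


# Without splitting, the prefix hides the dunder and the trailing '_' wins.
print(classify("aten::__lshift__"))

# With the namespace split off first, the base name is classified correctly.
ns, base = split_namespace("aten::__lshift__")
print(ns, classify(base))
```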
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148038
Approved by: https://github.com/bdhirsh
2025-02-28 16:41:50 +00:00
e593288859 ci: Remove manylinux builds for triton, except for XPU (#148129)
We're dropping regular old manylinux so let's drop it here too

Relates to #123649

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148129
Approved by: https://github.com/Camyll, https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman
ghstack dependencies: #148126
2025-02-28 16:23:18 +00:00
4708cfdbd9 Support whitelist of dynamic sources (#147979)
This PR introduces the ability to whitelist sources as dynamic. This is particularly useful for large models with graph breaks, as you can keep the dynamism across graph breaks since source names stay consistent. Additionally you can use this to mark ints as dynamic.

NB: I intentionally didn't complicate the interface by supporting specification of per-dimension dynamism. There is virtue in keeping true to the standard way of representing sources (eg. L['x']). If we find in practice that we need more fine-grained control, we can explore further affordances at that time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147979
Approved by: https://github.com/Mingming-Ding
2025-02-28 15:43:14 +00:00
0a948f705b [Dynamo] Fix AssertionError when dynamo traces torch.functional.xxx() functions (#148075)
Fixes #147840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148075
Approved by: https://github.com/yanboliang
2025-02-28 15:09:11 +00:00
1db3c58fab Remove manylinux 2014 artifacts (#148135)
1. Switch Magma build to Manylinux 2.28 base
2. Use manylinux 2.28 as default in populate_binary_env.sh
3. Remove manylinux 2014 docker builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148135
Approved by: https://github.com/malfet
2025-02-28 13:43:14 +00:00
1cb4e2df65 [BE][PYFMT] migrate PYFMT for torch._inductor to ruff format (#144550)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144550
Approved by: https://github.com/jansel
2025-02-28 13:33:19 +00:00
34d726011f [dynamo] update data-dependent branching graph break messages (#147912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147912
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #147494, #147872
2025-02-28 12:30:06 +00:00
4106aa33eb [dtensor][fix] fix _scaled_dot_product_flash_attention sharding (#148125)
### Summary
https://github.com/pytorch/pytorch/pull/146372/ changed the op signature of `_scaled_dot_product_flash_attention` and as a consequence DTensor needs to change its sharding defined at 40ad5e01df/torch/distributed/tensor/_ops/_matrix_ops.py (L232)

### Test
`pytest test/distributed/tensor/test_attention.py`

### Follow-up
It's still unclear why the CP unit tests were not run over the original PR which is BC-breaking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148125
Approved by: https://github.com/tianyu-l, https://github.com/fegin
2025-02-28 09:26:43 +00:00
af720cd5a7 [Intel GPU] Decouple Intel GPU oneDNN from other backends (#147926)
# Motivation
Currently, Intel GPU is moving forward rapidly with feature development. We (Intel GPU) want independent version control over the oneDNN component so that we can quickly adopt optimizations and bug fixes provided by the oneDNN team.

This PR does not change the behaviors of other backends like Intel CPU, ARM. They can keep using the stable version contained in `third_party/ideep`.

# Detail

At compilation time, we `git clone` oneDNN via the URL `https://github.com/oneapi-src/oneDNN` and check out the tag/commit that the Intel GPU backend prefers. This is supported by the CMake `ExternalProject_Add` command.
Following is a build log example:
```bash
[11/60] Performing download step (git clone) for 'xpu_mkldnn_proj'
Cloning into 'xpu_mkldnn_proj'...
HEAD is now at 5e92240360 meta: updated citation file
[12/60] Performing update step for 'xpu_mkldnn_proj'
-- Already at requested tag: v3.7
[13/60] No patch step for 'xpu_mkldnn_proj'
```
The log demonstrates that we explicitly download the source files and check out a specific tag. The oneDNN source is located at `build/xpu_mkldnn_proj-prefix/src/xpu_mkldnn_proj`.

# Runtime verification
Running UT for CPU
```bash
onednn_verbose,v1,info,oneDNN v3.7.0 (commit fc3f17ad469b8a6da7192ae12d32625faa509f1e)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:24
onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,v1,info,gpu,runtime:none
onednn_verbose,v1,info,graph,backend,0:dnnl_backend
onednn_verbose,v1,primitive,info,template:operation,engine
```

Running UT for Intel GPU
```bash
onednn_verbose,v1,info,oneDNN v3.7.0 (commit 5e9224036021433d2577548ed0539fe9a53256bc)
onednn_verbose,v1,info,cpu,runtime:threadpool,nthr:24
onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,v1,info,gpu,runtime:DPC++
onednn_verbose,v1,info,gpu,engine,sycl gpu device count:2
```

We can see that Intel GPU uses commit `5e922` (tag v3.7), while CPU uses `fc3f17`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147926
Approved by: https://github.com/EikanWang

Co-authored-by: leizhenyuan <zhenyuan.lei@intel.com>
2025-02-28 07:42:06 +00:00
3a58a04898 Build a storage reader/writer to write checkpoints in HF format (#148089)
Summary: D69984656 caused issues by adding the fsspec dependency to torch distributed when many packages internally didn't have it. In this diff I'm not adding HFStorageReader/Writer to __init__.py so that HFStorage components don't get imported internally and in turn there is no fsspec import that happens. I did the removal from __init__.py in D70286926 to fix the failing tests but the revert was done concurrently. I'll add the classes to __init__.py when I figure out a better way to get fsspec added as a dependency everywhere

Test Plan:
signals pass
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/distributed/checkpoint:test_hf_storage

Differential Revision: D70324090

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148089
Approved by: https://github.com/saumishr
2025-02-28 07:38:10 +00:00
995df34b19 [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547
Approved by: https://github.com/kwen2501
2025-02-28 07:35:56 +00:00
4e160d5fd9 [triton 3.3] Fix aoti cpp wrapper remaining 5 issue. (following #148051) (#148117)
Summary:
Fix the following 5 on a100:

- test_foreach_cpp_wrapper_cuda_gpu_wrapper

- test_enable_dynamic_shapes_cpp_wrapper_cuda_gpu_wrapper

- test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_gpu_wrapper

- test_enable_dynamic_shapes_cpp_wrapper_cuda_dynamic_shapes_gpu_wrapper

- test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_dynamic_shapes_gpu_wrapper

Test Plan:
oss :

```
TORCHINDUCTOR_COMPILE_THREADS=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCH_LOGS="+inductor, output_code" TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 CPLUS_INCLUDE_PATH=/usr/local/cuda-12.6/include:$CPLUS_INCLUDE_PATH python test/inductor/test_gpu_cpp_wrapper.py -k test_foreach_cpp_wrapper_cuda_gpu_wrapper
```

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148117
Approved by: https://github.com/davidberard98, https://github.com/chenyang78
2025-02-28 06:56:30 +00:00
ea12fc8a9f Revert D70262395 (#148164)
Summary:

This reverts #147804 due to internal revert.

---
This diff reverts D70262395

Reviewed By: RossMcKenzie

Differential Revision: D70318024

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148164
Approved by: https://github.com/xmfan
2025-02-28 06:39:48 +00:00
baba7beed2 [dynamo] add context manager debug information to graph breaks (#147872)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147872
Approved by: https://github.com/zou3519
ghstack dependencies: #147494
2025-02-28 06:23:28 +00:00
4caeede799 [dynamo] more better error messages [3/N] (#147494)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147494
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-02-28 06:23:28 +00:00
bc362cc15a Move expanded dim require_exact_stride handling to api from sdpa lowering (#148101)
See issue: https://github.com/pytorch/pytorch/issues/147156#issue-2852362217.

Original tests from https://github.com/pytorch/pytorch/pull/146054 should cover these changes, and I tested that the perf on https://github.com/pytorch/pytorch/issues/145760 remains fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148101
Approved by: https://github.com/zou3519
2025-02-28 06:02:18 +00:00
cyy
b0dfd242fa Remove NO_MULTIPROCESSING_SPAWN checks (#146705)
py 3.9 has spawn.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146705
Approved by: https://github.com/colesbury
2025-02-28 05:53:19 +00:00
3b4b23ab0b [BE][Ez]: Remove extra copy in dtensor parallel loss (#148096)
Remove an extra copy of the input to `_log_softmax` when there is a dtype and memory format change. Fuse the copies instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148096
Approved by: https://github.com/jansel, https://github.com/wconstab
2025-02-28 05:42:32 +00:00
9b7130b8db Clean temporary directory at exit (#147813)
Issue: A temporary directory is created in [pytorch/torch/distributed/nn/jit/instantiator.py](https://github.com/arthurlw/pytorch/blob/clean-temp-directory-at-exit/torch/distributed/nn/jit/instantiator.py) but is never cleaned up, leading to a ResourceWarning on program exit.

Solution: Registered an `atexit` handler to properly clean up the temporary directory when the program exits.

Fixes #147744

**Line 23 in [0a49f8f](0a49f8fd3d)**
```python
23  atexit.register(_TEMP_DIR.cleanup)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147813
Approved by: https://github.com/H-Huang
2025-02-28 04:12:23 +00:00
760921a7d8 [MPS] Add inductor support for the entr() operator. (#148128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148128
Approved by: https://github.com/jansel, https://github.com/malfet
2025-02-28 03:33:22 +00:00
eb9c127341 [dynamo][optimizers] Install ID_GUARDED tensors into the Fx graph (#147824)
Earlier, with inline flag we were lifting id-guarded tensors to the inputs to the Fx graph. But this offers no benefit. Main idea behind lifting parameters as inputs was to reuse the compilation units across many instances of the nn-module. However, if we are guarding on the `id`, we are explicitly specializing the compiled artifact to the parameter.

This PR installs the parameters back into the graph. The benefit is removal of all pre-graph bytecode to extract the id-guarded tensors from locals/globals. This increases speedup from 1.67x to 1.75x for an internal model that has large number of optimizer parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147824
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@meta.com>
2025-02-28 03:22:11 +00:00
926b7b5027 Revert "Remove NO_MULTIPROCESSING_SPAWN checks (#146705)"
This reverts commit 40ad5e01dff05c7d64e070fb01683820e678f788.

Reverted https://github.com/pytorch/pytorch/pull/146705 on behalf of https://github.com/cyyever due to Broke lint?, I guess land race with ruff update ([comment](https://github.com/pytorch/pytorch/pull/146705#issuecomment-2689603077))
2025-02-28 03:04:38 +00:00
3ce352e389 [BE][PYFMT] migrate PYFMT for torch._dynamo to ruff format (#144549)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144549
Approved by: https://github.com/jansel
2025-02-28 03:03:53 +00:00
edc5bf91d2 [Intel GPU] Add synchronize() in torch.utils.benchmark (#147835)
When following https://pytorch.org/tutorials/recipes/recipes/benchmark.html on XPU, I noticed that the device is not synchronized in the benchmark. This PR tries to fix this and align the behavior with CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147835
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2025-02-28 02:58:17 +00:00
0edb2da4a4 [dynamo] add sourceless builder for types.MethodType (#147880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147880
Approved by: https://github.com/jansel
2025-02-28 02:30:04 +00:00
30375cb326 Fix minor typo in python_nccl (#148088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148088
Approved by: https://github.com/Skylion007
2025-02-28 00:47:09 +00:00
481a57bc37 Support torch.compile rng selective activation checkpointing with cudagraph (#146878)
TODO:
- [x] Add handling for when forward is invoked multiple times without invoking backward, so that the fwd/backward states are out of sync
- [x] Update rng state initialization to take from correct device
- [x] Tests
- [x] Handling of retain_graph
- [x] Respect fallback random

Fix for https://github.com/pytorch/pytorch/issues/130123.

Updates the aot_eager and cudagraph compilation of `run_and_save_rng_state` to use the new mechanism added by https://github.com/pytorch/pytorch/pull/114068 for CUDAGraph safe rng states.

We have a pair of rng states for the fwd and backward respectively. In both forward and backward the rng op will get run with `graphsafe_run_with_rng_state` which takes in RNG state and it hooks onto the current RNG generator before running the operator. The rng states for fwd/backward are initialized with the same value. We ensure that for any given run of the forward, the corresponding backward run will have the same rng states for the op as was observed in the forward.

```
 ===== Forward graph 1 =====
 /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", fwd_rng_state_0):
        sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = fwd_rng_state_0);  fwd_rng_state_0 = None
        ...

 ===== Backward graph 1 =====
    def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", tangents_1: "f32[4, 4][4, 1]cuda:0", bwd_rng_state_0):
        sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = bwd_rng_state_0);  bwd_rng_state_0 = None
```

There is some extra complication when a user either calls backward with retain_graph, or calls the backward in a different order than they called the forward. If a user has state fwd_rng_state0, bwd_rng_state0 and calls:
- fwd0: fwd_rng_state0 -> fwd_rng_state1
- fwd1: fwd_rng_state1 -> fwd_rng_state2
- bwd1
- bwd0

Then naively, when bwd1 is invoked the bwd rng states would not be equal to the states that were observed in fwd1. I added handling of this in the aot runtime wrappers to detect pending backward invocations, and the current position of the bwd rng states, and to update when necessary.

Other notes:

Because nodes which appear later in the forward appear earlier in the backward, we need a separate rng state for each operator. If we reused the rng across ops, the forward and backward would be run with different rng states. I.e., not applied in the same order.
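
The ordering argument can be illustrated with a small sketch. Python's `random.Random` stands in for a device RNG state here, and the op names and seeds are made up; this is an analogy for the reasoning above, not the mechanism used by `graphsafe_run_with_rng_state`:

```python
import random

OPS = ["rand_a", "rand_b"]  # forward order; the backward recomputes in reverse


def run_shared(order, seed):
    """All ops draw from one stream, so the values depend on execution order."""
    rng = random.Random(seed)
    return {op: rng.random() for op in order}


def run_per_op(order, seeds):
    """Each op owns its state, so the values are independent of execution order."""
    return {op: random.Random(seeds[op]).random() for op in order}


fwd_shared = run_shared(OPS, seed=0)
bwd_shared = run_shared(list(reversed(OPS)), seed=0)

seeds = {"rand_a": 1, "rand_b": 2}
fwd_per_op = run_per_op(OPS, seeds)
bwd_per_op = run_per_op(list(reversed(OPS)), seeds)

# Compare what each op saw in the forward with what it sees in the backward.
print("shared stream matches:", fwd_shared == bwd_shared)
print("per-op states match:", fwd_per_op == bwd_per_op)
```

With a shared stream, the op that runs first in the backward consumes the value that a different op consumed first in the forward, so recomputation diverges; per-op states avoid that.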

Questions for reviewers:

This does change numerics, bc the rng of the op is now taken from the input rng state instead of whatever the rng would be midway through running the graph. Technically, we only need this for cuda graph. But, I'd prefer to not have a rng divergence just for cudagraph. I am making it respect `fallback_random`.

Edit: decided to apply to non cudagraphs as well, so long as fallback_random is not set

I'm initializing the rng states by cloning the current state. If you had something like 5 different rands in the model with the same shape, they'd all get the same value. This doesn't seem great. I could use some other initialization scheme, like taking the seed from the graph position. Not sure. Let me know your thoughts.

Edit: updated to be taken from randint()

Update: initializing rng states from torch.randint..

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146878
Approved by: https://github.com/anijain2305, https://github.com/bdhirsh
2025-02-28 00:47:03 +00:00
c6d1038aaa only print GraphModule during fx.Interpreter errors if valid (#148090)
Came up in https://www.internalfb.com/diff/D69057074?dst_version_fbid=970771615000938&transaction_fbid=1723357345264461 - we need to make sure the GraphModule is valid before calling `print_readable` on it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148090
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
ghstack dependencies: #147749
2025-02-28 00:44:27 +00:00
5a14ff8ace Add cufile to list of libraries to preload (#148137)
Fixes: https://github.com/pytorch/pytorch/issues/148120

Test with almalinux/9-base:latest :
```
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 401, in <module>
    from torch._C import *  # noqa: F403
ImportError: libcufile.so.0: cannot open shared object file: No such file or directory
>>> exit()
[root@18b37257e416 /]# vi /usr/local/lib64/python3.9/site-packages/torch/__init__.py
[root@18b37257e416 /]# python3
Python 3.9.19 (main, Sep 11 2024, 00:00:00)
[GCC 11.5.0 20240719 (Red Hat 11.5.0-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
/usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
>>> torch.__version__
'2.7.0.dev20250227+cu126'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148137
Approved by: https://github.com/malfet
2025-02-28 00:35:47 +00:00
40ad5e01df Remove NO_MULTIPROCESSING_SPAWN checks (#146705)
py 3.9 has spawn.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146705
Approved by: https://github.com/colesbury
2025-02-28 00:15:32 +00:00
2978771c9d [CI] test upload: better check for if job is rerun disabled tests (#148027)
Some disabled test runs weren't being uploaded as disabled tests because some dynamo tests are set to mark themselves as skipped if they are failing.  This makes the script think that there are fewer retries than there actually are and that the job is not a rerun disabled tests job.  Instead, query for the job name to see if it contains rerun disabled tests and fall back to counting the number of retries if querying fails.
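
A rough sketch of the "query first, count as fallback" logic. Everything here is an assumption made for illustration: the helper name, the `rerun_disabled_tests` marker string, the `min_runs` threshold, and the example job name are hypothetical and not taken from the upload script:

```python
def is_rerun_disabled_tests(fetch_job_name, run_counts, min_runs=3):
    """Prefer the job name; fall back to counting retries if the query fails.

    fetch_job_name: callable returning the job name (may raise on failure).
    run_counts: mapping of test name -> number of times it was run.
    min_runs: assumed threshold for the fallback heuristic.
    """
    try:
        return "rerun_disabled_tests" in fetch_job_name()
    except Exception:
        # Fallback: treat the job as a rerun job only if every test was retried
        # at least min_runs times. Skipped tests can make this undercount,
        # which is why the job name is checked first.
        return bool(run_counts) and all(n >= min_runs for n in run_counts.values())


def job_name_ok():
    return "example-build / test (rerun_disabled_tests, 1, 3)"  # made-up name


def job_name_fails():
    raise RuntimeError("query failed")


print(is_rerun_disabled_tests(job_name_ok, {}))
print(is_rerun_disabled_tests(job_name_fails, {"test_a": 5, "test_b": 4}))
print(is_rerun_disabled_tests(job_name_fails, {"test_a": 1}))
```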

Alternate options: relax the check for the number of tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148027
Approved by: https://github.com/huydhn
2025-02-28 00:04:33 +00:00
fc78192b1d ci: Only run CI specific things when in CI (#148126)
This was blocking me from running this locally so don't run it like this

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148126
Approved by: https://github.com/Camyll, https://github.com/malfet, https://github.com/atalman
2025-02-27 23:27:57 +00:00
f4235310e8 [BE][Ez]: Remove redundant empty tensor copies in meta-reg (#147978)
`empty_like` includes a `memory_format` arg. Let's use it to avoid unnecessary copy operations. Noticed while reviewing: https://github.com/pytorch/pytorch/pull/147862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147978
Approved by: https://github.com/jansel
2025-02-27 23:16:44 +00:00
915b9c80ab [export] Sync aoti schema to schema.py (#148017)
Summary: Synchronizing internal AOTI schema to OSS schema.py

Test Plan: CI

Differential Revision: D70271151

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148017
Approved by: https://github.com/yiming0416
2025-02-27 21:46:11 +00:00
871b3909fc ci: Remove manylinux 2014 remnants (#148028)
These are the only remaining references I could find to manylinux2014,
we should probably look to remove these a bit quicker since it made it
difficult to know which Dockerfiles were important in
.ci/docker/manywheel/

> [!TIP]
> I checked if we were using these by running
> `rg 2014 .github/`

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148028
Approved by: https://github.com/wdvr, https://github.com/malfet, https://github.com/atalman
2025-02-27 21:37:00 +00:00
10ffd94216 Reference the commit explicitly (#148026)
Reference the commit tested by CI explicitly, and fail the merge if the PR was updated.
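
The check can be sketched as a simple comparison of commit SHAs. The function name and the example SHAs are hypothetical; this only illustrates the rule stated above, not the merge tooling itself:

```python
def ensure_head_unchanged(tested_sha, current_head_sha):
    """Fail the merge if the PR head moved after CI ran on tested_sha."""
    if tested_sha != current_head_sha:
        raise RuntimeError(
            f"PR was updated after CI: tested {tested_sha}, "
            f"but head is now {current_head_sha}"
        )
    return tested_sha


# Made-up SHAs: the first call passes, the second would abort the merge.
merged = ensure_head_unchanged("abc1234", "abc1234")
try:
    ensure_head_unchanged("abc1234", "def5678")
except RuntimeError as err:
    print(err)
```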

Tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148026
Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/atalman
2025-02-27 21:06:34 +00:00
783d83c5d8 [PT2] Port fuse_split_getitem_squeeze to PT2 pre_grad passes (#148059)
Summary: put it as an add_pass option

Reviewed By: frank-wei

Differential Revision: D68909559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148059
Approved by: https://github.com/frank-wei
2025-02-27 21:03:51 +00:00
d48eb58d1d [BE][CI] bump ruff to 0.9.8 (#145606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145606
Approved by: https://github.com/malfet
ghstack dependencies: #144546
2025-02-27 21:01:10 +00:00
644d84d594 Revert "optimize the decomposition of aten.native_group_norm (#144733)"
This reverts commit b533bb4b133c36767270bd8a24f11d5c37f8dd5c.

Reverted https://github.com/pytorch/pytorch/pull/144733 on behalf of https://github.com/desertfire due to Cause TIMM pass rate regression on H100, see https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2020%20Feb%202025%2020%3A53%3A55%20GMT&stopTime=Thu%2C%2027%20Feb%202025%2020%3A53%3A55%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=main&lCommit=4216478250e08e950fdd090fc23a1b270c520cc4&rBranch=main&rCommit=4986f0f52eb871cdb91b8124ee162cfe622b8688 ([comment](https://github.com/pytorch/pytorch/pull/144733#issuecomment-2689092714))
2025-02-27 20:57:25 +00:00
1845e7d1f5 Use nightly-wheel-upload env for triton wheel publishing (#148108)
Required for publishing triton builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148108
Approved by: https://github.com/malfet
2025-02-27 20:47:40 +00:00
c73a92fbf5 [BE][CI] bump ruff to 0.9.2: multiline assert statements (#144546)
Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements

> Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target:
>
> ```python
> # Input
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
>
> # Black
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
> # Ruff
> assert len(policy_types) >= priority + num_duplicates, (
>     f"This tests needs at least {priority + num_duplicates} many types."
> )
> ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546
Approved by: https://github.com/malfet
2025-02-27 20:46:16 +00:00
f0d00421cf [inductor][ck] kBatch filtering with gen_ops (#148004)
Summary:
# Why

Not all choices of kBatch are valid; invalid ones lead to a runtime error (when CK checks the validity of the args).

c9bcfd755e/include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3_multi_d.hpp (L1020)

# What

- move kBatch inside the gen_ops to have more control over it, and be able to filter it
- expand filtering based on the cpp logic
- refactor the padding checks to be more readable

Test Plan:
```
buck2 run -c fbcode.re_gpu_tests=False mode/opt-amd-gpu  fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0
```

with

kBatch = 128: some filtering
kBatch = 1: no filtering
kBatch = 1738: all options filtered out

Reviewed By: henrylhtsang

Differential Revision: D70211442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148004
Approved by: https://github.com/ColinPeppler, https://github.com/tenpercent
2025-02-27 20:13:58 +00:00
ce805a5ba5 [BE/metal] Rename REGISTER_I0_I1 to REGISTER_SPECIAL. (#148036)
Now that it's used for other ops as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148036
Approved by: https://github.com/malfet, https://github.com/jansel
2025-02-27 17:56:26 +00:00
9a1f720a72 Validate inputs to _nested_view_from_buffer to prevent overflows (#147356)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147356
Approved by: https://github.com/albanD, https://github.com/jbschlosser
ghstack dependencies: #147352, #147354
2025-02-27 15:48:58 +00:00
536bce5a04 Make Tensor.set_ validate storage_offset when sizes/strides are unchanged (#147354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147354
Approved by: https://github.com/albanD
ghstack dependencies: #147352
2025-02-27 15:48:58 +00:00
e64441915f Fix overflow in checkInBoundsForStorage (#147352)
Use `computeStorageNbytes` (which checks for overflows) to include the computation re the storage_offset
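
A Python sketch of an overflow-checked size computation that includes the storage offset. It emulates signed 64-bit bounds explicitly, since Python integers do not overflow. The formula (one plus the sum of `(size - 1) * stride`, plus the offset, times the item size) is my reading of what such a helper computes, and the function names are illustrative rather than the C++ API:

```python
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)


def checked(value):
    """Raise instead of silently wrapping, as a checked 64-bit op would."""
    if not INT64_MIN <= value <= INT64_MAX:
        raise OverflowError("storage size calculation overflowed")
    return value


def storage_nbytes(sizes, strides, itemsize, storage_offset=0):
    """Bytes needed to address every element, including the storage offset."""
    if any(size == 0 for size in sizes):
        return 0
    span = 1
    for size, stride in zip(sizes, strides):
        span = checked(span + checked((size - 1) * stride))
    return checked(itemsize * checked(storage_offset + span))


def check_in_bounds(sizes, strides, itemsize, storage_offset, storage_size_bytes):
    """Reject views that would read past the end of the storage."""
    needed = storage_nbytes(sizes, strides, itemsize, storage_offset)
    if needed > storage_size_bytes:
        raise ValueError(f"need {needed} bytes, storage has {storage_size_bytes}")


# A contiguous 2x3 float32 view starting at element 2 of a 32-byte storage.
check_in_bounds([2, 3], [3, 1], itemsize=4, storage_offset=2, storage_size_bytes=32)
```

Folding the offset into the checked computation means a very large `storage_offset` raises an error instead of wrapping around and appearing to fit.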

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147352
Approved by: https://github.com/albanD
2025-02-27 15:48:50 +00:00
6ccbff1450 [Inductor] Fix inductor/test_kernel_benchmark.py for new Triton; do not duplicate parameters in _dump_launch_params (#147746)
The problem is that the new Triton uses the following code branch, which does not filter the call parameters, which may already be in the launcher's cfg.kwargs. This is generally expected behavior, so I just stopped adding arguments from `launcher.config.kwargs`: cde12207a0/torch/_inductor/runtime/triton_heuristics.py (L1099)
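
The failure mode can be reproduced in plain Python. In this sketch `kernel` is a hypothetical stand-in for a compiled kernel entry point and `build_dumped_call` is an illustrative helper, not the real `_dump_launch_params`; it assumes the recorded call arguments already contain the config values:

```python
def kernel(in_ptr, out_ptr, xnumel, XBLOCK):
    """Hypothetical stand-in for a compiled kernel entry point."""
    return xnumel, XBLOCK


def build_dumped_call(args, kwargs, config_kwargs, include_config_kwargs):
    """Build the (args, kwargs) pair that a repro script would replay."""
    dumped = dict(kwargs)
    if include_config_kwargs:
        # Old behavior: config values are appended even if already in args.
        dumped.update(config_kwargs)
    return args, dumped


call_args = ("in0", "out0", 1, 1)  # XBLOCK is already passed positionally
config_kwargs = {"XBLOCK": 1}

args, kwargs = build_dumped_call(call_args, {}, config_kwargs, include_config_kwargs=True)
try:
    kernel(*args, **kwargs)
except TypeError as err:
    print(err)  # the argument is supplied twice

args, kwargs = build_dumped_call(call_args, {}, config_kwargs, include_config_kwargs=False)
print(kernel(*args, **kwargs))
```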

Issue example (from https://github.com/intel/intel-xpu-backend-for-triton/issues/3499):

```bash
Failed when when running cleaned triton Command '['/home/xinanlin/xinanlin/miniforge3/bin/python', '/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3b
dmtky5n4j4jrd5k5pu.py.cleaned']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 103, in <module>
    compiled_module_main('None', benchmark_compiled_module)
  File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/wrapper_benchmark.py", line 435, in compiled_module_main
    wall_time_ms = benchmark_compiled_module_fn(times=times, repeat=repeat) * 1000
  File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 98, in benchmark_compiled_module
    return print_performance(fn, times=times, repeat=repeat)
  File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/utils.py", line 451, in print_performance
    [timed(model, example_inputs, times, device) for _ in range(repeat)]
  File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/utils.py", line 451, in <listcomp>
    [timed(model, example_inputs, times, device) for _ in range(repeat)]
  File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/utils.py", line 434, in timed
    result = model(*example_inputs)
  File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 97, in <lambda>
    fn = lambda: call([arg0_1, arg1_1])
  File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 86, in call
    triton_poi_fused_add_0[grid(1)](arg0_1, arg1_1, buf0, 1, 1, XBLOCK=1, num_warps=1, num_stages=1)
  File "/home/xinanlin/xinanlin/miniforge3/lib/python3.10/site-packages/triton/runtime/jit.py", line 336, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/home/xinanlin/xinanlin/miniforge3/lib/python3.10/site-packages/triton/runtime/jit.py", line 531, in run
    bound_args, specialization, options = binder(*args, **kwargs)
TypeError: dynamic_func() got multiple values for argument 'XBLOCK'
```
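
The failure mode in the traceback can be reproduced in plain Python. The sketch below is an illustration only: `launch` and `merge_launch_kwargs` are hypothetical stand-ins, not Triton or Inductor functions. It shows why passing an option both at the call site and again from a config dict raises `TypeError`, and one way to merge the two without duplicates (the PR itself simply stops adding the config arguments).

```python
def launch(x, XBLOCK=1, num_warps=1):
    """Hypothetical stand-in for a kernel launcher that takes tuning options as keywords."""
    return (x, XBLOCK, num_warps)


def merge_launch_kwargs(call_kwargs, config_kwargs):
    """Add config options only when the call site has not already supplied them."""
    merged = dict(call_kwargs)
    for name, value in config_kwargs.items():
        merged.setdefault(name, value)
    return merged


if __name__ == "__main__":
    config = {"XBLOCK": 1, "num_warps": 1}
    try:
        # XBLOCK arrives twice: once explicitly, once through **config.
        launch(0, XBLOCK=1, **config)
    except TypeError as exc:
        print("duplicate keyword:", exc)
    # Merging first keeps exactly one value per option.
    print(launch(0, **merge_launch_kwargs({"XBLOCK": 1}, config)))
```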

Reproduce:
`python test/inductor/test_kernel_benchmark.py -k test_remove_inductor_deps`

Triton: c4a79a1960
Pytorch: bea72180ed75f522ce4fe5e723bc2112e0874732

@davidberard98 @etaf please take a look
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147746
Approved by: https://github.com/jansel
2025-02-27 14:40:22 +00:00
2c35af4def [Intel GPU] Avoid including CPU oneDNN header files for Intel GPU (#147969)
XPU builds oneDNN in another folder. The XPU oneDNN header files are in the XPU-specific folder - `${__XPU_MKLDNN_BUILD_DIR}`.
f522d899fb/cmake/Modules/FindMKLDNN.cmake (L73)

So, `${PROJECT_SOURCE_DIR}/third_party/ideep/mkl-dnn/include` is not needed for XPU; `XPU_MKLDNN_INCLUDE` is sufficient. Moreover, keeping it may mix up the included files if the version of XPU oneDNN differs from the one used by other backends.

* __->__ #147969

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147969
Approved by: https://github.com/ZhiweiYan-96, https://github.com/liangan1, https://github.com/atalman
2025-02-27 14:22:17 +00:00
71ee17baa1 Smoke Test skip cuda.gds on windows (#148060)
Follow-up to https://github.com/pytorch/pytorch/pull/147120
Cufile was enabled only on Linux: https://pypi.org/project/nvidia-cufile-cu12/#files
Fixes validation workflow failures: https://github.com/pytorch/test-infra/actions/runs/13558218752/job/37896578837

```
 File "C:\Jenkins\Miniconda3\envs\conda-env-13558218752\lib\site-packages\torch\cuda\gds.py", line 105, in __init__
    raise RuntimeError("GdsFile is not supported on this platform.")
RuntimeError: GdsFile is not supported on this platform.
Exception ignored in: <function GdsFile.__del__ at 0x000001772B5003A0>
Traceback (most recent call last):
  File "C:\Jenkins\Miniconda3\envs\conda-env-13558218752\lib\site-packages\torch\cuda\gds.py", line 113, in __del__
    if self.handle is not None:
AttributeError: 'GdsFile' object has no attribute 'handle'
```
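
The second half of this traceback (the `AttributeError` in `__del__`) is what happens when `__init__` raises before an attribute is assigned and the finalizer later reads that attribute. The sketch below illustrates the pattern with a hypothetical `Handle` class; it is not the real `torch.cuda.gds.GdsFile`, and the PR itself only skips the smoke test on Windows. Reading the attribute through `getattr` with a default is one common way to keep a finalizer safe on a partially constructed object.

```python
class Handle:
    """Hypothetical resource wrapper used only to illustrate the failure mode."""

    def __init__(self, supported: bool):
        if not supported:
            # Raising here means self.handle is never assigned.
            raise RuntimeError("not supported on this platform")
        self.handle = object()
        self.closed = False

    def close(self):
        self.handle = None
        self.closed = True

    def __del__(self):
        # getattr with a default tolerates an object whose __init__ failed early.
        if getattr(self, "handle", None) is not None:
            self.close()
```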

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148060
Approved by: https://github.com/mikaylagawarecki
2025-02-27 14:00:49 +00:00
7ae0e0b2ea [aotd] Log torch._functorch.config in tlparse (#147883)
Adding torch._functorch.config to tlparse for better debuggability.
E.g. https://github.com/pytorch/pytorch/pull/147638 happened only with `torch._functorch.config.view_replay_for_aliased_outputs=False`, which is True by default

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147883
Approved by: https://github.com/bdhirsh, https://github.com/jamesjwu
2025-02-27 11:22:45 +00:00
c5bf9aaf1c Log graph breaks (#146537)
Graph breaks currently aren't logged to dynamo_compile and pt2_compile_events. We want to log them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146537
Approved by: https://github.com/c00w
2025-02-27 11:06:33 +00:00
0489a349e7 Skip the logging if the pass cannot be pickled (#148053)
Summary:
Skip the logging for vllm at this moment, we can add some pickle logic later.

The log is only for debugging purpose.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148053
Approved by: https://github.com/chenyang78
2025-02-27 10:54:34 +00:00
26f19539ad [triton 3.3] cpp_wrapper: add a global_scratch arg (#148051)
Following triton # 4916, the generated cubin expects a global_scratch argument to support on-device TMA. We believe this is the source of many of the "invalid argument" failures on AOTI/cpp_wrapper tests. AFAIK, we don't use on-device TMA in Inductor as of now, so it should be safe to use a nullptr for the scratch space.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148051
Approved by: https://github.com/YUNQIUGUO
2025-02-27 10:13:57 +00:00
91e7c7945c [Intel GPU] Avoid unnecessary copy when the dst of Matmul is non-contiguous (#144759)
We should not always call contiguous on the dst of matmul. We have already removed copy of matmul input in https://github.com/pytorch/pytorch/pull/143784

I also fixed an accuracy issue by using onednn sum post op instead of binary add in the case of inplace to avoid UT failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144759
Approved by: https://github.com/EikanWang
2025-02-27 08:04:34 +00:00
8ee84aa703 [ONNX] Fix missed None type support in dyamic shapes string cases (#148025)
In `_any_str_or_dim_in_dynamic_shapes`, we strictly guard the `dynamic_shapes` to make sure the flattened shapes are valid. But the code failed to consider that None could be in the shapes.

NOTE: Found in benchmarking with Olive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148025
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-02-27 07:57:47 +00:00
fd43c36aa9 [ca] side-effect free initial trace: RAII PyCompilerInterface (#147891)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147891
Approved by: https://github.com/jansel
ghstack dependencies: #147242, #147796, #147804
2025-02-27 07:17:30 +00:00
9017becf1d Add unique kernel name support for user defined triton kernel (#147587)
Summary:
Add unique_user_kernel_names which mimics what unique_kernel_names do, but for user defined Triton kernels.
This does rewrite the copied kernel src, and modifies non-Inductor generated code, so we split it out from unique_kernel_names, where we have more control over all namings and generations.

Test Plan: Only used for debug purpose

Differential Revision: D69966608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147587
Approved by: https://github.com/desertfire
2025-02-27 06:00:50 +00:00
c622796cde Revert "Build a storage reader/writer to write checkpoints in HF format (#147622)"
This reverts commit 6a658d983e84f7bcb8e67328b00661ec49db78c5.

Reverted https://github.com/pytorch/pytorch/pull/147622 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147622#issuecomment-2686932514))
2025-02-27 05:14:28 +00:00
21bd5fe203 Update torch-xpu-ops commit pin (#147968)
Update the torch-xpu-ops commit to [86aaaf8a9dd6932c088b7afcac0c0856b23d341a](86aaaf8a9d), includes:

- Bugfix (PT2E/BatchNorm)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147968
Approved by: https://github.com/Skylion007
2025-02-27 05:01:12 +00:00
b6fe28ff02 [Inductor] Graph Partition (#147038)
This PR implements inductor graph partition. Previously, 1 dynamo graph was mapped to 1 inductor graph, and further mapped to 1 call function. In this PR, we allow 1 dynamo graph to be mapped to multiple inductor graphs and multiple `graph_partition` functions in the generated code. This allows applying different further optimizations to different `graph_partition` functions.

Design Doc: [link](https://docs.google.com/document/d/1qPgOfy25l7SIYnrQrvU-TO1mdHMslCwv_SLmeXID6tM/edit?usp=sharing)
Example: [Generated code before and after this diff](https://www.internalfb.com/intern/diffing/?paste_number=1737334601)

In the follow-up PR, we will extend the work to cudagraph, which allows applying cudagraph to parts of the generated code (#125864).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147038
Approved by: https://github.com/eellison
2025-02-27 04:50:43 +00:00
e0b93082f1 Remove HuggingFace reader and writer from __init__.py (#148030)
Summary: This is causing a HFStorageReader/Writer to be imported which imports fsspec but dependencies don't have fsspec, which is causing failing builds

Differential Revision: D70286926

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148030
Approved by: https://github.com/hl475
2025-02-27 04:50:14 +00:00
8cb8722979 [inductor][triton] Ignore block ptr advances for removed buffers (#147193)
Block ptr advancements should also be deferred, conditional on the associated buffer not being removed. For example, if `FusedSchedulerNode(op0-op1)` has a store in `SchedulerNode` `op0` that is read in `op1`, the store and the associated block ptr that would be created for `op0` in isolation are no longer needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147193
Approved by: https://github.com/jansel

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-27 03:37:33 +00:00
17358ce778 Revert "Support torch.compile rng selective activation checkpointing with cudagraph (#146878)"
This reverts commit ad0c879e2203145f6d56df0b95af36822220ab8f.

Reverted https://github.com/pytorch/pytorch/pull/146878 on behalf of https://github.com/wdvr due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/146878#issuecomment-2686767956))
2025-02-27 03:36:16 +00:00
9d3636283b [Inductor] Use generic GPU device in test_preserves_strides (#148006)
#147861 added a new test that is tagged for the generic GPU but uses the cuda GPU type for creating the tensors. Update the GPU type to also be generic. This passes with my local Intel Triton install; presumably it will work for the current pin.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148006
Approved by: https://github.com/eellison, https://github.com/etaf
2025-02-27 02:52:51 +00:00
07b7b3ed4e torch._scaled_mm with MXFP8 (#147548)
# summary

Add blockwise MXFP8 support to `torch._scaled_mm` on CUDA capability 10.0 and higher devices.  If the scales for A and B are of dtype `torch.float8_e8m0fnu`, we dispatch to the blockwise kernel from cuBLAS.

This is a skeleton PR where we test basic functionality (numerics of various simple matrices, as well as one end to end quantization + gemm).

- Scales are flipped based on transpose_result
- Handles boundary conditions

Note that MXFP4 is not added in this PR - we can tackle that in a future PR.

This PR was created by taking https://github.com/pytorch/pytorch/pull/145562, switching e8m0 to in-core dtype, removing fp4 for now, and adding test cases.

# test plan

```
pytest test/test_matmul_cuda.py -k blockwise_mxfp8 -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147548
Approved by: https://github.com/drisspg

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-02-27 02:44:39 +00:00
84c89a4527 [cutlass backend] cache_clear algorithm select cache on fresh inductor cache (#147590)
Differential Revision: [D69959917](https://our.internmc.facebook.com/intern/diff/D69959917/)

AlgorithmSelectorCache is a cache. The expectation is that when we force disable caches and clear inductor caches, it would be cleared. However, that is not the case.

The reason why this is a problem can be seen by following this repro:
What we will see is
```
SingleProcess AUTOTUNE benchmarking takes 6.2202 seconds and 46.0568 seconds precompiling for 36 choices
SingleProcess AUTOTUNE benchmarking takes 492.3141 seconds and 0.0010 seconds precompiling for 36 choices
```

The root cause is that, while precompiling is skipped because its result is cached, autotuning isn't skipped since we force disable caches.
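
As a rough illustration of the expectation described above, the sketch below registers a cache's `cache_clear` with a global reset function so that clearing "all caches" really reaches it. Every name here (`register_cache`, `clear_all_caches`, `SelectorCache`) is hypothetical; this is not how Inductor wires up `AlgorithmSelectorCache` or `clear_inductor_caches`.

```python
_CLEAR_HOOKS = []


def register_cache(clear_fn):
    """Remember how to empty a cache so a global reset can reach it."""
    _CLEAR_HOOKS.append(clear_fn)


def clear_all_caches():
    """Hypothetical global reset: call every registered clear function."""
    for clear_fn in _CLEAR_HOOKS:
        clear_fn()


class SelectorCache:
    """Hypothetical stand-in for a cache of autotuning results."""

    def __init__(self):
        self._results = {}
        self.benchmarks_run = 0
        register_cache(self.cache_clear)

    def lookup(self, key, benchmark):
        # Only run the (expensive) benchmark on a cache miss.
        if key not in self._results:
            self.benchmarks_run += 1
            self._results[key] = benchmark()
        return self._results[key]

    def cache_clear(self):
        self._results.clear()
```

A cache that forgets to register itself would keep stale entries across a reset, which is the kind of mismatch the two timing lines above point to.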

repro:
```
import logging
import os

os.environ["TORCH_LOGS"] = "+output_code,+benchmarking,+inductor"

import torch

import torch._inductor.config
from torch._inductor.utils import clear_inductor_caches

torch._inductor.config.max_autotune = True
torch._inductor.config.force_disable_caches = True
torch._inductor.config.autotune_num_choices_displayed = None
torch._inductor.config.max_autotune_gemm_backends = "CUTLASS"
torch._inductor.config.autotune_fallback_to_aten = False
torch._inductor.config.cuda.cutlass_instantiation_level = "0001"

def main():
    M, N, K = 2048, 2048, 2048
    dtype = torch.bfloat16
    A = torch.randn(M, K, device="cuda", dtype=dtype)
    B = torch.randn(K, N, device="cuda", dtype=dtype)
    for _ in range(2):
        torch._dynamo.reset()
        clear_inductor_caches()
        compiled_model = torch.compile(torch.mm, fullgraph=True)
        _ = compiled_model(A, B)

    print("done")

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147590
Approved by: https://github.com/eellison, https://github.com/chenyang78
2025-02-27 02:30:49 +00:00
97ebccaa91 Add _fft_r2c as core ATen (#147998)
As titled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147998
Approved by: https://github.com/tugsbayasgalan
2025-02-27 02:29:59 +00:00
ad0c879e22 Support torch.compile rng selective activation checkpointing with cudagraph (#146878)
TODO:
- [x]  Add handling for when forward is invoked multiple times without invoking backward, so that the fwd/backward states are out of sync
- [x] Update rng state initialization to take from correct device
- [x]  Tests
- [x] handling of retain_graph
- [x] respect fallback random

Fix for https://github.com/pytorch/pytorch/issues/130123.

Updates the aot_eager and cudagraph compilation of `run_and_save_rng_state` to use the new mechanism added by https://github.com/pytorch/pytorch/pull/114068 for CUDAGraph safe rng states.

We have a pair of rng states for the fwd and backward respectively. In both forward and backward the rng op will get run with `graphsafe_run_with_rng_state` which takes in RNG state and it hooks onto the current RNG generator before running the operator. The rng states for fwd/backward are initialized with the same value. We ensure that for any given run of the forward, the corresponding backward run will have the same rng states for the op as was observed in the forward.

```
 ===== Forward graph 1 =====
 /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", fwd_rng_state_0):
        sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = fwd_rng_state_0);  fwd_rng_state_0 = None
        ...

 ===== Backward graph 1 =====
    def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", tangents_1: "f32[4, 4][4, 1]cuda:0", bwd_rng_state_0):
        sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = bwd_rng_state_0);  bwd_rng_state_0 = None
```

There is some extra complication when a user either calls backward with retain_graph, or calls the backward in a different order than they called the forward. If a user has state fwd_rng_state0, bwd_rng_state0 and calls:
- fwd0: fwd_rng_state0 -> fwd_rng_state1
- fwd1: fwd_rng_state1 -> fwd_rng_state2
- bwd1
- bwd0

Then naively, when bwd1 is invoked, the bwd rng states would not be equal to the states that were observed in fwd1. I added handling of this in the aot runtime wrappers to detect pending backward invocations and the current position of the bwd rng states, and to update when necessary.

Other notes:

Because nodes which appear later in the forward appear earlier in the backward, we need a separate rng state for each operator. If we reused the rng across ops, the forward and backward would be run with different rng states. I.e., not applied in the same order.
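
The ordering argument can be illustrated with Python's standard `random` module standing in for the CUDA generator. This is only a sketch under that assumption: the helper names are hypothetical, and the seeding scheme (`seed + op`) is for illustration and is not how the PR initializes states. With one shared generator, recomputing ops in reverse order yields different values per op; with one saved state per op, each op sees the same values regardless of order.

```python
import random


def forward_shared(seed, n_ops):
    """Draw one value per op from a single generator, in forward order."""
    gen = random.Random(seed)
    return [gen.random() for _ in range(n_ops)]


def backward_shared(seed, n_ops):
    """Recompute the ops in reverse order from the same starting state."""
    gen = random.Random(seed)
    values = [None] * n_ops
    for op in reversed(range(n_ops)):
        values[op] = gen.random()
    return values


def make_per_op_states(seed, n_ops):
    """One independent generator state per op (illustrative seeding only)."""
    return [random.Random(seed + op).getstate() for op in range(n_ops)]


def run_with_states(states, order):
    """Run ops in the given order, each restoring its own saved state first."""
    values = [None] * len(states)
    for op in order:
        gen = random.Random()
        gen.setstate(states[op])
        values[op] = gen.random()
    return values
```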

Questions for reviewers:

This does change numerics, bc the rng of the op is now taken from the input rng state instead of whatever the rng would be midway through running the graph. Technically, we only need this for cuda graph. But, I'd prefer to not have a rng divergence just for cudagraph. I am making it respect `fallback_random`.

Edit: decided to apply to non cudagraphs as well, so long as fallback_random is not set

I'm initializing the rng states by cloning the current state. If you had something like 5 different rands in the model with the same shape, they'd all get the same value. This doesn't seem great. I could use some other initialization scheme, like taking the seed from the graph position. Not sure. Let me know thoughts.

Edit: updated to be taken from randint()

Update: initializing rng states from torch.randint..

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146878
Approved by: https://github.com/anijain2305, https://github.com/bdhirsh
2025-02-27 02:08:29 +00:00
784902983e Remove +PTX from cuda 12.6 builds (#148000)
Similar to: https://github.com/pytorch/pytorch/pull/141142

Ahead of the release 2.7
I see following validation failure: https://github.com/pytorch/test-infra/actions/runs/13552433445/job/37879041739?pr=6339
```
RuntimeError: Binary size of torch-2.7.0.dev20250226+cu126-cp310-cp310-manylinux_2_28_x86_64.whl 1076.45 MB exceeds the threshold 750 MB
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148000
Approved by: https://github.com/clee2000, https://github.com/ngimel, https://github.com/tinglvv
2025-02-27 02:02:11 +00:00
20ce67cd06 Udpate hw requirement for FP64 on "Getting Started on Intel GPU" (#147802)
Fixes #147731

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147802
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-27 01:54:19 +00:00
cyy
9ca871f32b Remove binaries/benchmark_args.h (#147920)
It's not used in OSS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147920
Approved by: https://github.com/Skylion007
2025-02-27 01:16:28 +00:00
ea5d40db73 Address source code building command for Intel GPU support (#143476)
As the title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143476
Approved by: https://github.com/EikanWang, https://github.com/malfet

Co-authored-by: Xu Han <xu.han@outlook.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-27 01:07:40 +00:00
f104ef1248 [AOTI][refactor] Consolidate CppBuilder.build and CppBuilder.build_fbcode (#147975)
Summary: Let CppBuilder handle all the cpp build logic

Differential Revision: D70141808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147975
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2025-02-27 00:35:12 +00:00
f98cd84b04 cpp_wrapper: use largeTensorTest for test memory checks (#146991)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146991
Approved by: https://github.com/desertfire
2025-02-27 00:30:21 +00:00
723f3a9eab torch.utils._content_store: fix error in hash_storage on XPU (#147785)
See https://github.com/pytorch/pytorch/actions/runs/13508573465/job/37745227468 for an example error. This is triggering after the merge of #147541, which enabled Dynamo compilation on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147785
Approved by: https://github.com/jansel
2025-02-26 23:57:59 +00:00
915eb012e1 Revert "[dynamo] add sourceless builder for types.MethodType (#147880)"
This reverts commit 08f4c1a2332921e57c782c80a66b2adc9cdc0575.

Reverted https://github.com/pytorch/pytorch/pull/147880 on behalf of https://github.com/wdvr due to failing trunk tests ([comment](https://github.com/pytorch/pytorch/pull/147880#issuecomment-2686436432))
2025-02-26 23:29:58 +00:00
84e60eece8 [ROCm] [TunableOp] Unit tests for scaled GEMM and GEMM with bias (#147890)
Two more unit tests for TunableOp:
- Scaled GEMM
- GEMM with bias

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147890
Approved by: https://github.com/jeffdaily
2025-02-26 22:41:24 +00:00
b13ad1a193 [ROCm][TunableOp] Remove extra transpose characters in hipBLASLt signature. (#147900)
Clean up the TunableOp hipBLASLt signature by removing extra transpose characters.

Tested manually; no new regressions found.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147900
Approved by: https://github.com/jeffdaily
2025-02-26 22:28:00 +00:00
7e7d05bf85 Revert "[do not merge yet] update grammar (#147996)"
This reverts commit 6e129a697f86425d0682ed30ffc9b3f8abe00e9e.

Reverted https://github.com/pytorch/pytorch/pull/147996 on behalf of https://github.com/seemethere due to Need to revert ([comment](https://github.com/pytorch/pytorch/pull/147996#issuecomment-2686291282))
2025-02-26 22:01:12 +00:00
6e129a697f [do not merge yet] update grammar (#147996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147996
Approved by: https://github.com/seemethere
2025-02-26 21:52:58 +00:00
dc7556f1bd Revert "[do not merge yet] update grammar (#147996)"
This reverts commit a1ee2c3a08c3bf3d83c4e9f352ea179c107edb13.

Reverted https://github.com/pytorch/pytorch/pull/147996 on behalf of https://github.com/seemethere due to Need to revert ([comment](https://github.com/pytorch/pytorch/pull/147996#issuecomment-2686266052))
2025-02-26 21:43:06 +00:00
a1ee2c3a08 [do not merge yet] update grammar (#147996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147996
Approved by: https://github.com/seemethere
2025-02-26 21:39:08 +00:00
201666d77d [cutlass backend] turn autotuning logs off by default + rename log to autotuning log (#147922)
things we did:
* turn off autotuning logs by default
* rename autotuning logs from log to autotuning_log, so people are aware that it is a special artifact log.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147922
Approved by: https://github.com/eellison
2025-02-26 21:02:04 +00:00
976ff5cf01 Add cmake hints to USE_SYSTEM_NVTX for nvtx3 include dir (#147418)
per title

sometimes, it's hard for cmake to find NVTX3 without the cuda include path hint
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147418
Approved by: https://github.com/nWEIdia, https://github.com/malfet
2025-02-26 20:52:28 +00:00
6a658d983e Build a storage reader/writer to write checkpoints in HF format (#147622)
Title - we want to write checkpoints in HF format with DCP, this diff allows this for the non-distributed use case.
Copy of [D68444967](https://www.internalfb.com/diff/D68444967) (https://github.com/pytorch/pytorch/pull/146352). That diff got reverted because of lint errors. The lint error was due to having imports of uninstalled libraries. This was on purpose because we don't want to install safetensors and huggingface, this new diff explicitly ignores this lint so that we don't have the error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147622
Approved by: https://github.com/saumishr
2025-02-26 20:47:54 +00:00
7c71ab1d40 [scan] User-facing reverse flag handling (#147886)
This PR removes the reverse flag from the backend implementation and resolves it via `torch.flip` in the frontend.
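
The "flip, scan, flip back" idea can be sketched on plain Python lists. This is an illustration only: the real `scan` operates on tensors with an `init` carry and uses `torch.flip`, whereas the hypothetical helpers below use `itertools.accumulate` and `reversed`.

```python
from itertools import accumulate
import operator


def scan(combine_fn, xs):
    """Forward inclusive scan over a Python sequence."""
    return list(accumulate(xs, combine_fn))


def scan_reverse_via_flip(combine_fn, xs):
    """Reverse scan expressed as flip -> forward scan -> flip."""
    flipped = list(reversed(xs))
    return list(reversed(scan(combine_fn, flipped)))


if __name__ == "__main__":
    print(scan(operator.add, [1, 2, 3, 4]))
    print(scan_reverse_via_flip(operator.add, [1, 2, 3, 4]))
```

With this arrangement the backend only ever needs a forward scan.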

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147886
Approved by: https://github.com/ydwu4
2025-02-26 20:04:57 +00:00
683e083e8d [MPS] Add support for entr() in eager. (#147948)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147948
Approved by: https://github.com/malfet
2025-02-26 19:55:02 +00:00
eb08ada5d3 [dynamo] Support reads to global/captured tensors in nonstrict_trace-ed function (#147572)
As title. Without this patch we get the following error:

Tweaking the `allow_non_fake_inputs` flag on tensor mode doesn't quite
work for AOTAutograd, which also needs to fake-tensor-propagate the
`nonstrict_trace`-ed function, but that's _after_ Dynamo has handled the
`nonstrict_trace` processing and put the `flat_apply(...)` node into the graph.

So we can't easily temporarily enable the `allow_non_fake_inputs`
flag on the current fake mode when AOTAutograd processes a `flat_apply`
node from Dynamo's `nonstrict_trace` handling. After discussing
with zou3519, I decided to add a global `FakeTensorTLS` that contains an
`allow_non_fake_inputs_override` flag, and patch the `nonstrict_trace`-ed
function to temporarily tweak this flag during its execution.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147572
Approved by: https://github.com/zou3519
ghstack dependencies: #146714, #146367, #146950, #147571
2025-02-26 19:47:39 +00:00
73e963459e [dynamo] Support nonstrict_trace on class method (#147571)
As title, also see
1. new test `test_nonstrict_trace_on_method` for example.
2. newly added comments for why we need special treatment on methods.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147571
Approved by: https://github.com/zou3519
ghstack dependencies: #146714, #146367, #146950
2025-02-26 19:47:39 +00:00
7e0ef2c844 [dynamo] Use the new get_unique_name_wrt helper when applicable (#146950)
This patch removes some duplicated name generation logic in Dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146950
Approved by: https://github.com/zou3519
ghstack dependencies: #146714, #146367
2025-02-26 19:47:39 +00:00
f46f0e465c [dynamo] Initial support for nonstrict_trace (#146367)
## Context
> **Note:** `mark_traceable` got renamed to `nonstrict_trace` after
> offline discussion. The reasons are (1) it aligns with `torch.export`'s
> `nonstrict` notion, and (2) it's more definitive in behavior suggestion.

1. [Overall Design](https://docs.google.com/document/d/1O-dR2ZQaJQVt_v67AVcDCw2yJLtqgkZFwoXK0buEWRg/edit?tab=t.0)
2. [Dynamo graph representation with `torch._higher_order_ops.flat_apply`](https://docs.google.com/document/d/1YHl5nPTJvYeCPE5TO9uA18DPWNgUYGE4gCn6bFvXcBM/edit?tab=t.0#heading=h.xtw3hhbro4gn)

## Summary
This patch adds a `torch._dynamo.nonstrict_trace` decorator, which
currently is an enhanced version of `torch._dynamo.allow_in_graph` (see
docstring for their differences). Specifically, this patch focuses on
the UI and functionality prototyping/plumbing.

The main enhancement is supporting more input types, and the
implementation challenge lies in reconstructing the input objects from
Dynamo `VariableTracker` (while accounting for buffered side-effects and
guards).  This patch takes a middle-ground (simple implementation with a
bit of user labor), by
1. asking the user to provide pytree registration for non-proxy-able
   input types,
2. letting Dynamo trace through `pytree_flatten` (which accounts for
   buffered side-effects and guards automatically),
3. and passing in the TreeSpec as a graph attribute constant into
   `torch._higher_order_ops.flat_apply` (which unflattens the inputs and
   invokes the underlying function).
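
The three steps above can be sketched without any PyTorch machinery. Everything below is a hypothetical illustration: `Point`, `flatten`, `unflatten`, and `flat_apply_sketch` are made-up names, and the "spec" is just the class itself rather than a real pytree `TreeSpec`. The point is only the shape of the flow: flatten a user object into leaves plus a spec, pass the leaves through, then rebuild the object and call the function.

```python
from dataclasses import dataclass


@dataclass
class Point:
    """Hypothetical user type that a tracer could not treat as a graph input directly."""
    x: float
    y: float


def flatten(point):
    """Split the object into leaf values plus a spec describing how to rebuild it."""
    return [point.x, point.y], Point


def unflatten(leaves, spec):
    """Rebuild the original object from its leaves and spec."""
    return spec(*leaves)


def flat_apply_sketch(fn, spec, *leaves):
    """Unflatten the inputs, then invoke the underlying function on them."""
    return fn(unflatten(list(leaves), spec))


def norm_sq(p):
    return p.x * p.x + p.y * p.y


if __name__ == "__main__":
    leaves, spec = flatten(Point(3.0, 4.0))
    print(flat_apply_sketch(norm_sq, spec, *leaves))
```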

## Next Steps
In subsequent patches, we will try to support the following:
- annotating on class method
- reads to global tensors
- inputs that contain `pytree.register_constant`-ed instances.
- function as input
- more output types (e.g., any pytree-registered type)
- `torch.nn.Module` as inputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146367
Approved by: https://github.com/zou3519
ghstack dependencies: #146714
2025-02-26 19:47:39 +00:00
bab84f0bd9 [hop] Support more output types for flat_apply (#146714)
This patch enables `flat_apply` to support certain non-Tensor output
types like containers and graphable types. This will in turn enable the
upcoming `mark_traceable` to support more output types.

The patch also exposes a `func_to_graphable` rather than having
users call the lower level `pytree.flatten(ConstantFunction(...))`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146714
Approved by: https://github.com/zou3519
2025-02-26 19:47:39 +00:00
8594856651 [aotd] Alias of intermediate unwrap TensorAlias (#147638)
Bug was reported by internal user.

AOTD classified outputs that are aliases of intermediates of the graph in different categories.

...
- output is an alias of an intermediate whose base is already an output
- output is an alias of an intermediate whose base is not in the outputs

If we look at the fn:
```
def fn(x):
    ix = x + 1
    a = ix.transpose(0, 1)
    return a.detach(), a
```

output 0: detach view of alias a, where a is already output
output 1: alias of intermediate ix, then additional output ix will be added internally

output 0 base is TensorAlias(a) in this case, but could be Tensor.
Adding runtime unwrapping solves this problem.

Alternatively, we should track the base of a.detach() all the way to ix; in that case the base will always be a Tensor, not a TensorAlias.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147638
Approved by: https://github.com/bdhirsh
2025-02-26 19:42:21 +00:00
30db64bf51 [PT2] Support add/remove passes in pre_grad (#146064)
Summary:
support the same functionality with acc_tracer disabled, add a new config for pre_grad add/remove_passes, at the front end it still uses the same interface

some minor updates in pre_grad passes to make sure the passes are run in desired order, after added passes, still run pass like remove_noops at the end

Test Plan: add new UT, please see stacked diff for add pass tests (TODO: update diff link)

Differential Revision: D68909278

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146064
Approved by: https://github.com/frank-wei
2025-02-26 18:46:43 +00:00
00732c3f7e [MPS] Implemented masked_fill_scalar as shader (#147369)
- Move `pos_from_thread_index` and `offset_from_pos` from `UnfoldBackward.metal` into the `c10/metal/indexing.h` header
- The initial idea was to implement `StridedTensor` and `ConstStridedTensor` and use them to make the masked_fill kernel something as simple as the following loop
```metal
ConstStridedTensor<bool> mask(mask_data, sizes, mask_strides, ndim);
if (mask[thread_index]) {
  StridedTensor<T> input(input_data, sizes, input_strides, ndim);
  input[thread_index] = val;
}
```
But though it looks elegant and works correctly, performance-wise it's much slower than the existing MPS shader (see table below), as int64 divisions on M2 GPU are really slow

- Solved the performance issue by implementing 3 flavors of the same shader: `dense`, which is used when both input and mask are dense tensors of the same size; `broadcast`, which is used when `mask` can be expanded along the leading dimensions of the input tensor; and `strided`, which is a general-purpose fallback but still computes the position in the tensors only once. As a result, perf is even better than the existing MPS shader for dense and broadcastable tensors.
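
To make the difference between the flavors concrete, here is a rough Python sketch over flat lists. It is an assumption-laden illustration, not the Metal shader: `contiguous_strides` and `masked_fill_sketch` are hypothetical names, and the broadcast case is modeled simply as a mask with stride 0 on its leading dimensions. The dense path uses the linear index directly, while the strided path decomposes the index once per element (the divisions that were slow in 64-bit) and reuses the coordinates for both tensors.

```python
def contiguous_strides(sizes):
    """Row-major strides for the given sizes."""
    strides, running = [], 1
    for size in reversed(sizes):
        strides.append(running)
        running *= size
    return list(reversed(strides))


def masked_fill_sketch(data, mask, sizes, data_strides, mask_strides, value):
    """Fill data where mask is true; data and mask are flat lists."""
    numel = 1
    for size in sizes:
        numel *= size
    dense = list(data_strides) == list(mask_strides) == contiguous_strides(sizes)
    for index in range(numel):
        if dense:
            # Dense flavor: the linear index is already the offset in both tensors.
            data_off = mask_off = index
        else:
            # Strided flavor: decompose the index once, reuse it for both tensors.
            data_off = mask_off = 0
            rest = index
            for size, d_stride, m_stride in zip(
                reversed(sizes), reversed(data_strides), reversed(mask_strides)
            ):
                rest, coord = divmod(rest, size)
                data_off += coord * d_stride
                mask_off += coord * m_stride
        if mask[mask_off]:
            data[data_off] = value
    return data
```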

Performance measured on M2Pro thru different iterations of the same shader

| dtype    | MPS       | int64-idx | int64-inlined | 32-bit strided | 32-bit broadcasted |
| -------- | --------- | --------- | ------------- | -------------- | ------------------ |
| float32  | 2.8 msec  | 41.6 msec | 26.9 msec     | 5 msec         | 2.4 msec           |
| float16  | 1.86 msec | 38.2 msec | 26.6 msec     | 4.6 msec       | 1.9 msec           |
| bfloat16 | 1.86 msec | 38.3 msec | 26.6 msec     | 4.6 msec       | 1.9 msec           |

And benchmark script
```python
import torch

from timeit import default_timer
from itertools import product
from torch.utils.benchmark import Measurement, Timer

def bench_mask_fill(
    n,
    binary_func,
    dtype=torch.float32,
) -> Measurement:
    t = Timer(
        stmt=f"x.masked_fill(y, -17.0); torch.mps.synchronize()",
        setup=f"x,y = torch.rand(1, 20, {n}, {n}, dtype={dtype}, device='mps'), torch.ones({n}, {n}, device='mps').triu().bool()",
        globals = {'f': binary_func},
        language="python", timer=default_timer
    )
    return t.blocked_autorange()

if __name__ == "__main__":
    n = 1024
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        eager_t = bench_mask_fill(n, torch.fmax, dtype)
        use_msec = eager_t.mean > 1e-4
        multiplier = 1e3 if use_msec else 1e6
        uname = "msec" if use_msec else "usec"
        print(f"torch.masked_fill_() {str(dtype):>14} {eager_t.mean*multiplier:>7.2f} {uname}")
```
Fixes https://github.com/pytorch/pytorch/issues/143477
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147369
Approved by: https://github.com/dcci
ghstack dependencies: #147977
2025-02-26 18:39:15 +00:00
ebf6b9839c [MPS] faster integer batched matmul (#147877)
Followup to #147526
Tiled matmul for bmm as well.

## Speed ups:
![speedups_bmm](https://github.com/user-attachments/assets/02501145-7d64-4bbe-9dcc-994f004b4829)

Script to record times:
```python
import torch
import numpy as np
import time
import csv

batch_sizes = [1, 2, 4, 8]
matrix_sizes = [256, 512, 1024, 2048]
num_runs = 10
warmup_runs = 3

def run_int_mm(A, B):
    torch.mps.synchronize()
    start = time.perf_counter()
    c = A @ B
    torch.mps.synchronize()
    end = time.perf_counter()
    return c, end - start

results = {
    'N': [],
    'B': [],
    'mean_time': [],
    'std_time': []
}

for b in batch_sizes:
    for n in matrix_sizes:
        print(f"\nBenchmarking N={n} and B={b}")

        try:
            A_mps = torch.randint(low=-100, high=100, size=(b, n, n), dtype=torch.int8, device="mps")
            B_mps = torch.randint(low=-100, high=100, size=(b, n, n), dtype=torch.int8, device="mps")

            for _ in range(warmup_runs):
                _, _ = run_int_mm(A_mps, B_mps)

            times = []
            for _ in range(num_runs):
                _, t = run_int_mm(A_mps, B_mps)
                times.append(t)

            mean_time = np.mean(times)
            std_time = np.std(times)

            results['N'].append(n)
            results['B'].append(b)
            results['mean_time'].append(mean_time)
            results['std_time'].append(std_time)

            print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

        except RuntimeError as e:
            print(f"Error for N={n}: {e}")
            continue

with open('int_bmm_benchmark_times_new.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'batch', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['B'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147877
Approved by: https://github.com/Skylion007
2025-02-26 18:37:13 +00:00
cfb293ee02 [inductor] Add logs for precompile and autotuning (#147923)
Differential Revision: D70222645

I want to add more logs around precompile, especially around the reason why sometimes it gets fast returned. See https://github.com/pytorch/pytorch/pull/147590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147923
Approved by: https://github.com/Skylion007
2025-02-26 18:26:07 +00:00
0ea5d1067b ROCm: Remove static specifier for allow_tf32 variable. (#147186)
Since the env variable HIPBLASLT_ALLOW_TF32 can change, remove static type for allow_tf32 variable so that it captures the current value of env variable HIPBLASLT_ALLOW_TF32.
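
The effect of the `static` specifier can be mimicked in Python. The sketch below is only an analogy, not the C++ code touched by this PR: the function names are hypothetical, and treating the value `"1"` as "enabled" is an assumption made for the example. A cached lookup behaves like a `static` local that is initialized once, while the uncached version re-reads the environment on every call.

```python
import os
from functools import lru_cache


@lru_cache(maxsize=None)
def allow_tf32_cached():
    # Like a `static` local: evaluated on the first call, then frozen.
    return os.environ.get("HIPBLASLT_ALLOW_TF32") == "1"


def allow_tf32_current():
    # Re-reads the environment every time, so later changes are visible.
    return os.environ.get("HIPBLASLT_ALLOW_TF32") == "1"


os.environ["HIPBLASLT_ALLOW_TF32"] = "0"
first_cached, first_current = allow_tf32_cached(), allow_tf32_current()

os.environ["HIPBLASLT_ALLOW_TF32"] = "1"
second_cached, second_current = allow_tf32_cached(), allow_tf32_current()

print(first_cached, second_cached)    # the cached value does not follow the change
print(first_current, second_current)  # the uncached value does
```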

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147186
Approved by: https://github.com/jeffdaily, https://github.com/naromero77amd
2025-02-26 18:24:02 +00:00
4e4191854b [logs][qol] Print log options alphabetically (#147888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147888
Approved by: https://github.com/jansel
2025-02-26 18:15:39 +00:00
fb566c5aea Fix auto_functionalize x inference_mode (#147925)
Fixes #147924

We were using the wrong FunctionalTensorMode to construct
FunctionalTensors. FunctionalTensors modify the FunctionalTensorMode on
construction, so that led to the wrong FunctionalTensorMode being
modified. This PR threads the FunctionalTensorMode through correctly.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147925
Approved by: https://github.com/bdhirsh
2025-02-26 18:05:30 +00:00
678435c443 [FlexAttention] Fix IMA bug (#147918)
# Summary
Fixes: https://github.com/pytorch/pytorch/issues/147268

I got this right for the backward and somehow forgot to do the flip in the forward; not sure how this wasn't found earlier.

Testing IMAs is tough in pytest, so I didn't add a test, but I verified the fix on the reproducer

```py
❯ sanitize python flex/maurice_ima.py --setting 0
========= COMPUTE-SANITIZER
pool: torch.Size([64, 8, 784, 64]) tensor(1.0078, device='cuda:0')
Feat shape torch.Size([64, 8, 784, 64])
Feat strides (401408, 50176, 64, 1)
Feat is contig: True
attn: torch.Size([64, 8, 784, 64]) tensor(1.7994, device='cuda:0')
========= ERROR SUMMARY: 0 errors
❯ sanitize python flex/maurice_ima.py --setting 1
========= COMPUTE-SANITIZER
pool: torch.Size([64, 8, 784, 64]) tensor(2.8297, device='cuda:0')
Feat shape torch.Size([64, 8, 784, 64])
Feat strides (401408, 50176, 64, 1)
Feat is contig: True
attn: torch.Size([64, 8, 784, 64]) tensor(1.9714, device='cuda:0')
========= ERROR SUMMARY: 0 errors
❯ sanitize python flex/maurice_ima.py --setting 2
========= COMPUTE-SANITIZER
pool: torch.Size([64, 8, 784, 64]) tensor(3.2232, device='cuda:0')
Feat shape torch.Size([64, 8, 784, 64])
Feat strides (401408, 50176, 64, 1)
Feat is contig: True
attn: torch.Size([64, 8, 784, 64]) tensor(2.2095, device='cuda:0')
========= ERROR SUMMARY: 0 errors
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147918
Approved by: https://github.com/BoyuanFeng, https://github.com/Skylion007
2025-02-26 17:59:05 +00:00
3f7e242c86 [CI] Checkout with more processes (#147652)
The default action doesn't use more processes, possibly because most github provided runners only have 2 cpus, but we have more than that, so we might as well use them

Generally cuts maybe 1 min off of checkout time?

Changed checkout from pytorch/pytorch@main to pytorch/pytorch@my branch to test on 249a936998e66cc0d6ad8664e0e93ec1b9432a8b

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147652
Approved by: https://github.com/ZainRizvi
2025-02-26 17:51:28 +00:00
ef61c290e1 [DTensor][random] defer DTensor RNG state sync until first random op call or manual_seed call; support more flexible OffsetBasedRNGTracker init (#147025)
Resolves https://github.com/pytorch/pytorch/issues/146767.

May also resolve https://github.com/pytorch/pytorch/issues/147584.

### Summary
This PR removes the RNG tracker init from the `distribute_tensor` call for the following reasons:

1. if the user does not use random ops on DTensor, there's no need to init DTensor RNG which currently requires CUDA device to be present.
2. this complies with the 0-communication semantic of `src_data_rank=None` shard distribution.

Besides, `OffsetBasedRNGTracker` only accepts `DeviceMesh` argument to its constructor method.

### Consequence

DTensor RNG initialization is delayed till the first DTensor random ops call or `torch.distributed.tensor.random.manual_seed`.
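
The deferred initialization described above follows a common lazy-init pattern. Below is a minimal sketch of that pattern only; the class, its methods, and the `sync_count` counter are hypothetical stand-ins and do not reflect the actual `OffsetBasedRNGTracker` implementation. The counter stands in for the cross-rank state sync that this PR moves out of `distribute_tensor`.

```python
class LazyRNGTracker:
    """Hypothetical tracker that syncs RNG state only on first use."""

    def __init__(self):
        self._state = None
        self.sync_count = 0  # stands in for a cross-rank seed synchronization

    def _ensure_initialized(self):
        if self._state is None:
            self.sync_count += 1
            self._state = {"seed": 0, "offset": 0}
        return self._state

    def manual_seed(self, seed):
        state = self._ensure_initialized()
        state["seed"] = seed
        state["offset"] = 0

    def random_op(self, numel):
        # Returns the (seed, offset) pair a random op would consume.
        state = self._ensure_initialized()
        start = state["offset"]
        state["offset"] += numel
        return state["seed"], start


def distribute(data, tracker):
    # Distribution itself never touches the RNG tracker.
    return list(data)


tracker = LazyRNGTracker()
shard = distribute([1, 2, 3], tracker)
syncs_after_distribute = tracker.sync_count
tracker.random_op(4)
tracker.random_op(4)
syncs_after_random_ops = tracker.sync_count
print(syncs_after_distribute, syncs_after_random_ops)
```

A user who never calls a random op never pays for the synchronization, which is the behavior the summary argues for.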

### Test
`pytest test/distributed/tensor/test_random_ops.py`
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`
`pytest test/distributed/tensor/parallel/test_tp_style.py`

Differential Revision: [D70201856](https://our.internmc.facebook.com/intern/diff/D70201856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147025
Approved by: https://github.com/kwen2501
2025-02-26 17:33:22 +00:00
5ef94ca816 [BE] Do not copy arguments in variadic template (#147977)
By adding the missing `std::forward<Args>(args)...` and declaring the template as taking args by reference

Noticed while working on creating an `mtl_setBytes` specialization that takes `MPSScalar` as an argument
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147977
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-02-26 17:20:16 +00:00
ba9ed856e0 [FlexAttention] Improve error msg for embedding < 16 (#147765)
flex_attention uses tl.dot, which [does not support embedding < 16](https://github.com/triton-lang/triton/issues/2266) on input shapes. This PR adds explicit error message for users who are prototyping with small tensors.

Fixes #147701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147765
Approved by: https://github.com/drisspg
2025-02-26 17:06:35 +00:00
ac926f81cc [Inductor][Triton] Rework casting logic to avoid illegal bitcast (#147395)
Triton introduced checks for bitcasts where the casted value does not fit into the casted type (e.g. https://github.com/triton-lang/triton/pull/5926, though in this instance I think the issue is related to the type for the broadcast). Some routines in Inductor now perform illegal bitcasts. I reworked the compare and swap w/ index routine used in sort to remove the illegal bitcast (~~I left the bitcast for now, but I think it could probably be removed assuming the reshape does not change the type~~). The explicit cast is correct, and I don't think there are performance issues, but because the cast on the sum is not a bitcast I suppose there could be.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147395
Approved by: https://github.com/eellison
2025-02-26 16:56:17 +00:00
fd1220e386 [ca] side-effect free inital trace: compiled_args (#147804)
Use const methods to prevent accidental mutation. The changes are mainly in Error nodes and PyNode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147804
Approved by: https://github.com/jansel
ghstack dependencies: #147242, #147796
2025-02-26 16:37:27 +00:00
5e3069dde8 [ca] side-effect free initial trace: GraphTask (#147796)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147796
Approved by: https://github.com/jansel
ghstack dependencies: #147242
2025-02-26 16:37:27 +00:00
0a2da008f8 [ca] trace saved variable unpacking (#147242)
## Before

Previously, CA would always unpack all saved variables stored in the autograd graph before executing it. This meant that we couldn't capture unpack hooks as part of the CA graph, and they would fire out of order with respect to other backward hooks. For memory-saving APIs built on top of saved tensor hooks, like non-reentrant checkpointing and offloading, we couldn't achieve any savings because all activations would be recomputed/loaded and active at the same time, making them a no-op.

## After

We add unpack hooks into the CA graph so that they can be executed progressively. The python hook and hook input themselves are wrapped by non-traceable code, so CA polyfills the wrapping as:
```python
# pseudocode
class SavedVariable:
  def unpack(self):
    if self.hook:
      return self.hook(self.packed_data)
    else:
      return self.packed_data

# This approach won't directly work when we add support for Forward AD or double-backward.
```

Directly executing the CA graph (without torch.compiling it) under checkpointing/offloading, memory profile is expected to stay the same as when using the eager autograd engine. If AOT backward is in the autograd graph, memory profile is expected to be better than the eager autograd engine, since we can now delay saved activations unpacking into the AOT backward's execution.

All tests pass when running the CA graph directly, the remaining issues are in Dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147242
Approved by: https://github.com/jansel
2025-02-26 16:37:17 +00:00
08f4c1a233 [dynamo] add sourceless builder for types.MethodType (#147880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147880
Approved by: https://github.com/jansel
2025-02-26 15:43:47 +00:00
edaf9ddeb5 Add basic Gaudi support to benchmarks/dynamo (#145920)
This PR adds basic Gaudi support to benchmarks/dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145920
Approved by: https://github.com/eellison
2025-02-26 14:50:22 +00:00
be830c8b1c [Inductor][CPP] fix store mode atomic add (#147961)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/147848 and https://github.com/pytorch/pytorch/issues/146390. While addressing these issues, 2 problems were encountered:

- In `CppVecKernel`, when the number of threads is 1 and the mode is `atomic_add`, `store` did not `load/add` before storing. This has been fixed in this PR.

- In `CppTile2DKernel`, `store` did not support `atomic_add` mode. Support for this has been added in this PR.
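
The first problem can be pictured with a small sketch. This is a plain-Python model, not the generated C++ from Inductor; the `store` function and its `mode` argument are hypothetical names. When several source elements map to the same output index (as happens in fold-like ops), a plain store keeps only the last value, whereas `atomic_add` must load the current value and add to it, even when only one thread is running.

```python
def store(buf, index, value, mode=None):
    """Hypothetical single-threaded model of a kernel store."""
    if mode is None:
        buf[index] = value
    elif mode == "atomic_add":
        # Load, add, then store; with one thread no real atomic is needed,
        # but the accumulation must still happen.
        buf[index] = buf[index] + value
    else:
        raise NotImplementedError(mode)


writes = [(0, 1.0), (0, 2.0), (1, 3.0)]  # two writes collide on index 0

overwritten = [0.0, 0.0]
for index, value in writes:
    store(overwritten, index, value)

accumulated = [0.0, 0.0]
for index, value in writes:
    store(accumulated, index, value, mode="atomic_add")

print(overwritten, accumulated)
```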

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_nn_fold
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147961
Approved by: https://github.com/malfet
2025-02-26 14:04:34 +00:00
f522d899fb Add MSVC version condition to "Fix for MSVC problem on Windows Arm64 (#136765)" (#145076)
This PR adds MSVC version guards around the if block introduced in f7e36d8d6f9706ee9b9653538c4c8d2ba375a181. That commit provided a workaround for the problem reported here: https://developercommunity.visualstudio.com/t/MSVC-loop-unrolling-problem-194033813-/10720692 .
The issue is fixed now and only appears between versions 19.36 and 19.42.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145076
Approved by: https://github.com/malfet, https://github.com/alinpahontu2912

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
2025-02-26 12:08:24 +00:00
60d94ea22b Add option to limit number of SMs used by matmul kernels (#147966)
Resubmission of #144974 which was reverted for unrelated reasons.

Newer matmul kernels, e.g. those targeting Hopper GPUs, sometimes use a "persistent" schedule, which consists of launching as many CUDA blocks as there are SMs on the GPU, with each such block then working on multiple output tiles in a row. This eliminates the overhead of starting and finishing each tile, effectively doing cross-tile pipelining. In previous generations these latencies could be hidden by having multiple CUDA blocks per SM but, with blocks becoming larger, only one can run at a time per SM and thus this needs to be taken care of in software.

Persistent kernels become an issue when other kernels are running concurrently. The classical example is an NCCL communication kernel running in the background. In such cases the matmul expects to be able to use all the SMs but is prevented from doing so because some of them are busy. This can lead to its blocks being scheduled as two separate waves on the available SMs. This "wave quantization" can double the latency of the matmul kernels.

While we wait for smarter solutions, such as automatic load balancing among the blocks, an easy way to unblock ourselves is to tell the matmuls to only use a subset of the GPU's SMs. For this, I am introducing a global `sm_carveout` flag which can be used to specify how many SMs should be left available for other kernels.
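
The wave-quantization arithmetic can be made concrete with a short sketch. The numbers are illustrative assumptions (a GPU with 132 SMs, 8 of which are occupied by a communication kernel), and `waves` is a hypothetical helper, not part of the PR; it only models how many rounds of blocks are needed when each SM runs one block at a time.

```python
import math


def waves(num_blocks, available_sms):
    """Rounds needed to run num_blocks when each SM hosts one block at a time."""
    return math.ceil(num_blocks / available_sms)


total_sms = 132  # assumed SM count for the example
busy_sms = 8     # assumed SMs held by a concurrent communication kernel
available = total_sms - busy_sms

# Persistent kernel sized for the whole GPU: 132 blocks on 124 free SMs.
without_carveout = waves(total_sms, available)

# Kernel told to leave 8 SMs alone: 124 blocks on 124 free SMs.
with_carveout = waves(total_sms - busy_sms, available)

print(without_carveout, with_carveout)
```

In the first case the second wave runs only 8 blocks while the rest of the GPU idles, which is why the latency can roughly double; in the second case each block simply processes slightly more tiles.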

For now I only change the cuBLAS kernels and the scaled-mm CUTLASS kernel. More kernels can be opted-in later.

I tested this change manually, by using the Kineto profiler to look up the grid size of a scaled-mm kernel with different values of `sm_carveout`, and making sure it changed. Suggestions are welcome for a more automated test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147966
Approved by: https://github.com/danthe3rd
2025-02-26 12:01:12 +00:00
7ffae2c028 Split test_transformers.py (#147441)
Split test_transformers.py into test_transformers.py and test_transformers_privateuse1.py. Currently the privateuse1 test cases in test_transformers.py are skipped since they conflict with the CUDA test cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147441
Approved by: https://github.com/drisspg
2025-02-26 11:54:24 +00:00
cf6d1e6824 [dynamo] add generic graph break hints (#147429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147429
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #147385
2025-02-26 09:20:28 +00:00
3fd68e4e2f [dynamo] make some more graph break messages readable in English [2/N] (#147385)
This is for "for some large number Z, make sure the error messages are readable English." - beginning to audit all `unimplemented` sites and making sure that all messages are at least English-readable. Hints may not necessarily be provided.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147385
Approved by: https://github.com/jansel
2025-02-26 09:20:28 +00:00
7a06bfdd1c [inductor][ck] kBatch parametrized (#147885)
Summary:
# Why

Enable us to set the kBatch parameter, rather than bake it in

Especially for larger splitK scenarios, this can yield very good performance (up to 1.5x vs hipblaslt from initial tests)

## Why like this

The obvious question should be: why not add this to the op itself, and maybe even into the template/kernel. That would simplify the code.

The choice to have it as a "runtime" param that we fix is to be able to reuse the compiled CK `.so` libraries, as now multiple choices of kBatch can be used with the exact same `.so` (the shared library does not depend on kBatch, but takes it as a parameter)

# What

- copy cutlass approach for swizzle to have a "runtime" arg that we pass in but is really choice dependent
- pipe through everything from template and kernel
- hard-code it to be kBatch=1 for now (same as before, just now settable)

This is part of a series of Diffs, where next we need to figure out
1. how to filter out ops + kBatch that don't work
2. set this better for splitK scenarios (hand written heuristic)

Test Plan:
(with minor modifications)

```
# show it working with AOTI
buck2 run mode/opt-amd-gpu //scripts/henrylhtsang/repros:aot
```

```
# show it working with inductor only
buck2 run -c fbcode.re_gpu_tests=False mode/opt-amd-gpu  fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0
```

Differential Revision: D70200008

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147885
Approved by: https://github.com/ColinPeppler
2025-02-26 07:28:19 +00:00
a84db75e1b Revert "torch._scaled_mm with MXFP8 (#147548)"
This reverts commit 12b9674cb603438639298d6c9757ea93e18a7289.

Reverted https://github.com/pytorch/pytorch/pull/147548 on behalf of https://github.com/wdvr due to failing internal build - similar to previous, see below ([comment](https://github.com/pytorch/pytorch/pull/147548#issuecomment-2684134336))
2025-02-26 07:17:24 +00:00
4216478250 Fix the benchmark config name from H100 benchmark (#147947)
When using the wrong benchmark configs, the benchmark jobs will be skipped.  The name should have the `_cuda_h100` suffix as used in the test matrix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147947
Approved by: https://github.com/wdvr
2025-02-26 06:40:07 +00:00
4ec6c1d1ec Fix test_halide.py report invocation to re-run failed tests (#147640)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147640
Approved by: https://github.com/jansel
2025-02-26 06:32:22 +00:00
acca9b9cb0 Revert "[AOTI][refactor] Consolidate CppBuilder.build and CppBuilder.build_fbcode_cpu_re (#147803)"
This reverts commit 0b9da1ae0ad30ef228f132354b875bcaec214ace.

Reverted https://github.com/pytorch/pytorch/pull/147803 on behalf of https://github.com/wdvr due to breaking internal tests, discussed with author ([comment](https://github.com/pytorch/pytorch/pull/147803#issuecomment-2683938121))
2025-02-26 05:32:17 +00:00
12b9674cb6 torch._scaled_mm with MXFP8 (#147548)
# summary

Add blockwise MXFP8 support to `torch._scaled_mm` on CUDA capability 10.0 and higher devices.  If the scales for A and B are of dtype `torch.float8_e8m0fnu`, we dispatch to the blockwise kernel from cuBLAS.

This is a skeleton PR where we test basic functionality (numerics of various simple matrices, as well as one end to end quantization + gemm).

- Scales are flipped based on transpose_result
- Handles boundary conditions

Note that MXFP4 is not added in this PR - we can tackle that in a future PR.

This PR was created by taking https://github.com/pytorch/pytorch/pull/145562, switching e8m0 to in-core dtype, removing fp4 for now, and adding test cases.

# test plan

```
pytest test/test_matmul_cuda.py -k blockwise_mxfp8 -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147548
Approved by: https://github.com/drisspg

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-02-26 05:21:26 +00:00
9ed40af917 [BE][EZ] Delete MacOS-12.3 xfail list (#147905)
As PyTorch requires at least MacOS-13 (and Metal-3) to work, delete any pre-MacOS-13 checks from the test script
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147905
Approved by: https://github.com/dcci
ghstack dependencies: #147892
2025-02-26 05:08:09 +00:00
a2399c9b44 [BE] Switch index_variable to torch.testing.make_tensor (#147892)
As it was a long-time todo and actually unblocks using this function for MPS devices (that do not support double)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147892
Approved by: https://github.com/dcci
2025-02-26 05:08:09 +00:00
c839fa4dd2 [Resubmit] Record input strides at time of tracing, constrain to them for triton fn (#147861)
Resubmit of https://github.com/pytorch/pytorch/pull/145448. It lost its changes on rebase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147861
Approved by: https://github.com/zou3519
2025-02-26 05:05:06 +00:00
ba25e26baa [ROCm] Use IPT=8 for block radix sort (#147657)
Improve performance for shapes that use block radix sort by decreasing the item_per_thread to 8.
This will increase the thread block size leading to higher occupancy.

Co-author: @amd-sushetty

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147657
Approved by: https://github.com/jeffdaily
2025-02-26 04:22:16 +00:00
f211818bc0 [c10d] Restrict use condition of NCCL mem pool (#147764)
Add a check to see whether the CUDA driver supports multicast, as is done in Symmetric Memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147764
Approved by: https://github.com/syed-ahmed, https://github.com/yifuwang
2025-02-26 03:40:00 +00:00
d3fc583ff0 [cutlass backend] force_disable_caches for test_number_mm_precompiles (#147901)
Summary: Test is flaky right now.

Differential Revision: D70209511

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147901
Approved by: https://github.com/ColinPeppler
2025-02-26 03:22:49 +00:00
9ad0ad6497 [MPS] Introduce a shader for entr(). (#147914)
To be used in eager/inductor in order to implement the missing operation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147914
Approved by: https://github.com/malfet
2025-02-26 02:54:44 +00:00
805f7d97f7 [Inductor][Optimus] Fix a corner case in split cat aten pass (#147784)
Summary: We need to further check the input of the cat to make sure all of them are from the same split node.

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_cat_post_grad
```

Buck UI: https://www.internalfb.com/buck2/c875cbdd-5374-46cf-811c-45f91cf6ba3e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/10977524161964655
Network: Up: 64KiB  Down: 27KiB  (reSessionID-2e5915cb-4894-48f6-ab1c-3981adb42dab)
Executing actions. Remaining     0/3                                                                         1.5s exec time total
Command: test.     Finished 2 local
Time elapsed: 2:52.1s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

before
aps-recgpt_ig_emb_pt2_comment_out-30c4d5127e

tlparse:
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-recgpt_ig_emb_pt2_comment_out-30c4d5127e/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

after
aps-recgpt_ig_emb_pt2_comment_out-c03f74e353

Differential Revision: D70132209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147784
Approved by: https://github.com/Microve
2025-02-26 02:19:48 +00:00
b533bb4b13 optimize the decomposition of aten.native_group_norm (#144733)
Summary:
Optimize the decomposition of aten.native_group_norm. Reduce unnecessary repeated operations by changing the order of operations for `mean`, `rstd`, `weight`, `bias` and `input`, which can improve performance when `flattened_inner_size` is large.

The original decomposition:
1. compute `mean` and `rstd`,
2. out = (x - mean) * rstd, compute in the range [N, C, *],
3. out = out * weight + bias, compute in the range [N, C, *],

The new decomposition:
1. compute `mean` and `rstd`,
2. new_weight = rstd * weight, new_bias = - mean * rstd * weight + bias, compute in the range [N, C],
3. out = x * new_weight + new_bias, compute in the range [N, C, *],
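
The two decompositions are algebraically equivalent, since (x - mean) * rstd * weight + bias = x * (rstd * weight) + (-mean * rstd * weight + bias). The pure-Python sketch below checks this numerically for a single sample; it is an illustration under simplifying assumptions, not the decomposition code from this PR, and all function names are hypothetical. The input is modeled as a list of C channels, each a flat list of values.

```python
import math


def group_stats(x, groups, eps):
    """Per-group mean and reciprocal std for one sample (list of channels)."""
    per_group = len(x) // groups
    stats = []
    for g in range(groups):
        vals = [v for ch in x[g * per_group:(g + 1) * per_group] for v in ch]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats.append((mean, 1.0 / math.sqrt(var + eps)))
    return stats, per_group


def group_norm_original(x, weight, bias, groups, eps=1e-5):
    stats, per_group = group_stats(x, groups, eps)
    out = []
    for c, ch in enumerate(x):
        mean, rstd = stats[c // per_group]
        normalized = [(v - mean) * rstd for v in ch]            # pass 1 over [C, *]
        out.append([v * weight[c] + bias[c] for v in normalized])  # pass 2 over [C, *]
    return out


def group_norm_folded(x, weight, bias, groups, eps=1e-5):
    stats, per_group = group_stats(x, groups, eps)
    out = []
    for c, ch in enumerate(x):
        mean, rstd = stats[c // per_group]
        new_weight = rstd * weight[c]            # computed once per channel
        new_bias = -mean * new_weight + bias[c]  # computed once per channel
        out.append([v * new_weight + new_bias for v in ch])  # single pass over [C, *]
    return out


x = [[0.5, -1.0, 2.0], [3.0, 0.0, 1.5], [-2.0, 4.0, 0.25], [1.0, 1.0, -3.0]]
weight = [1.0, 0.5, -2.0, 3.0]
bias = [0.0, 0.1, -0.2, 0.3]

a = group_norm_original(x, weight, bias, groups=2)
b = group_norm_folded(x, weight, bias, groups=2)
max_diff = max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))
print(max_diff)
```

The folded form does the per-channel arithmetic on small [N, C]-sized values and touches the large inner dimension only once, which is why the gain grows with `flattened_inner_size`.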

I tested the Inductor performance benchmark with this PR on both CPU and A100. On CPU, two torchbench models(functorch_dp_cifar10 and opacus_cifar10) have about 25% performance improvement, and two diffusion models(Stable Diffusion and Latent Consistency Model(LCM)) have about 2% performance improvement. On A100, no performance gains or regressions were seen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144733
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-02-26 01:42:46 +00:00
12112fd198 Fix bug in FSDP wrapped module with zero argument (#147771)
Fixes https://github.com/pytorch/pytorch/issues/147531

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147771
Approved by: https://github.com/awgu
2025-02-26 01:40:53 +00:00
8de6fe8c0b [docs] fix numpy docs reference (#147697)
Fix a link to numpy documentation that has moved and now 404's

I've checked other numpy doc links that point to docs.scipy.org (which then redirects to numpy.org) and they do work, so I am fixing just this 404.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147697
Approved by: https://github.com/soulitzer
2025-02-26 01:30:03 +00:00
90e3a3d86d Revert "[ca] trace saved variable unpacking (#147242)"
This reverts commit 68ddca94498fd7961cc5ebcb0dffafb8c2f4baca.

Reverted https://github.com/pytorch/pytorch/pull/147242 on behalf of https://github.com/wdvr due to failing tests in the slow workflow, see below ([comment](https://github.com/pytorch/pytorch/pull/147242#issuecomment-2683604547))
2025-02-26 00:40:16 +00:00
4d614baa30 Revert "[ca] side-effect free initial trace: GraphTask (#147796)"
This reverts commit 5758743f3c92f9cd9b61bc435602f13dd19c13d7.

Reverted https://github.com/pytorch/pytorch/pull/147796 on behalf of https://github.com/wdvr due to failing tests in the slow workflow, see below ([comment](https://github.com/pytorch/pytorch/pull/147796#issuecomment-2683599896))
2025-02-26 00:36:08 +00:00
143f0f0006 Revert "[ca] side-effect free inital trace: compiled_args (#147804)"
This reverts commit ec768d8dc04b334e01db1a90e4e6646e4e867e67.

Reverted https://github.com/pytorch/pytorch/pull/147804 on behalf of https://github.com/wdvr due to failing tests in the slow workflow, see below ([comment](https://github.com/pytorch/pytorch/pull/147804#issuecomment-2683594740))
2025-02-26 00:31:40 +00:00
3ecfe6be25 [Submodule] Turning flash-attention integration into 3rd party submod (#144120) (#146372)
Summary:

# Summary

### Sticky points

Cuda-graph rng handling has changed / deviated from original implementation. We will be left with a dangling 'offset' val and confusing naming due to BC

## Dependencies
- Flash PR: https://github.com/Dao-AILab/flash-attention/pull/1419

### Other Points
- The BC linter is complaining about losing generate.py and its functions which is not real BC surface
cc albanD

imported-using-ghimport

Test Plan:
Imported from OSS

Building in dev
`buck build @//mode/dev-nosan -c fbcode.nvcc_arch=h100a  //caffe2:ATen-cu --show-full-output    `

Running `nm` on the .so, I do see that the flash symbols are correctly named:
```
0000000001c3dfb0 t pytorch_flash::run_mha_bwd(pytorch_flash::Flash_bwd_params&, CUstream_st*)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c36080 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c360e0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c35fc0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c36020 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
```

Reviewed By: vkuzo

Differential Revision: D68502879

Pulled By: drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146372
Approved by: https://github.com/jbschlosser
2025-02-26 00:10:59 +00:00
276dfe8150 [dynamo][cpp-guards] Disable dict-tag optim if the guard_manager has child accessors (#147694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147694
Approved by: https://github.com/isuruf
2025-02-26 00:02:08 +00:00
8e7e5ba182 Add sparse tensors constructed via legacy constructor to _sparse_tensors_to_validate (#147759)
This is a redo of https://github.com/pytorch/pytorch/pull/147408 which added validation at the end of the legacy constructor calls.

The reason why I didn't land that was because in `legacy_load`, constructor would be called before storages of indices/values are set. So the tensor would not actually be validated.

Technically, torch.sparse.{Foo}Tensor should not even be called by our rebuild process since afaict this was the first PR that added support for sparse tensor serialization https://github.com/pytorch/pytorch/pull/27062 and it already uses `_rebuild_sparse_tensor` (which would add the rebuilt tensor to the list to validate), but torch.sparse.FooTensor is allowlisted

This PR adds tensors constructed as such to the list to validate at the end of torch.load.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147759
Approved by: https://github.com/albanD
2025-02-25 23:51:12 +00:00
c82c1411c6 Revert "torch._scaled_mm with MXFP8 (#147548)"
This reverts commit e34c15a05b027b9da0962c971d448138fcf94926.

Reverted https://github.com/pytorch/pytorch/pull/147548 on behalf of https://github.com/wdvr due to failing internal build - discussed with author ([comment](https://github.com/pytorch/pytorch/pull/147548#issuecomment-2683517851))
2025-02-25 23:28:15 +00:00
0633f63f0d [cutlass backend] try fix standlone runner test (#147811)
Differential Revision: [D70147859](https://our.internmc.facebook.com/intern/diff/D70147859/)

Trying to fix this test one last time, especially now that mixed mm is being removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147811
Approved by: https://github.com/chenyang78
2025-02-25 23:27:02 +00:00
05bc8fe62e Revert "follow up to #147548, fix regression on MI300 (#147878)"
This reverts commit cc444e75d540daff127f0210b7f8965a5c2b8d2a.

Reverted https://github.com/pytorch/pytorch/pull/147878 on behalf of https://github.com/wdvr due to temporary reverting to revert an older one in the stack ([comment](https://github.com/pytorch/pytorch/pull/147878#issuecomment-2683515567))
2025-02-25 23:25:59 +00:00
2df9a8d72d [Inductor][Tests] Update get_divisible_by_16 function in test_torchinductor.py to work correctly with new Triton (#147865)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147865
Approved by: https://github.com/davidberard98
2025-02-25 23:14:13 +00:00
1e894d2635 Revert "Add option to limit number of SMs used by matmul kernels (#144974)"
This reverts commit af2d63637ed025789679a17c241e6bb466508a1d.

Reverted https://github.com/pytorch/pytorch/pull/144974 on behalf of https://github.com/wdvr due to reverting in order to revert #147548 that causes a merge conflict ([comment](https://github.com/pytorch/pytorch/pull/144974#issuecomment-2683461733))
2025-02-25 22:46:38 +00:00
cc444e75d5 follow up to #147548, fix regression on MI300 (#147878)
Removing curly braces seemed superficial but broke MI300 rowwise matmul.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147878
Approved by: https://github.com/drisspg
2025-02-25 22:16:28 +00:00
a821d69d92 Fix register constant to be usable in exportz (#147533)
Differential Revision: [D69939737](https://our.internmc.facebook.com/intern/diff/D69939737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147533
Approved by: https://github.com/zou3519
2025-02-25 21:10:47 +00:00
0d31c621a3 Revert "[inductor][triton] Ignore block ptr advances for removed buffers (#147193)"
This reverts commit 17766b7aad0d9931bb6b3485fcf3d4c7532c3557.

Reverted https://github.com/pytorch/pytorch/pull/147193 on behalf of https://github.com/wdvr due to failing tests on trunk - see below ([comment](https://github.com/pytorch/pytorch/pull/147193#issuecomment-2683286358))
2025-02-25 21:04:04 +00:00
6eb3d1e762 [DCP] Cache save plans in default planner (#147343)
Summary:
This PR caches the save plans to significantly reduce the collective cost for successive checkpoint save attempts. Here is the high level approach:
- Create the local plan and cache it.
- In the next iteration, compare the local plan with the cached plan metadata. If nothing changed, do not send the local plan in the collective.
- The global plan step only creates the global plan from the new delta plans, with empty plans for the cached ones.
- The finish plan step checks for empty plans. If a plan is empty, it grabs the cached plan; otherwise it uses the new plan provided.
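
The flow above can be sketched for a single rank roughly as follows. This is a toy model, not the planner's code: `Plan`, `EMPTY`, and `CachingPlanner` are hypothetical names, and the real plan objects in `torch.distributed.checkpoint` carry far more metadata.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plan:
    """Hypothetical stand-in for one rank's save plan."""
    items: tuple = ()


# Marker sent in the collective when the local plan did not change.
EMPTY = Plan()


@dataclass
class CachingPlanner:
    """Hypothetical per-rank planner that caches its previous plans."""
    cached_local: Optional[Plan] = None
    cached_final: Optional[Plan] = None

    def create_local_plan(self, plan: Plan) -> Plan:
        # Compare with the cached plan; send nothing if it is unchanged.
        if plan == self.cached_local:
            return EMPTY
        self.cached_local = plan
        return plan

    def finish_plan(self, from_global: Plan) -> Plan:
        # An empty plan from the global step means "reuse the cached plan".
        if from_global == EMPTY and self.cached_final is not None:
            return self.cached_final
        self.cached_final = from_global
        return from_global
```

On the second save with an unchanged state dict, `create_local_plan` returns `EMPTY`, so only the marker travels through the collective, and `finish_plan` restores the cached plan.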

Test Plan: UTs

Differential Revision: D69224491

## How to enable the caching:
`DefaultSavePlanner` introduces the `enable_plan_caching` option, which is set to False by default for now.
https://github.com/pytorch/pytorch/pull/147343/files#diff-579bbb7b82572753afa91085fbf954f7c7613ff8376da9b26153d5cc3a3c4ee8R77
Set this to True to enable the caching; we should see a significant speedup in subsequent checkpoint save attempts, especially for larger-scale jobs. Reference issue: https://github.com/pytorch/pytorch/issues/123695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147343
Approved by: https://github.com/MeetVadakkanchery
2025-02-25 20:59:25 +00:00
8d921eb97f export method (#147573)
The `export` API takes an `nn.Module` and traces its `forward` method. However, sometimes it is useful to export different methods of an `nn.Module`, either as a one-off for debugging or as a set of methods that are called in some sequence outside `export` (e.g., `encode` / `decode`). When multiple methods of the same module instance are exported, they should share the state of the common module instance.

This PR adds a couple of utils in `torch._export.utils` for this workflow.

The `wrap_method` util wraps a method as an `nn.Module` that can then be exported. See the included test. We recommend using the same module instance to export multiple methods on that instance, in which case they are guaranteed to share state. On serde, this state sharing is lost, so we provide another util, `sync_state`, to re-sync the state.

These utils are meant to be eventually replaced by API-level changes, but for now this can unblock users who need this workflow. In particular, in the future we can accept one or multiple method entrypoints, with their own args / kwargs / dynamic shape specifications, which can create a variant of `ExportedProgram` with multiple graphs that share state; then we can automatically ensure that the state sharing is preserved through serde.
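
To make the state-sharing concern concrete, here is a torch-free sketch. Everything in it is hypothetical: `Codec` stands in for an `nn.Module`, `MethodWrapper` and `resync` only mimic the idea behind `wrap_method` and `sync_state` (their real signatures are not shown in this message), and `copy.deepcopy` stands in for a serialize/deserialize round trip.

```python
import copy


class Codec:
    """Hypothetical stateful object standing in for an nn.Module."""

    def __init__(self):
        self.count = 0

    def encode(self, x):
        self.count += 1
        return x + self.count

    def decode(self, y):
        return y - self.count


class MethodWrapper:
    """Hypothetical wrapper exposing one method of `obj` as `forward`."""

    def __init__(self, obj, name):
        self.obj = obj
        self.name = name

    def forward(self, *args, **kwargs):
        return getattr(self.obj, self.name)(*args, **kwargs)


def resync(*wrappers):
    """Re-point every wrapper at the first wrapper's object."""
    shared = wrappers[0].obj
    for w in wrappers[1:]:
        w.obj = shared


m = Codec()
enc, dec = MethodWrapper(m, "encode"), MethodWrapper(m, "decode")
shared_before = enc.obj is dec.obj          # same instance, shared state

# Copying each wrapper separately (our stand-in for serde) breaks sharing.
enc2, dec2 = copy.deepcopy(enc), copy.deepcopy(dec)
shared_after_copy = enc2.obj is dec2.obj

# Re-syncing restores it, so encode/decode see the same counter again.
resync(enc2, dec2)
shared_after_resync = enc2.obj is dec2.obj
```

Without the re-sync step, `dec2` would decode with a stale counter, which is the kind of divergence the `sync_state` util is described as preventing.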

Differential Revision: [D69960801](https://our.internmc.facebook.com/intern/diff/D69960801/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147573
Approved by: https://github.com/tugsbayasgalan
2025-02-25 20:58:54 +00:00
687fe64667 Fix crash in -[PTMCoreMLCompiler _compileModel:atPath:] (#147809)
Summary:
We could hit one of those exceptions:
https://github.com/apple/coremltools/blob/main/modelpackage/src/ModelPackage.cpp#L205-L225

And it would make this code path crash.

Test Plan: build.

Differential Revision: D70122378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147809
Approved by: https://github.com/mcr229
2025-02-25 20:56:16 +00:00
ec768d8dc0 [ca] side-effect free initial trace: compiled_args (#147804)
Make methods const to prevent accidental mutation. The changes are mainly in Error nodes and PyNode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147804
Approved by: https://github.com/jansel
ghstack dependencies: #147242, #147796
2025-02-25 20:38:51 +00:00
5758743f3c [ca] side-effect free initial trace: GraphTask (#147796)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147796
Approved by: https://github.com/jansel
ghstack dependencies: #147242
2025-02-25 20:38:51 +00:00
68ddca9449 [ca] trace saved variable unpacking (#147242)
## Before

Previously, CA would always unpack all saved variables stored in the autograd graph before executing it. This meant that we couldn't capture unpack hooks as part of the CA graph, and they would fire out of order with respect to other backward hooks. For memory-saving APIs built on top of saved tensor hooks, like non-reentrant checkpointing and offloading, we couldn't achieve any savings because all activations would be recomputed/loaded and active at the same time, making them a no-op.

## After

We add unpack hooks into the CA graph so that they can be executed progressively. The python hook and hook input themselves are wrapped by non-traceable code, so CA polyfills the wrapping as:
```python
# pseudocode
class SavedVariable:
  def unpack(self):
    if self.hook:
      return self.hook(self.packed_data)
    else:
      return self.packed_data

# This approach won't directly work when we add support for Forward AD or double-backward.
```

When directly executing the CA graph (without torch.compiling it) under checkpointing/offloading, the memory profile is expected to stay the same as with the eager autograd engine. If AOT backward is in the autograd graph, the memory profile is expected to be better than with the eager autograd engine, since we can now delay the unpacking of saved activations into the AOT backward's execution.
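
The difference between unpacking everything up front and unpacking progressively can be illustrated with a small counting model. This is only a sketch under simplifying assumptions: `Tracker` and `peak_live` are hypothetical helpers, each saved variable is consumed exactly once, and "memory" is just the number of unpacked values alive at the same time.

```python
class SavedVariable:
    """Toy saved variable mirroring the pseudocode above."""

    def __init__(self, packed_data, hook=None):
        self.packed_data = packed_data
        self.hook = hook

    def unpack(self):
        if self.hook:
            return self.hook(self.packed_data)
        return self.packed_data


class Tracker:
    """Counts how many unpacked activations are alive at once."""

    def __init__(self):
        self.live = 0
        self.peak = 0

    def materialize(self, packed):
        # Plays the role of an unpack hook (recompute or load back).
        self.live += 1
        self.peak = max(self.peak, self.live)
        return packed

    def release(self):
        self.live -= 1


def peak_live(num_saved, progressive):
    tracker = Tracker()
    saved = [SavedVariable(i, hook=tracker.materialize) for i in range(num_saved)]
    if progressive:
        # Unpack each value right before its consumer runs, then free it.
        for sv in saved:
            sv.unpack()
            tracker.release()
    else:
        # Old behavior: unpack everything before executing the graph.
        unpacked = [sv.unpack() for sv in saved]
        for _ in unpacked:
            tracker.release()
    return tracker.peak
```

With four saved variables, the up-front strategy peaks at four live activations while the progressive one peaks at one, which is why the hooks need to run inside the graph for checkpointing or offloading to save anything.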

All tests pass when running the CA graph directly; the remaining issues are in Dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147242
Approved by: https://github.com/jansel
2025-02-25 20:38:51 +00:00
adf0f4ffd2 [custom op] fix inductor cpp codegen when returning a list of single tensor (#147649)
For a custom op that returns a list of a single tensor with unbacked symint shape:
```python
import torch

@torch.library.custom_op(
    "aoti_custom_ops::fn_ret_list_of_single_tensor", mutates_args={}
)
def fn_ret_list_of_single_tensor(x: torch.Tensor) -> list[torch.Tensor]:
    s = x.sum().to(torch.int64)
    return [torch.randn(s.item())]

@fn_ret_list_of_single_tensor.register_fake
def _(x):
    ctx = torch._custom_op.impl.get_ctx()
    i0 = ctx.new_dynamic_size()
    return [torch.randn(i0)]
```

Before the fix, we have the following error:
```
/tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: error: type/value mismatch at argument 1 in template parameter list for ‘template<class _Tp, class ... _Types> constexpr const _Tp& std::get(const std::variant<_Types ...>&)’
  456 |     auto u0 = std::get<0>(buf1).size(0);
      |               ~~~~~~~~~~~^~~~~~
/tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: note:   expected a type, got ‘0’
In file included from /data/users/yidi/pytorch/torch/include/c10/util/Exception.h:14,
                 from /data/users/yidi/pytorch/torch/include/c10/core/ScalarType.h:5,
                 from /data/users/yidi/pytorch/torch/include/ATen/AccumulateType.h:4,
                 from /data/users/yidi/pytorch/torch/include/ATen/native/Math.h:3,
                 from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/vec_base.h:31,
                 from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/vec512/vec512.h:8,
                 from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/vec.h:4,
                 from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/functional_base.h:6,
                 from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/functional.h:3,
                 from /tmp/tmp5iikarn2/3b/c3bi5gk6mslf6u4iaqafhxm64z6u65e3eain4xlary5blqnvv6xx.h:39,
                 from /tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:366:
/usr/include/c++/11/variant:1145:27: note: candidate: ‘template<class _Tp, class ... _Types> constexpr const _Tp&& std::get(const std::variant<_Types ...>&&)’
 1145 |     constexpr const _Tp&& get(const variant<_Types...>&& __v)
      |                           ^~~
/usr/include/c++/11/variant:1145:27: note:   template argument deduction/substitution failed:
/tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: error: type/value mismatch at argument 1 in template parameter list for ‘template<class _Tp, class ... _Types> constexpr const _Tp&& std::get(const std::variant<_Types ...>&&)’
  456 |     auto u0 = std::get<0>(buf1).size(0);
      |               ~~~~~~~~~~~^~~~~~
/tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: note:   expected a type, got ‘0’
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147649
Approved by: https://github.com/angelayi
ghstack dependencies: #147130
2025-02-25 20:28:41 +00:00
824474cb35 [cond] support output sizes mismatch in front end (#147130)
This PR finishes https://github.com/pytorch/pytorch/pull/137615 by addressing the TODOs and comments left there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147130
Approved by: https://github.com/zou3519
2025-02-25 20:28:41 +00:00
de80b6f0d3 Updated test_cuda.py to rerun tests (#147040)
Initially, test_cuda::TestCudaMallocAsync::test_clock_speed and test_cuda::TestCudaMallocAsync::test_power_draw were skipped in this [commit](d4871750d9).

Pulled ROCm nightly image and verified these two tests run fine locally. Filed this PR to enable them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147040
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-02-25 19:58:42 +00:00
361b6c97cd cpp_wrapper: Fixup output code indentation (#147215)
Closes #142165.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147215
Approved by: https://github.com/desertfire
ghstack dependencies: #146109, #146424
2025-02-25 19:50:37 +00:00
7c515b2da4 cpp_wrapper: fix test_torchinductor* tests (#146424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146424
Approved by: https://github.com/desertfire
ghstack dependencies: #146109
2025-02-25 19:50:37 +00:00
46d1422afd cpp_wrapper: fix inductor triton tests (#146109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146109
Approved by: https://github.com/desertfire
2025-02-25 19:50:37 +00:00
9740d69e78 [logging] Add toplevel dynamo_compile / tlparse logging for AOTI (#147760)
Summary:
This adds the proper context managers in `compile_fx_aot` such that we get:
1) A toplevel chromium event (i.e., tlparse)
2) A single `dynamo_compile` log entry

Test Plan:
Before:
* Scuba (we only log the dynamo event): https://fburl.com/scuba/dynamo_compile/sandbox/gaqowzrd
* Perfetto trace: https://fburl.com/vol7r6w1

After:
* Scuba (we log the dynamo _and_ compile_fx_aot event): https://fburl.com/scuba/dynamo_compile/sandbox/cx2we8w8
* Perfetto trace (click on the toplevel event to see the additional metadata): https://fburl.com/sziy40r9

Differential Revision: D70113859

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147760
Approved by: https://github.com/desertfire
2025-02-25 19:41:39 +00:00
14b9f7f7bc Remove link to search survey (#147751)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147751
Approved by: https://github.com/malfet
2025-02-25 19:26:59 +00:00
17766b7aad [inductor][triton] Ignore block ptr advances for removed buffers (#147193)
Block ptr advancements should also be deferred, conditional on the associated buffer not being removed. For example, if `FusedSchedulerNode(op0-op1)` has a store in `SchedulerNode` `op0` that is read in `op1`, the store and the associated block ptr that would be created for `op0` in isolation are no longer needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147193
Approved by: https://github.com/jansel

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-25 19:14:55 +00:00
ea6938a1f7 Add XuehaiPan to CODEOWNERS for C++ PyTree utilities (#137408)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137408
Approved by: https://github.com/zou3519
2025-02-25 18:48:32 +00:00
580f1183b4 Enable ruff rule S324 (#147665)
Fixes #147627

- Add `S324` to `pyproject.toml`
- Run the check and clean up the warnings

```bash
lintrunner --take RUFF --all-files
```
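
For context, Ruff's `S324` rule flags weak hash functions such as MD5 and SHA-1 used through `hashlib`. The sketch below shows two common ways to satisfy it; which one this PR applied at each call site is not stated here, so treat it as an illustration. `data`, `cache_key`, and `strong_key` are made-up names, and the `usedforsecurity` keyword requires Python 3.9 or newer.

```python
import hashlib

data = b"cache key material"

# Flagged by S324: nothing says the weak hash is non-cryptographic.
# digest = hashlib.md5(data).hexdigest()

# Option 1: keep MD5 for a non-security purpose and say so explicitly.
cache_key = hashlib.md5(data, usedforsecurity=False).hexdigest()

# Option 2: switch to a stronger hash function.
strong_key = hashlib.sha256(data).hexdigest()
```

Option 1 keeps existing digests stable (useful for cache keys), while option 2 changes the digest values and length.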

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147665
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-25 18:27:34 +00:00
6061664266 Enabled force_shape_pad for triton tests in test_kernel_benchmark (#147620)
On ROCm runs, the padded path naturally turns out to be slower on our archs, so pad_mm opts out of padding, which fails these tests.

My understanding is that these tests don't check IF the operation should be padded in the first place, but HOW it is padded and whether that is done correctly. Moreover, the tests shouldn't really be hardware dependent or conditional on hardware.

Similar PR for reference: https://github.com/pytorch/pytorch/pull/141768

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147620
Approved by: https://github.com/jeffdaily, https://github.com/chenyang78, https://github.com/shunting314
2025-02-25 18:06:48 +00:00
651e6aacf9 [ROCm] Remove benign warning about missing amdgpu.ids (#147791)
Fixes #144203.

We build a custom libdrm when preparing our docker image.  We attempt to locate the amdgpu.ids file relative to the python binary, but this is not possible for venv installs of pytorch when the python binary is a symlink.  Not finding amdgpu.ids causes `torch.cuda.get_device_name()` to return "AMD Radeon Graphics" as a generic name instead of something specific such as "AMD Instinct MI250X / MI250".  The libdrm warning is noisy, so we are removing it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147791
Approved by: https://github.com/jeffdaily
2025-02-25 17:17:25 +00:00
e5a13410cd Fix the tiny doc descriptions (#147319)
As the title states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147319
Approved by: https://github.com/zou3519
2025-02-25 17:10:16 +00:00
346bbefa63 [BE] Parameterize TestSDPA in test_mps.py (#147856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147856
Approved by: https://github.com/Skylion007
2025-02-25 16:07:24 +00:00
810d2a3dbd [ARM] Fix bug in _ref_test_helper in test_ops and fix failing test on Aarch64 (#146597)
We have a failing unit test on Aarch64:

```
Exception: Caused by reference input at index 34: SampleInput(input=Tensor[size=(5, 5, 4), device="cpu", dtype=torch.complex64, contiguous=False], args=(), kwargs={}, broadcasts_input=False, name='')

To execute this test, run the following from the base repo dir:
    PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=34 python test/test_ops.py TestCommonCPU.test_python_ref__refs_square_cpu_complex64

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```

After debugging, I found that the `ex` variable is not reset to None on each loop iteration inside `_ref_test_helper`. Fixing that highlighted another expectedFailure to re-enable, `nn.functional.hinge_embedding_loss`, which was incorrectly being skipped due to the same problem.

4a545eb85d/test/test_ops.py (L546)
The `ex` variable is not reset after this line for the next loop iteration.
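
The bug pattern is easy to reproduce outside the test suite. The following is a minimal sketch, not the code from `test_ops.py`: `run_cases` and its arguments are hypothetical, and the only point is that an exception captured in one iteration leaks into later ones unless the variable is reset.

```python
def run_cases(cases, reset_each_iteration):
    """Return the names of the cases reported as failing."""
    reported = []
    ex = None
    for name, fn in cases:
        if reset_each_iteration:
            ex = None  # the fix: start every iteration with a clean slate
        try:
            fn()
        except Exception as e:
            ex = e
        if ex is not None:
            # Without the reset, a stale exception from an earlier
            # iteration makes a passing case look like a failure.
            reported.append(name)
    return reported


cases = [
    ("ok_before", lambda: None),
    ("bad", lambda: 1 / 0),
    ("ok_after", lambda: None),
]
stale = run_cases(cases, reset_each_iteration=False)
fixed = run_cases(cases, reset_each_iteration=True)
```

Here `stale` wrongly includes `ok_after`, while `fixed` contains only `bad`.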

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146597
Approved by: https://github.com/digantdesai
2025-02-25 14:15:10 +00:00
a695aae89b [MPS] fix attention for >4d tensors (#147545)
Fixes #147443

and adds tests for >4d tensors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147545
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-25 13:55:28 +00:00
0b9da1ae0a [AOTI][refactor] Consolidate CppBuilder.build and CppBuilder.build_fbcode_cpu_re (#147803)
Summary: Let CppBuilder handle all the cpp build logic

Differential Revision: [D70146185](https://our.internmc.facebook.com/intern/diff/D70146185)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147803
Approved by: https://github.com/malfet
ghstack dependencies: #147805, #147806, #147807
2025-02-25 13:33:12 +00:00
cc1c9826d4 [AOTI][refactor] Fix a typo (#147807)
Summary: defination -> definition

Differential Revision: [D70146182](https://our.internmc.facebook.com/intern/diff/D70146182)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147807
Approved by: https://github.com/malfet
ghstack dependencies: #147805, #147806
2025-02-25 13:33:12 +00:00
7ed0670e21 [AOTI][refactor] Replace run_command_and_check with CppBuilder.build (#147806)
Summary: Consolidate cpp compilation action to CppBuilder. Reland https://github.com/pytorch/pytorch/pull/147680

Differential Revision: [D70146183](https://our.internmc.facebook.com/intern/diff/D70146183)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147806
Approved by: https://github.com/malfet
ghstack dependencies: #147805
2025-02-25 13:33:03 +00:00
2680e835c8 [AOTI][refactor] Rename use_absolute_path to use_relative_path (#147805)
Summary: The option really means to compile a cpp file using its basename instead of its full path. Reland https://github.com/pytorch/pytorch/pull/147679.

Differential Revision: [D70146184](https://our.internmc.facebook.com/intern/diff/D70146184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147805
Approved by: https://github.com/malfet
2025-02-25 13:32:54 +00:00
7e37fb0a4c [MPS] faster integer matmul for mps (#147526)
MPS has a naive matmul kernel that is used when the input types are integer (and in some other cases on older macOS versions). The old kernel reads directly from global memory, which really tanks performance, especially when the matrices are sufficiently large.

This PR optimizes it (there may be further gains from using simdgroup matrices, which I'll cover in a follow-up since writing that kernel will take more time).

## Performance comparison on M1 Pro:
![performance_comparison](https://github.com/user-attachments/assets/6ea8de5a-8231-4c5b-8dc9-caa79ea6879a)

You can get these numbers by running this script with the old kernel compiled and then with the new kernel compiled (make sure to change the CSV file each output is written to):
```python
import torch
import numpy as np
import time
import csv

matrix_sizes = [32, 128, 512, 1024, 2048, 4096]
num_runs = 10
warmup_runs = 3

def run_int_mm(A, B):
    torch.mps.synchronize()
    start = time.perf_counter()
    c = A @ B
    torch.mps.synchronize()
    end = time.perf_counter()
    return c, end - start

results = {
    'N': [],
    'mean_time': [],
    'std_time': []
}

for n in matrix_sizes:
    print(f"\nBenchmarking N={n}")

    try:
        A_mps = torch.randint(low=-100, high=100, size=(n, n), dtype=torch.int8, device="mps")
        B_mps = torch.randint(low=-100, high=100, size=(n, n), dtype=torch.int8, device="mps")

        for _ in range(warmup_runs):
            _, _ = run_int_mm(A_mps, B_mps)

        times = []
        for _ in range(num_runs):
            _, t = run_int_mm(A_mps, B_mps)
            times.append(t)

        mean_time = np.mean(times)
        std_time = np.std(times)

        results['N'].append(n)
        results['mean_time'].append(mean_time)
        results['std_time'].append(std_time)

        print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

    except RuntimeError as e:
        print(f"Error for N={n}: {e}")
        continue

with open('int_mm_benchmark_times_old.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])
```
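
Once both runs have produced a CSV, the per-size speedup can be computed with a few lines like the ones below. This is a sketch: `load_mean_times` and `speedups` are hypothetical helpers, and the name of the second CSV file is an assumption (it is whatever path you switched the script to for the new-kernel run).

```python
import csv


def load_mean_times(lines):
    """Map N -> mean_time from CSV lines written by the script above."""
    return {int(row["N"]): float(row["mean_time"]) for row in csv.DictReader(lines)}


def speedups(old, new):
    """Old/new time ratio for every size present in both runs."""
    return {n: old[n] / new[n] for n in sorted(old.keys() & new.keys())}


# Usage (the second file name is an assumption):
# with open("int_mm_benchmark_times_old.csv", newline="") as f_old, \
#      open("int_mm_benchmark_times_new.csv", newline="") as f_new:
#     ratios = speedups(load_mean_times(f_old), load_mean_times(f_new))
#     for n, s in ratios.items():
#         print(f"N={n}: {s:.2f}x")
```
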
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147526
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-25 13:15:18 +00:00
b63c601614 Update merge rules for oneDNN part (#147615)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147615
Approved by: https://github.com/atalman
2025-02-25 11:26:59 +00:00
af2d63637e Add option to limit number of SMs used by matmul kernels (#144974)
Newer matmul kernels, e.g. those targeting Hopper GPUs, sometimes use a "persistent" schedule, which consists of launching as many CUDA blocks as there are SMs on the GPU, with each block then working on multiple output tiles in a row. This eliminates the overhead of starting and finishing each tile, effectively doing cross-tile pipelining. In previous generations these latencies could be hidden by having multiple CUDA blocks per SM but, with blocks becoming larger, only one can run at a time per SM, and thus this needs to be taken care of in software.

Persistent kernels become an issue when other kernels are running concurrently. The classic example is an NCCL communication kernel running in the background. In such cases the matmul expects to be able to use all the SMs but is prevented from doing so because some of them are busy. This can lead to its blocks being scheduled as two separate waves on the available SMs. This "wave quantization" can double the latency of the matmul kernels.

While we wait for smarter solutions, such as automatic load balancing among the blocks, an easy way to unblock ourselves is to tell the matmuls to only use a subset of the GPU's SMs. For this, I am introducing a global `sm_carveout` flag which can be used to specify how many SMs should be left available for other kernels.
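
The wave-quantization arithmetic can be checked with a back-of-the-envelope model. This is a sketch under stated assumptions: one persistent block per SM, the SM counts are illustrative, and `waves` and `persistent_grid` are hypothetical helpers rather than anything in the PyTorch API.

```python
import math


def waves(num_blocks, free_sms):
    """Number of scheduling waves when each SM runs one block at a time."""
    return math.ceil(num_blocks / free_sms)


def persistent_grid(total_sms, sm_carveout=0):
    """Blocks launched by a persistent kernel that leaves some SMs unused."""
    return total_sms - sm_carveout


TOTAL_SMS = 132  # illustrative SM count
BUSY_SMS = 8     # hypothetical: SMs held by a background communication kernel
FREE_SMS = TOTAL_SMS - BUSY_SMS

# Without a carveout: 132 blocks compete for 124 free SMs.
no_carveout = waves(persistent_grid(TOTAL_SMS), FREE_SMS)

# With a carveout matching the busy SMs: 124 blocks fit in a single wave.
with_carveout = waves(persistent_grid(TOTAL_SMS, BUSY_SMS), FREE_SMS)
```

In this model the kernel needs two waves without the carveout and one wave with it, matching the "can double the latency" observation above.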

For now I only change the cuBLAS kernels and the scaled-mm CUTLASS kernel. More kernels can be opted in later.

I tested this change manually, by using the Kineto profiler to look up the grid size of a scaled-mm kernel with different values of `sm_carveout`, and making sure it changed. Suggestions are welcome for a more automated test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144974
Approved by: https://github.com/eqy, https://github.com/albanD
2025-02-25 10:19:19 +00:00
94969d0a40 [inductor][user triton] Handle scf.yield more accurately (#147762)
**TL;DR**: Previously, the mutation analysis for scf.if/scf.for would bundle all the scf.yield arguments into a single op (the scf.yield), such that a mutation on any returned value from the scf.if/scf.for would register as a mutation to _all_ of the scf.yield args. To fix this, this PR artificially introduces a new scf.yield op for each of the scf.yield args.

**Context**: The relevant kernel is something like this one (added as a test in test_triton_kernels.py)

```python
        @triton.jit
        def branch_with_multiple_yield_args(
            in_ptr0,
            in_ptr1,
            out_ptr,
            conditional_ptr,
            n_elements,
            BLOCK_SIZE: "tl.constexpr",
        ):
            pid = tl.program_id(axis=0)
            block_start = pid * BLOCK_SIZE
            offsets = block_start + tl.arange(0, BLOCK_SIZE)
            mask = offsets < n_elements
            conditional = tl.load(conditional_ptr)
            if conditional:
                in0 = in_ptr0 + 1
                in1 = in_ptr1 + 1
                out = out_ptr + 1
            else:
                in0 = in_ptr0
                in1 = in_ptr1
                out = out_ptr
            x = tl.load(in0 + offsets, mask=mask)
            y = tl.load(in1 + offsets, mask=mask)
            tl.store(out + offsets, x + y, mask=mask)
```

The mutation analysis starts with the `tl.store` - and then does a DFS backwards towards the parameters.  When a new op is encountered in the DFS, the analysis pass recurses on the op's arguments.

The if branch gets converted to TTIR like this:

```mlir
    %21:3 = scf.if %20 -> (!tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32>) {
      ...
      scf.yield %31, %32, %33 : !tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32> loc(#loc10)
    } else {
      scf.yield %arg0, %arg1, %arg2 : !tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32> loc(#loc11)
    } loc(#loc7)
```

and so the "source" op of the `out` variable is marked as the `scf.yield` op - and then all of the arguments to `scf.yield` are marked as mutable (including arg0, arg1, and arg2 - only one of which is actually mutated).

**This PR**: we duplicate the `scf.yield` to add one `scf.yield` per return value. That way we avoid marking all the returns from the scf.if/scf.for as mutated when only some are.
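
The effect of splitting the yield can be shown on a toy dependency graph. This is a simplified sketch, not the analysis pass itself: the graph encoding, `mutated_params`, and the node names are hypothetical, and the pointer arithmetic inside the `if` branch is left out.

```python
def mutated_params(graph, stored_value, params):
    """DFS backwards from the stored pointer; return the parameters reached."""
    seen, stack, hit = set(), [stored_value], set()
    while stack:
        value = stack.pop()
        if value in seen:
            continue
        seen.add(value)
        if value in params:
            hit.add(value)
        # Recurse into the arguments of the op that produced `value`.
        stack.extend(graph.get(value, ()))
    return hit


params = {"in_ptr0", "in_ptr1", "out_ptr"}

# Before: one yield op is the source of all three results of the scf.if.
bundled = {
    "in0": ["yield"], "in1": ["yield"], "out": ["yield"],
    "yield": ["in_ptr0", "in_ptr1", "out_ptr"],
}

# After: one artificial yield per returned value.
split = {
    "in0": ["yield0"], "in1": ["yield1"], "out": ["yield2"],
    "yield0": ["in_ptr0"], "yield1": ["in_ptr1"], "yield2": ["out_ptr"],
}

before = mutated_params(bundled, "out", params)
after = mutated_params(split, "out", params)
```

With the bundled yield, the walk from `out` reaches all three parameters; with one yield per value it reaches only `out_ptr`.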

Differential Revision: [D70118202](https://our.internmc.facebook.com/intern/diff/D70118202)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147762
Approved by: https://github.com/oulgen, https://github.com/zou3519
2025-02-25 08:41:00 +00:00
7bd2e3bca1 Update torch-xpu-ops commit pin (#147743)
Update the torch-xpu-ops commit to [306a0ffb6e0cae27c5bd9a3b9cd378048c8e00e7](306a0ffb6e), includes:

- Bugfix (LayerNorm/Nonzeros)
- Update AOT target

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147743
Approved by: https://github.com/EikanWang
2025-02-25 08:06:35 +00:00
866dc45d3c [Inductor][ROCm][CK] Unhardedcoded kernel shapes for ck_conv_template codegen (#147504)
## [Inductor][ROCm][CK] Parameterize `ck_conv_template` Codegen

### Description
Previously, ROCm CK kernel codegen templates were hardcoded with fixed values for convolution parameters:

- `index_t GroupCount`
- `index_t NBatch`
- `index_t NOutChannels`
- `index_t NInChannels`
- `vector<index_t> FilterSize`
- `vector<index_t> InputSize`
- `vector<index_t> ConvolutionStrides`
- `vector<index_t> Dilations`
- `vector<index_t> LeftPads`
- `vector<index_t> RightPads`

This PR updates `ck_conv_template` to accept these parameters dynamically from Inductor. By doing so, we reduce the number of generated templates, improving flexibility and maintainability.

### Testing
- Verified correctness by running relevant test cases, i.e., `test/inductor/test_ck_backend.py`
- Ensured generated kernels reflect the updated parameterization, i.e., generated templates in `/tmp/torchinductor_root/`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147504
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/tenpercent

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-02-25 07:48:07 +00:00
d73b927662 [DSD] Fixes issue when there is a PG without parameters (#147730)
Fixes https://github.com/pytorch/pytorch/issues/143828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147730
Approved by: https://github.com/mori360
2025-02-25 07:25:38 +00:00
fb73b0c7c5 Revert "use copy2d in h2d/d2h copy when possible (#146256)"
This reverts commit 0bc036a9e98d2cc92ff9dd367342b1f2efcc15f0.

Reverted https://github.com/pytorch/pytorch/pull/146256 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/146256#issuecomment-2680868627))
2025-02-25 07:06:38 +00:00
bb7e8fbd66 [CacheBench] Add hf_T5 llama moco to cachebench (#147783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147783
Approved by: https://github.com/huydhn
ghstack dependencies: #147688, #147780, #147781, #147782
2025-02-25 04:34:45 +00:00
895564d6b6 [CacheBench] Add huggingface (#147782)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147782
Approved by: https://github.com/huydhn
ghstack dependencies: #147688, #147780, #147781
2025-02-25 04:34:45 +00:00
c4fb6ae55d [CacheBench] Separate dynamic into its own option (#147781)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147781
Approved by: https://github.com/huydhn
ghstack dependencies: #147688, #147780
2025-02-25 04:34:34 +00:00
60d4cbfc06 [CacheBench] Add repeat option so that we can have more accurate cache results (#147780)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147780
Approved by: https://github.com/huydhn
ghstack dependencies: #147688
2025-02-25 04:34:25 +00:00
ab3b814af3 [CacheBench] Add ciflow/trunk test (#147688)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147688
Approved by: https://github.com/huydhn
2025-02-25 04:34:16 +00:00
4b7604ec10 Delete Mixed MM Special Casing (#147151)
Now that torchinductor supports prologue fusion we can delete all the mixed mm code. When I benchmarked int8 weight only mm in the new path compared to int8mm in the old path in the [following benchmark](https://gist.github.com/eellison/46e321709572c11c077d0612cb3492b7) I got a 1.244x geomean speedup comparing Huggingface linear shapes with bias. There are a couple of reasons for the speedup:

- Prologue fusion is often unprofitable, even for int8 mm. Because the current mixed mm benchmarking only compares triton_int8_mm vs (dtype_conversion + cublas), we miss out on scenarios where the triton template is profitable but the prologue fusion is not.
- Similarly, we miss out on potential epilogue fusions like bias if we dispatch to the [fallback mixed mm](5006932cbc/torch/_inductor/kernel/mm.py (L750-L751)) that mixed_mm dispatches to, instead of going through the deferred epilogue tuning in the current path.

It's possible some of the speedups would be smaller on larger models where the epilogue might get fused into a following kernel. Nonetheless, even if this is perf neutral it is worth landing for code deduplication.
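
For readers unfamiliar with the metric, a geomean speedup is the geometric mean of the per-shape speedup ratios, so that a 2x win and a 2x loss cancel out. A minimal sketch (the `geomean` helper is hypothetical and is not taken from the linked benchmark):

```python
import math


def geomean(ratios):
    """Geometric mean of positive speedup ratios (old_time / new_time)."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

For example, `geomean([2.0, 0.5])` is 1.0 up to floating-point rounding, whereas the arithmetic mean of the same ratios would misleadingly report 1.25.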

The one kernel that is a little special and would not fall out of the prologue fusion is the uint4x2_mixed_mm kernel. It is still possible to generate it with prologue fusion, but currently not exactly as the current [impl](bd370c138a/torch/_inductor/kernel/unpack_mixed_mm.py (L43-L49)). However, the current impl does not compare against a cublas baseline, and I found that it is making things slower (35% slower on a not particularly big 1024, 1024, 1024 mm shape on H100), so this should be fine to delete.

Future optimizations could include:

- cutlass prologue path
- making prologue fusion support the persistent tma based mm template. from @drisspg's experience this led to nice wins with fp8 but not as nice wins with bf16 mm. I think similarly, lower memory bandwidth int8 mm would benefit.

Differential Revision: [D70114858](https://our.internmc.facebook.com/intern/diff/D70114858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147151
Approved by: https://github.com/drisspg, https://github.com/cpuhrsch
2025-02-25 04:29:54 +00:00
890213f65f Revert "[AOTI][refactor] Rename use_absolute_path to use_relative_path (#147679)"
This reverts commit 0b52d801d2297ad6c38e631eedfd4dead9360e1b.

Reverted https://github.com/pytorch/pytorch/pull/147679 on behalf of https://github.com/desertfire due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147679#issuecomment-2680389225))
2025-02-25 04:11:13 +00:00
9b06b30468 Revert "[AOTI][refactor] Replace run_command_and_check with CppBuilder.build (#147680)"
This reverts commit 22fae0d948ac14c72b510fafc2283072d744dff9.

Reverted https://github.com/pytorch/pytorch/pull/147680 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147680#issuecomment-2680383986))
2025-02-25 04:06:40 +00:00
9478c90e2b [Quant] flip: throw runtime error for QUInt4x2 and QUInt2x4 input (#147430)
Fixes #147208

**Summary**
The `flip` op causes memory corruption for `torch.quint4x2` and `torch.quint2x4` inputs because the TensorIterator-based implementation does not support multiple elements per byte. Since `torch.quint4x2` and `torch.quint2x4` are deprecated in PyTorch, we add a check that throws a runtime error if the input dtype is `torch.quint4x2` or `torch.quint2x4`.
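
To illustrate why byte-granular iteration cannot implement `flip` for sub-byte dtypes, here is a pure-Python sketch. It assumes two 4-bit values per byte with the first value in the high nibble (the nibble order is an assumption for illustration, not taken from the PyTorch source), and it does not reproduce the memory corruption itself:

```python
def pack_nibbles(values):
    """Pack 4-bit values two per byte, first value in the high nibble."""
    if len(values) % 2 or any(not 0 <= v < 16 for v in values):
        raise ValueError("need an even number of 4-bit values")
    return bytes((values[i] << 4) | values[i + 1] for i in range(0, len(values), 2))


def unpack_nibbles(data):
    """Inverse of pack_nibbles."""
    out = []
    for byte in data:
        out.extend((byte >> 4, byte & 0xF))
    return out


values = [1, 2, 3, 4]
packed = pack_nibbles(values)  # bytes 0x12, 0x34

# Reversing storage one byte at a time keeps each pair in its original order.
bytewise_flip = unpack_nibbles(packed[::-1])
# What flip should produce: every logical element reversed.
elementwise_flip = values[::-1]
```

The two results differ, which is why an iterator that assumes at least one byte per element gives wrong answers here.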

**Test plan**
```
pytest -s test/test_shape_ops.py -k test_flip_unsupported_dtype
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147430
Approved by: https://github.com/mingfeima, https://github.com/ngimel
2025-02-25 03:47:40 +00:00
20295c017e Fix import of getArtifactLogger for ir_pre_fusion and ir_post_fusion (#147560)
Fixes #147002

There was an issue with the previous PR https://github.com/pytorch/pytorch/pull/147248 that didn't show up in CI,
where a logging import was not complete in torch/_inductor/debug.py before importing it.
This only happened if someone directly imported the file without doing any other imports before.

Also set to off_by_default by request to reduce log spew.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147560
Approved by: https://github.com/Skylion007
2025-02-25 03:36:08 +00:00
e34c15a05b torch._scaled_mm with MXFP8 (#147548)
# summary

Add blockwise MXFP8 support to `torch._scaled_mm` on CUDA capability 10.0 and higher devices.  If the scales for A and B are of dtype `torch.float8_e8m0fnu`, we dispatch to the blockwise kernel from cuBLAS.
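
As background on the scale dtype, an e8m0 value is an 8-bit, exponent-only number. The sketch below decodes one in pure Python, assuming the OCP Microscaling encoding (bias 127, `0xFF` reserved for NaN); `decode_e8m0` and `dequantize_block` are hypothetical helper names, not PyTorch or cuBLAS APIs:

```python
import math

E8M0_BIAS = 127
E8M0_NAN = 0xFF


def decode_e8m0(byte):
    """Decode an e8m0 scale: a power of two with no sign and no mantissa."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("e8m0 is an 8-bit encoding")
    if byte == E8M0_NAN:
        return math.nan
    return 2.0 ** (byte - E8M0_BIAS)


def dequantize_block(elements, scale_byte):
    """Apply one shared scale to every element of a block."""
    scale = decode_e8m0(scale_byte)
    return [e * scale for e in elements]
```

In a blockwise format, each block of elements along the reduction dimension shares one such scale, so the scale tensor is much smaller than the data tensor.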

This is a skeleton PR where we test basic functionality (numerics of various simple matrices, as well as one end to end quantization + gemm).

- Scales are flipped based on transpose_result
- Handles boundary conditions

Note that MXFP4 is not added in this PR - we can tackle that in a future PR.

This PR was created by taking https://github.com/pytorch/pytorch/pull/145562, switching e8m0 to in-core dtype, removing fp4 for now, and adding test cases.

# test plan

```
pytest test/test_matmul_cuda.py -k blockwise_mxfp8 -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147548
Approved by: https://github.com/drisspg

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-02-25 03:32:22 +00:00
cyy
8f728e28dd Enable ASAN in CUDA tests (#147512)
It should work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147512
Approved by: https://github.com/soulitzer
2025-02-25 02:58:39 +00:00
b0fa92042b Fix torch.mean out dtype check (#147188)
**For CPU**:
Type promotion is supported for torch.mean

**For Meta**:
Not supported for torch.mean

ISSUE related:
https://github.com/pytorch/pytorch/issues/138399
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147188
Approved by: https://github.com/albanD
2025-02-25 02:50:03 +00:00
33ff96b3f9 cpp_builder: unbreak clang++ detection (#147775)
Fixes an issue where `_is_gcc` would match on `clang++` due to the string ending with `g++`.
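
The pitfall can be reproduced without PyTorch. The sketch below contrasts a suffix check with a stricter pattern; both functions are illustrative stand-ins and not the actual `_is_gcc` implementation in `cpp_builder`:

```python
import os
import re


def naive_is_gcc(compiler):
    """Buggy suffix check: 'clang++' also ends with 'g++'."""
    return compiler.endswith(("gcc", "g++"))


def strict_is_gcc(compiler):
    """Accept gcc/g++ only at the start of the basename or after a non-letter,
    optionally followed by a version suffix such as '-12'."""
    name = os.path.basename(compiler)
    return re.search(r"(^|[^a-z])(gcc|g\+\+)(-[\d.]+)?$", name) is not None
```

With the stricter pattern, `clang++` is rejected while names such as `g++` and `x86_64-linux-gnu-g++-12` still match.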

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147775
Approved by: https://github.com/desertfire
2025-02-25 02:33:01 +00:00
dacdc9782b [Inductor] Add input value checking to randint meta function (#147191)
Fixes #147070

Add value checking for the range to the meta function, similar to that in the CUDA/CPU aten op.

Test with
```
PYTORCH_TEST_WITH_DYNAMO=1 pytest test/test_tensor_creation_ops.py -k test_randint_inference
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147191
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-02-25 02:18:16 +00:00
c644f4c5fe [Inductor] Fix the decompositions of torch isin (#147519)
**Summary**
Fixed two decomposition issues in `torch.isin`:

- Issue 1: As reported in [#147329](https://github.com/pytorch/pytorch/issues/147329), the current decomposition does not support cases where test_element is a scalar. This is now implemented following ead970c8d0/aten/src/ATen/native/TensorCompare.cpp (L1004-L1008)

- Issue 2: Found while enabling a unit test with `elements = 1` and `test_elements = torch.tensor([1, 2, 3, 4])`, where Inductor produced different results compared to eager mode. This issue is fixed by referring to ead970c8d0/aten/src/ATen/native/cpu/TensorCompareKernel.cpp (L329-L338)

**Test Plan**
```
python test/inductor/test_torchinductor.py -k test_isin_tensor_scalar
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147519
Approved by: https://github.com/jgong5, https://github.com/FFFrog, https://github.com/peterbell10
2025-02-25 01:49:44 +00:00
2c8cd41c1f Delete unused conda-aws-upload environment (#147792)
As this environment only contains keys for Anaconda uploads
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147792
Approved by: https://github.com/atalman
2025-02-25 01:42:52 +00:00
43074680b5 [ROCm] Add support for gfx1102 arch to wheel builds. (#147761)
[gfx1102 is not officially supported](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html) but most ROCm libs have gfx1102 code objects available since ROCm 5.5.  Now that we're using `--offload-compress` we can fit another gfx target.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147761
Approved by: https://github.com/jeffdaily
2025-02-25 01:35:52 +00:00
97557b9833 [Inductor] Update set_driver_to_gpu code to avoid backend re-initialization with new Triton (#147621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147621
Approved by: https://github.com/jansel
2025-02-25 00:04:54 +00:00
55bf3ff3a5 [Docs] Add OpDTypes.any_common_cpu_cuda_one (#147605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147605
Approved by: https://github.com/soulitzer
2025-02-24 23:23:43 +00:00
e72b4c61bf Revert "Upgrade submodule oneDNN to v3.7 (#147498)"
This reverts commit 576ed1e400d069ec2fff6162f82a71ff0bd81f7c.

Reverted https://github.com/pytorch/pytorch/pull/147498 on behalf of https://github.com/wdvr due to failing some tests on trunk - see below ([comment](https://github.com/pytorch/pytorch/pull/147498#issuecomment-2679867286))
2025-02-24 22:57:39 +00:00
81dccd706b [ROCm] OCP FP8 Support for new GPUs (#146632)
TLDR: Follow up/ Build on top of https://github.com/pytorch/pytorch/pull/144476. add OCP FP8 support for gfx950
refer to https://github.com/pytorch/ao/pull/1677

This pull request includes several changes to improve compatibility and support for new GPU architectures and data types, particularly for ROCm. The key updates involve adding support for new ROCm versions and GPU architectures, updating data type handling, and removing outdated checks.

### Improvements to GPU Architecture and ROCm Version Support:
* [`aten/src/ATen/Context.cpp`](diffhunk://#diff-33de472d304acbe57d693c8567370c638068bedc1aa0ce8e9dc115dad05a7810L323-R326): Added support for new GPU architectures `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks.
* [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199): Updated architecture support in multiple functions to include `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL865-R876)

### Updates to Data Type Handling:
* [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L81-L98): Enhanced data type conversion to include new float8 types for both CUDA and ROCm environments.
* [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fL29-R80): Updated `HipDataTypeFor` template to handle new float8 types and added hard-coded enum values for ROCm versions prior to 6.3.

### Removal of Outdated Checks:
* [`cmake/public/LoadHIP.cmake`](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197): Removed the check for `HIP_NEW_TYPE_ENUMS` as it is no longer necessary with the updated ROCm versions. [[1]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197) [[2]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L211-R182)

These changes ensure better compatibility and performance on newer hardware and software environments, particularly for users leveraging ROCm and CUDA for deep learning and scientific computing tasks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146632
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-02-24 22:47:52 +00:00
a71d8b7246 Fix ReferenceError: weakly-referenced object no longer exists in cycle detector (#146922)
Summary: weakref.proxy objects throw errors when they are dead, so we do not bother visualizing them. They are weak, so they aren't relevant to cycles anyway.
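
The failure mode is easy to reproduce with the standard library. A minimal sketch, where `Node` and `describe` are hypothetical names and not the cycle detector's code:

```python
import gc
import weakref


class Node:
    def __init__(self, name):
        self.name = name


def describe(obj):
    """Return a label for obj, or None if obj is a dead weak proxy."""
    try:
        return obj.name
    except ReferenceError:
        # Any attribute access on a dead weakref.proxy raises ReferenceError.
        return None


node = Node("a")
proxy = weakref.proxy(node)
alive_label = describe(proxy)  # referent is still alive here

del node
gc.collect()
dead_label = describe(proxy)  # referent has been collected
```

Skipping such objects is safe for cycle detection because a weak reference never keeps its referent alive.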

Differential Revision: D69270429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146922
Approved by: https://github.com/tianfengfrank, https://github.com/Chillee
2025-02-24 22:27:39 +00:00
22fae0d948 [AOTI][refactor] Replace run_command_and_check with CppBuilder.build (#147680)
Consolidate cpp compilation action to CppBuilder

Differential Revision: [D69723632](https://our.internmc.facebook.com/intern/diff/D69723632/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147680
Approved by: https://github.com/yushangdi, https://github.com/angelayi
ghstack dependencies: #147679
2025-02-24 21:45:15 +00:00
0b52d801d2 [AOTI][refactor] Rename use_absolute_path to use_relative_path (#147679)
The option really means to compile a cpp file using its basename instead of its full path.

Differential Revision: [D69722709](https://our.internmc.facebook.com/intern/diff/D69722709/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147679
Approved by: https://github.com/angelayi
2025-02-24 21:44:33 +00:00
96acb56626 [ROCm] Optimize the stride one indexing backwards kernel (#146420)
This patch makes several changes to the stride 1 backwards indexing kernel as follows:
- enables the computation across the `sorted_indices` array to happen in parallel by all the lanes in the warp, this means that the accesses to `sorted_indices` are now fully coalesced.
- the duplicate counting now happens in parallel: each lane in the warp counts the duplicates of a different `idx`.
- enable skipping during duplicate count: this optimization ensures that for a large number of duplicates we can skip 32 values at a time to speed up the count.
- for a low number of duplicates, i.e. fewer than `warp-size` duplicates, just perform the tail reduction, which avoids the wasteful parallel reduction across the warp for this case (it would only add zero values).
- for a high number of duplicates, i.e. more than `warp-size` duplicates, we still use the full warp of lanes to compute the reduced value with as much parallelism as possible. This is done by making sure that all lanes stick around and cooperatively execute the reduction in case there is a single `idx` which has a large number of duplicates (i.e. a duplicate spike). For this to happen we use shared memory to pass the duplicate count computed in parallel in the first part of the kernel to the cooperative reduction part of the kernel.
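
For orientation, the result that the kernel has to produce can be written as a sequential reference in a few lines. This is only a sketch of the semantics under the assumption of scalar gradient values aligned with `sorted_indices`; it says nothing about how the warp-level implementation is structured, and `accumulate_sorted` is a hypothetical name:

```python
def accumulate_sorted(sorted_indices, grad_values, num_rows):
    """Sum grad_values over each run of equal indices in sorted_indices."""
    out = [0.0] * num_rows
    i = 0
    while i < len(sorted_indices):
        idx = sorted_indices[i]
        # Count the duplicates of idx: the run [i, j) all share this index.
        j = i
        while j < len(sorted_indices) and sorted_indices[j] == idx:
            j += 1
        # Reduce the whole run into a single output slot.
        out[idx] += sum(grad_values[i:j])
        i = j
    return out
```

The optimizations listed above parallelize the two inner steps, counting the run length and reducing the run, across the lanes of a warp.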

Benefits on examples extracted from workloads show a 3.6x to 10x speed-up.

co-author: Hashem Hashemi <Hashem.Hashemi@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146420
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-02-24 21:19:06 +00:00
89b9c12de8 remove prints from partitioner (#147749)
See c57894cd74..22d8f9a657 (r1968015955)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147749
Approved by: https://github.com/Skylion007, https://github.com/laithsakka
2025-02-24 21:03:45 +00:00
8eb400ef66 [BE] TCPStore: use typed errors for assertions (#147647)
This is a follow up to #147465 that changes most TORCH_CHECK calls in TCPStore and TCPStoreLibUvBackend  to use typed exceptions instead of generic `TORCH_CHECK` calls which end up as RuntimeErrors in Python.

Test plan:

```
pytest test/distributed/test_store.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147647
Approved by: https://github.com/fduwjj
2025-02-24 20:58:10 +00:00
19fd21fe7e [Inductor] Hot fix after #146917 (#147639)
This pull request reverts the changes to `torch/_inductor/ir.py` file that were added in #146917.

In the environment where I tested, only the changes from `torch/_inductor/codegen/cpp_wrapper_gpu.py` were present; it turns out that the changes in the `torch/_inductor/ir.py` file are not really needed. So it's my fault, I didn't sync the environments (between several machines) correctly.

@davidberard98 @YUNQIUGUO maybe that's why the tests on CUDA didn't pass?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147639
Approved by: https://github.com/etaf, https://github.com/davidberard98
2025-02-24 20:34:48 +00:00
754fb834db [BE][CI] bump ruff to 0.9.0: string quote styles (#144569)
Reference: https://docs.astral.sh/ruff/formatter/#f-string-formatting

- Change the outer quotes to double quotes for nested f-strings

```diff
- f'{", ".join(args)}'
+ f"{', '.join(args)}"
```

- Change the inner quotes to double quotes for triple f-strings

```diff
  string = """
-     {', '.join(args)}
+     {", ".join(args)}
  """
```

- Join implicitly concatenated strings

```diff
- string = "short string " "short string " f"{var}"
+ string = f"short string short string {var}"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144569
Approved by: https://github.com/Skylion007
ghstack dependencies: #146509
2025-02-24 19:56:09 +00:00
52f6d4aa30 [BE][CI][Easy] bump ruff to 0.9.0: long statements in docstrings (#146509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146509
Approved by: https://github.com/justinchuby, https://github.com/Skylion007
2025-02-24 19:56:08 +00:00
9605c5063b [ROCm][TunableOp] Speed-up matmul_small_brute_force_tunableop unit test (#147659)
This PR has a UT speed-up and some refactoring of tests.

A previous PR https://github.com/pytorch/pytorch/pull/142422 fixed this matmul_small_brute_force_tunableop for the FP16 data type by adding TunableOp numerical checks. It had the unfortunate side effect that it increased the execution time for the FP32 and FP64 data types by a significant margin. This PR *reduces* the execution time by 20+ minutes.

We also move a hipBLASLt version check to a different tunableop UT for simplicity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147659
Approved by: https://github.com/jeffdaily
2025-02-24 19:44:38 +00:00
69c4f6ff13 [Minor] Fix minor mistake in docstring of replace_pattern (#147611)
Fixes #147610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147611
Approved by: https://github.com/soulitzer
2025-02-24 19:33:44 +00:00
b9b1fd9b93 [Intel GPU] qlinear.pointwise with mixed dtype support (#136753)
# Motivation
This PR adds mixed data type (AMP) support for the `qlinear_pointwise` op. With this PR, `qlinear` kernels can output a BF16 tensor, rather than only FP32/INT8.

# UT verification
```bash
DNNL_VERBOSE=1 python test/inductor/test_mkldnn_pattern_matcher.py -v \
    -k test_qlinear_int8_mixed_bf16_xpu \
    -k test_qlinear_relu_int8_mixed_bf16_xpu \
    -k test_qlinear_add_int8_mixed_bf16_xpu
```

# Runtime exemplification
```bash
#qlinear+bf16 output
onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32,,4x4:4x4,0.0698242
# qlinear_add + bf16 output
onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:-0.677141+sum:0.0132773,,4x4:4x4,0.0419922
# qlinear_add_relu + bf16 output
onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:0.533096+sum:0.00416481+eltwise_relu,,4x4:4x4,0.0759277
```
As shown in the oneDNN verbose output, the attribute `dst_bf16::blocked:ab::f0` demonstrates that we can successfully output a bf16 tensor from an int8 gemm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136753
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire, https://github.com/jerryzh168
ghstack dependencies: #133307, #135189, #135337, #135465

Co-authored-by: guangyey <guangye.yu@intel.com>
2025-02-24 19:27:50 +00:00
075b91bef1 [Intel GPU] qconv.pointwise with mixed dtype XPU support (#135465)
# Motivation
This PR adds mixed data type (AMP) support for the `qconv_pointwise` op. With this PR, `qconv` kernels can output a BF16 tensor, rather than only FP32/INT8.

# UT verification
```bash
DNNL_VERBOSE=1 python test/inductor/test_mkldnn_pattern_matcher.py -v \
    -k test_qconv2d_int8_mixed_bf16_xpu \
    -k test_qconv2d_relu_int8_mixed_bf16_xpu \
    -k test_qconv2d_hardtanh_int8_mixed_bf16_xpu \
    -k test_qconv2d_hardswish_int8_mixed_bf16_xpu \
    -k test_qconv2d_silu_int8_mixed_bf16_xpu \
    -k test_qconv2d_add_int8_mixed_bf16_xpu \
    -k test_qconv2d_add_relu_int8_mixed_bf16_xpu
```

# Runtime verification
```bash
#qconv + bf16
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_bf16::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0539551
# qconv_silu + bf16
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_undef::undef::: dst_bf16::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_swish:1,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0588379
# qconv_hardswish + bf16
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_undef::undef::: dst_bf16::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_hardswish:0.166667:0.5,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0568848
```
The `dst_bf16::blocked:acdb::f0` attribute in the oneDNN verbose output demonstrates that the output tensor is successfully computed as bf16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135465
Approved by: https://github.com/liangan1, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire, https://github.com/jerryzh168
ghstack dependencies: #133307, #135189, #135337

Co-authored-by: guangyey <guangye.yu@intel.com>
2025-02-24 19:27:50 +00:00
ffa19b9024 [ROCm][Windows] Fix unrecognized constexpr std::memcpy for HIP-clang (#147316)
Since memcpy is not defined as a constexpr function in MSVC's 2019/2022 implementation of the STL, the HIP clang compiler on Windows cannot evaluate the affected `std::memcpy` call at compile time. To resolve this, `__builtin_memcpy` is used instead, which doesn't have this limitation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147316
Approved by: https://github.com/jeffdaily
2025-02-24 18:28:59 +00:00
900a774781 Revert "[ROCm] Update periodic.yml to use 2GPU runners (#146839)"
This reverts commit b6273d7f4ba4fbb126eb96816287641ca1e4efc6.

Reverted https://github.com/pytorch/pytorch/pull/146839 on behalf of https://github.com/jithunnair-amd due to This change is not needed anymore since our 4-GPU runners are back online and stable so far ([comment](https://github.com/pytorch/pytorch/pull/146839#issuecomment-2679145448))
2025-02-24 17:17:58 +00:00
cde12207a0 [Intel GPU] Add SDPA implementation on XPU with OneDNN (#147612)
Add XPU implementation of OneDNN based SDPA operator. Will be integrated and enabled later.

Depends on BUILD_GRAPH switch in #147608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147612
Approved by: https://github.com/EikanWang
2025-02-24 16:12:04 +00:00
576ed1e400 Upgrade submodule oneDNN to v3.7 (#147498)
This PR is to upgrade submodule oneDNN to v3.7.

## Improvements

- Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Improved performance of int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support.
- Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
- Added bfloat16 support for SDPA, implemented fp16 and bf16 gemm kernel in SDPA.
- Fixed issues with f16 matmul accuracy, SDPA failing to dispatch to ukernel, bf16/fp16/fp32 conv performance, an INT8 kernel triggering a page fault, deconvolution precision on complex128 and fp64, and gemm correctness in float16.
- Improved bf16 matmul performance with fp32 destination with Arm Compute Library (ACL).
- Improved bf16 to fp32 reorder performance.
- Improved bf16 reorder performance.
- Improved bf16 convolution with ACL.

Fixes https://github.com/pytorch/pytorch/issues/136348.

## Validation results on CPU

1. NLP models accuracy/inference/training
![image](https://github.com/user-attachments/assets/859279b8-1631-4268-b226-7de9ac5870d8)

![image](https://github.com/user-attachments/assets/30ec7151-41ca-482a-9d2d-0c4850e75bab)

2. Torchbench cpu userbenchmark inference & training

![image](https://github.com/user-attachments/assets/71c9807c-caf9-4385-9990-d2ab637031cd)

3. Inductor quantization

![image](https://github.com/user-attachments/assets/3d2a3bd3-82fa-4566-8050-7ea5d6b61675)

4. Dynamo benchmarks
![image](https://github.com/user-attachments/assets/554ecce3-c85c-4a0e-88f1-2e73983c5dcd)
![image](https://github.com/user-attachments/assets/148c88f8-4367-4428-bb54-ce8a4deefd1b)
![image](https://github.com/user-attachments/assets/f2e744f4-d710-4699-acf4-1f130ecfadf1)
![image](https://github.com/user-attachments/assets/97128b80-4d0e-495a-aeda-dde3e70c96fd)
![image](https://github.com/user-attachments/assets/a9afce37-684c-45c0-b938-6dd7e0383805)
![image](https://github.com/user-attachments/assets/b8714236-9681-4fbe-8d98-be93deedab88)
![image](https://github.com/user-attachments/assets/4423061f-d133-45ba-98bd-d2f739e50431)
![image](https://github.com/user-attachments/assets/7955da10-3d23-493e-99fa-658f7f40035b)

## Validation results on XPU
Accuracy is same as baseline. Performance is shown below.
![image](https://github.com/user-attachments/assets/7645304d-5b1d-43f9-b840-9f846ed380a0)

## Validation results on ARM
![image](https://github.com/user-attachments/assets/080f7c02-0238-436f-ad20-5a9e3f6aafbb)
![image](https://github.com/user-attachments/assets/443742aa-ca61-41de-ae80-5d4c65cd0c87)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147498
Approved by: https://github.com/fadara01, https://github.com/mingfeima, https://github.com/atalman
2025-02-24 14:32:51 +00:00
80d3afc698 [inductor] Improve type annotations in _inductor/pattern_matcher.py (#146626)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146626
Approved by: https://github.com/Skylion007
2025-02-24 14:30:35 +00:00
d0f08dc3eb Update slow tests (#147728)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147728
Approved by: https://github.com/pytorchbot
2025-02-24 11:48:19 +00:00
cba14212e6 [FX] micro-optimization map_aggregate(immutable_dict) (#147691)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147691
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #147699, #144640
2025-02-24 09:14:08 +00:00
a50af71fb6 [FX] Refactor immutable collections implementation (#144640)
Get rid of dynamic class creation via `type(name, bases, ...)`. Convert it to classic static class definition for better readability and static analysis support.
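
The difference between the two styles can be sketched as follows. The class names and the small set of blocked methods are hypothetical and chosen for illustration; they are not the definitions in `torch.fx.immutable_collections`:

```python
def _no_mutation(self, *args, **kwargs):
    raise TypeError(f"{type(self).__name__} does not support mutation")


# Dynamic style: the class body is assembled at runtime, so static analysis
# tools cannot see which methods exist on the resulting type.
DynamicImmutableList = type(
    "DynamicImmutableList",
    (list,),
    {name: _no_mutation for name in ("append", "extend", "__setitem__")},
)


# Static style: the same behavior, but visible to readers and type checkers.
class StaticImmutableList(list):
    append = _no_mutation
    extend = _no_mutation
    __setitem__ = _no_mutation
```

Both classes behave identically at runtime; the static definition is simply easier to read, navigate, and annotate.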

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144640
Approved by: https://github.com/jansel
ghstack dependencies: #147699
2025-02-24 09:14:08 +00:00
dc9a03d30c [Window] Fix invalid file path on windows. (#147708)
This PR aims to fix the invalid path for windows: `C:\\Users\\sdp\\AppData\\Local\\Temp\\tmp0wugz2qm\\dynamo\\code_state___main__.TestFxGraphCache.test_cache_hot_load_pgo:None:.pkl.lock`
Windows does not allow the chars `\ / : * ? " < > |` in a file name.

This PR also replaces `os.rename` with `os.replace` in torch/_dynamo/pgo.py, because on Windows `os.replace` allows the target file to already exist, while `os.rename` does not.

| Function                        | `os.rename()`         | `os.replace()`        |
|---------------------------------|-----------------------|-----------------------|
| Rename a file                   | Yes                   | Yes                   |
| Move a file                     | Yes                   | Yes                   |
| Overwrite an existing file      | No (Error on Windows) | Yes (Will overwrite)  |
| Overwrite an existing directory | No (Error on Windows) | No (Error on Windows) |
| Move across disks               | No                    | No                    |
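
Both points can be sketched with the standard library alone. `sanitize_filename` and `atomic_write` below are hypothetical helpers written for illustration, not the functions changed in torch/_dynamo/pgo.py:

```python
import os
import re
import tempfile


def sanitize_filename(name):
    """Replace characters that Windows rejects in a file name."""
    return re.sub(r'[\\/:*?"<>|]', "_", name)


def atomic_write(path, data):
    """Write to a temp file in the same directory, then swap it into place.

    os.replace overwrites an existing target on every platform, whereas
    os.rename raises on Windows if the target already exists.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(data)
    os.replace(tmp_path, path)


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, sanitize_filename("code_state_test:None:.pkl"))
    atomic_write(target, "first")
    atomic_write(target, "second")  # target already exists at this point
    with open(target) as f:
        result = f.read()
```

Keeping the temp file in the same directory as the target avoids a cross-device move, which neither function supports.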

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147708
Approved by: https://github.com/jansel
2025-02-24 08:31:11 +00:00
5b6ad682bc Revert "[TorchRec][PT2] disable contextlib in PT2 train pipeline (#147254)"
This reverts commit 85ea67983421acc30ccc76f7a159042e75c6ea08.

Reverted https://github.com/pytorch/pytorch/pull/147254 on behalf of https://github.com/jeanschmidt due to introduced reds on main ([comment](https://github.com/pytorch/pytorch/pull/147254#issuecomment-2677700862))
2025-02-24 08:20:16 +00:00
8d618f3da7 [AOTI][XPU] Suppress multi-line comment warning for XPU. (#147710)
This PR aims to suppress the multi-line comment warning in the sycl header when building the Inductor cpp_wrapper.
```
/intel/oneapi/compiler/2025.0/include/sycl/detail/builtins/builtins.hpp:235:1: warning: multi-line comment [-Wcomment]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147710
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-02-24 07:28:59 +00:00
cee03b7746 [Inductor] Update should_decompose_mm condition for CPU (#147673)
Summary:
Previously, for cpu we decompose addmm if
```
check_device(mat1, mat2, device="cpu")
        and mat1.shape[0] == 1
        and mat2.shape[0] <= 64
        and mat2.shape[1] <= 16
```
We have a new case where `mat2.shape[2] = 304`, and benchmarks show that decomposing is beneficial there, so update the condition to
```
check_device(mat1, mat2, device="cpu")
        and mat1.shape[0] == 1
        and mat2.shape[0] <= 64
        and mat2.shape[1] <= 512
```
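
Written as a standalone predicate, the updated condition looks roughly like the sketch below. It takes plain shape tuples and a device string instead of tensors, and `should_decompose_addmm_cpu` is a hypothetical name; the real check goes through `check_device` on the tensors:

```python
def should_decompose_addmm_cpu(mat1_shape, mat2_shape, device="cpu"):
    """Sketch of the updated CPU heuristic on plain shape tuples."""
    return (
        device == "cpu"
        and mat1_shape[0] == 1      # single-row left operand
        and mat2_shape[0] <= 64     # small reduction dimension
        and mat2_shape[1] <= 512    # bound raised from 16 to 512
    )
```

A shape with 304 output columns now satisfies the last bound, while it did not under the previous limit of 16.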

Differential Revision: D70033166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147673
Approved by: https://github.com/houseroad
2025-02-24 05:51:50 +00:00
8b65dbad13 [MPS/Inductor] Add support for xlog1py. (#147709)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147709
Approved by: https://github.com/jansel
2025-02-24 05:28:52 +00:00
baccadb2f1 xpu: torch.xpu.get_arch_list() to return [] if xpu not compiled (#147431)
Initially discussed here: https://github.com/pytorch/pytorch/pull/132945#discussion_r1957366131

Previously `torch.xpu.get_arch_list()` got relaxed to work even if an XPU device is not available. However, we overlooked the case when PyTorch is not compiled with XPU support. In such a case the function throws an exception. This commit adjusts this behavior and makes the function return `[]` even if PyTorch is not compiled with XPU support.

CC: @EikanWang @fengyuan14 @guangyey @malfet @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147431
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD
2025-02-24 01:35:54 +00:00
7c52ef2424 Add XPU to is_compile_supported to support roi_align op in torchvision (#147541)
Part of the required fix for https://github.com/intel/torch-xpu-ops/issues/1264.

To support `roi_align`, torchvision uses `is_compile_supported` in `torch/_dynamo/utils.py` to compile a non-deterministic version of the op for backwards passes. This PR adds XPU device to the supported compile devices.

The `is_compile_supported()` util function has extremely limited usage, only being used in `torchvision.ops.roi_align` and `torch.utils._content_store.has_storage()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147541
Approved by: https://github.com/guangyey, https://github.com/jansel

Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com>
2025-02-24 01:32:36 +00:00
4e934ee5a7 [MPS] Add eager support for xlog1py. (#147687)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147687
Approved by: https://github.com/malfet
2025-02-24 01:23:59 +00:00
eqy
718cf68aee [cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)
As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels.

This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:

+ caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`)
+ "free" workspace size bump for `cuBLASLt`: `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default, which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs; see also #120925
+ fixes broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work, here the cached/reused `cuBLAS` workspace seems to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider
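
As a rough illustration of what a single `CUBLAS_WORKSPACE_CONFIG` value expresses, the sketch below computes the total number of bytes a config string asks for. It assumes the `:SIZE:COUNT[:SIZE:COUNT...]` format with `SIZE` in KiB, as described in NVIDIA's cuBLAS documentation; the helper name `workspace_bytes` is hypothetical and this is not how PyTorch parses the variable internally.

```python
def workspace_bytes(config: str) -> int:
    """Total bytes requested by a ':SIZE:COUNT[:SIZE:COUNT...]' string.

    Assumption: SIZE is given in KiB and COUNT is the number of buffers
    of that size, following the cuBLAS documentation.
    """
    fields = config.strip().lstrip(":").split(":")
    if len(fields) % 2 != 0 or not all(f.isdigit() for f in fields):
        raise ValueError(f"malformed workspace config: {config!r}")
    sizes_kib = [int(f) for f in fields[0::2]]
    counts = [int(f) for f in fields[1::2]]
    return sum(size * count for size, count in zip(sizes_kib, counts)) * 1024


# Eight buffers of 4096 KiB each.
print(workspace_bytes(":4096:8"))
# Two buffers of 4096 KiB plus eight buffers of 16 KiB.
print(workspace_bytes(":4096:2:16:8"))
```

With this PR, the same setting would govern `cuBLASLt` as well, so only one such value needs to be reasoned about.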

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130
Approved by: https://github.com/ngimel
2025-02-23 22:01:39 +00:00
b5d7aefa57 [BE] add missing overload annotations for tree_map_only (#147699)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147699
Approved by: https://github.com/Skylion007
2025-02-23 20:21:07 +00:00
f47573f70d Add super().setUp() to some test cases (#147651)
I saw that the disable issues for these tests were getting spammed with comments, meaning that the tests were still running in CI despite having a disable issue. These test cases were missing the super().setUp() call, which checks whether a disable issue exists for them, so I added it.
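
The failure mode can be reproduced with plain `unittest`. This is a minimal sketch: `DISABLED_TESTS` and `BaseTestCase` are hypothetical stand-ins for the real disable-issue lookup in PyTorch's test base class, which is not shown in this commit.

```python
import unittest

# Hypothetical stand-in for the "is there a disable issue?" lookup.
DISABLED_TESTS = {"test_flaky"}


class BaseTestCase(unittest.TestCase):
    def setUp(self):
        # The base class decides whether the test should be skipped.
        if self._testMethodName in DISABLED_TESTS:
            self.skipTest("disabled by an open issue")


class ForgetsSuper(BaseTestCase):
    def setUp(self):
        # Overrides setUp without calling super(): the skip check never runs.
        self.data = [1, 2, 3]

    def test_flaky(self):
        self.assertEqual(len(self.data), 3)


class CallsSuper(BaseTestCase):
    def setUp(self):
        super().setUp()  # runs the skip check first
        self.data = [1, 2, 3]

    def test_flaky(self):
        self.assertEqual(len(self.data), 3)


def skipped_count(case):
    """Run every test in `case` and return how many were skipped."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return len(result.skipped)


print(skipped_count(ForgetsSuper))  # the disabled test still runs
print(skipped_count(CallsSuper))    # the disabled test is skipped
```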

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147651
Approved by: https://github.com/huydhn
2025-02-23 18:21:17 +00:00
f03e7f3801 [MPS] Workaround rng bug for 5D tensors (#147667)
For some reason MPSGraph returns repeated values if the tensor dimension is
larger than 4, which can be clearly seen by running the following
```swift
import Metal
import MetalPerformanceShadersGraph

func randMPS(device: MTLDevice, obuf: MTLBuffer, nelem: Int, ndim: Int = 5) {
  let graph = MPSGraph()
  var dims = Array(repeating: 1, count: ndim)
  dims[0] = nelem
  let shape = dims.map { NSNumber(value: $0) }
  let randNode = graph.randomUniformTensor(withShape: shape, seed: 42, name: nil)
  let mpsOutputBuffer = MPSGraphTensorData(obuf, shape: shape, dataType: .float32)
  guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") }
  graph.run(with: queue, feeds: [:], targetOperations: nil, resultsDictionary: [randNode: mpsOutputBuffer])
}

func printBuf(_ prefix: String, buf: MTLBuffer, nelem: Int) {
  let buf_data = buf.contents().assumingMemoryBound(to: Float.self)
  print(prefix)
  for i in 0..<nelem {
      print(buf_data[i], terminator: i != nelem - 1 ? " " : "\n")
  }
}

guard let device = MTLCopyAllDevices().first else { fatalError("No Metal device found") }
print("Using device \(device.name)")

let nelem = 2
guard let buf = device.makeBuffer(length:nelem * MemoryLayout<Float>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }

randMPS(device: device, obuf: buf, nelem: nelem, ndim: 4)
printBuf("4D uniform", buf: buf, nelem: nelem)

randMPS(device: device, obuf: buf, nelem: nelem, ndim: 5)
printBuf("5D uniform", buf: buf, nelem: nelem)
```

Work around this by flattening the tensor if it's contiguous

Fixes https://github.com/pytorch/pytorch/issues/147624
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147667
Approved by: https://github.com/dcci
2025-02-23 16:52:01 +00:00
3e2d9d079e Revert "[ROCm] OCP FP8 Support for new GPUs (#146632)"
This reverts commit f95ab46797e1f3e8cc48ce2f45e4f6985132fb19.

Reverted https://github.com/pytorch/pytorch/pull/146632 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, I'll find someone to help merge this PR back to main ([comment](https://github.com/pytorch/pytorch/pull/146632#issuecomment-2676823614))
2025-02-23 12:04:50 +00:00
d0adff761e Propagate AttributeError to user code in user_defined.py (#146497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146497
Approved by: https://github.com/anijain2305, https://github.com/zou3519
ghstack dependencies: #146496
2025-02-23 01:18:28 +00:00
8c761ac7e3 Handle is/is not (#146496)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146496
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-02-23 01:18:28 +00:00
b084635735 [MPS/inductor] Adjust more tests that depends on non-divisible input sizes (#147681)
Also adjust a comment while I'm at it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147681
Approved by: https://github.com/jansel
2025-02-23 00:33:26 +00:00
6a5e3917a7 [MPS] Add inductor support for spherical_bessel_j0. (#147650)
Counterpart to my previous patch that added support for the op in eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147650
Approved by: https://github.com/jansel
2025-02-23 00:32:36 +00:00
f9c117f859 [mps/inductor] XFAIL adaptive_avg_pool_with_output_size_0. (#147676)
Non-divisible input sizes are not implemented on MPS device yet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147676
Approved by: https://github.com/malfet
2025-02-22 20:17:33 +00:00
db15cb0988 [Submodule] [Cutlass] Update to 3.8.0 tag (#147655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147655
Approved by: https://github.com/henrylhtsang, https://github.com/eqy
2025-02-22 20:05:31 +00:00
85ea679834 [TorchRec][PT2] disable contextlib in PT2 train pipeline (#147254)
Summary:

# context
* more details in the [post](https://fb.workplace.com/groups/1075192433118967/permalink/1587079018596970/)
* disable contextlib with PT2

Test Plan:
* run command
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+dynamo,+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_ultra_mini training.pipeline_type=pt2 data_loader.dataset.table_ds=[2024-12-02] 2>&1 | tee -a output.log
```
* old tlparse
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpYYAS3o/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100
* new tlparse
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpUJhCGZ/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

Reviewed By: Microve

Differential Revision: D68480678
2025-02-22 18:57:55 +01:00
fa8e3a28a7 Revert "[cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178)"
This reverts commit 533b884870acd951e684e0bf551eb76904dec047.

Reverted https://github.com/pytorch/pytorch/pull/141178 on behalf of https://github.com/jeanschmidt due to Broke internal arvr signals, see D69971019. @jbschlosser please help the author get this PR merged ([comment](https://github.com/pytorch/pytorch/pull/141178#issuecomment-2676317470))
2025-02-22 17:28:12 +00:00
bea72180ed Revert "[ROCm] Implemented dropout usage for RNN with MIOpen backend (#144572)"
This reverts commit e7bf490c430ac5a70ccb7ab8e954d3386fd29413.

Reverted https://github.com/pytorch/pytorch/pull/144572 on behalf of https://github.com/jeanschmidt due to Broke internal signals, D69994027, I'll find someone to help get this change merged ([comment](https://github.com/pytorch/pytorch/pull/144572#issuecomment-2676314308))
2025-02-22 17:19:38 +00:00
3409cbd177 Revert "Delete Mixed MM Special Casing (#147151)"
This reverts commit d6bb1d7f0a9dc3d11d2864da9ab46872377a6e52.

Reverted https://github.com/pytorch/pytorch/pull/147151 on behalf of https://github.com/jeanschmidt due to Broke a few internal signals, see comments on D69994157 ([comment](https://github.com/pytorch/pytorch/pull/147151#issuecomment-2676312215))
2025-02-22 17:14:32 +00:00
72b4f35cb5 [CI] Reduce the AOT target list to reduce build time (#147601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147601
Approved by: https://github.com/atalman
2025-02-22 14:43:26 +00:00
3cc3d7e08f Also support non-contiguous activation for torch._weight_int8pack_mm on CPU (#147588)
### Problem
Non-contiguous activation for `torch._weight_int8pack_mm` is unsupported on CPU.
So, with int8 WoQ with B16 activation with torchao, for batch-size 2 & above, an assertion is hit regarding non-contiguous A being unsupported. Such an issue was encountered with LLaMA models.

### Solution
Also support non-contiguous activation for `torch._weight_int8pack_mm`, so long as it's contiguous on the last dimension & remove the assertion that requires contiguous activation.
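
To make the "contiguous on the last dimension" condition concrete, here is a small stride-based sketch in plain Python. The helper names and the example layout (taking one position out of a `(batch, seq, hidden)` buffer) are illustrative assumptions, not code from the PR.

```python
def is_contiguous(shape, strides):
    """True if the strides (in elements) describe a dense row-major layout."""
    expected = 1
    for size, stride in zip(reversed(shape), reversed(strides)):
        if size != 1 and stride != expected:
            return False
        expected *= size
    return True


def last_dim_contiguous(strides):
    """True if consecutive elements of a row are adjacent in memory."""
    return strides[-1] == 1


# Hypothetical activation: one position selected from a (batch, seq, hidden)
# buffer, giving shape (batch, hidden) with a leading stride of seq * hidden.
batch, seq, hidden = 2, 3, 4
storage = list(range(batch * seq * hidden))
shape, strides = (batch, hidden), (seq * hidden, 1)

print(is_contiguous(shape, strides))   # not dense overall
print(last_dim_contiguous(strides))    # but each row is dense

# Each row can still be read without a copy by using the leading stride.
lda = strides[0]
rows = [storage[i * lda : i * lda + hidden] for i in range(batch)]
print(rows)
```

A kernel that accepts a leading stride in this way can walk each row directly, which is why a prior `contiguous` copy is not needed.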

### Alternative solutions considered
Could modify LLaMA model in transformers library to call `contiguous` after obtaining the final hidden state, just before computing logits with the LM head. However, [it](https://github.com/huggingface/transformers/pull/36078) might cause some regression for other users of that code.

Another aspect to this issue is - is latency always lower if we make an activation tensor contiguous before linear or `torch._weight_int8pack_mm` is called on CPU? I guess we need some data-points to analyze this part, although I think the performance should be good enough with this patch, since the first cache lines of rows of A are being explicitly prefetched in the existing code (and it also avoids copy, which a `contiguous` call would do).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147588
Approved by: https://github.com/mingfeima, https://github.com/leslie-fang-intel, https://github.com/malfet
2025-02-22 08:29:07 +00:00
e1bf892d90 [DDP] Temporarily disable comm mem (#147663)
For fear that it incurs slightly more memory usage and causes some applications with tight memory margins to OOM
(possibly because the comm mem pool is a separate pool from the regular pool?).

Differential Revision: [D70026681](https://our.internmc.facebook.com/intern/diff/D70026681)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147663
Approved by: https://github.com/d4l3k
2025-02-22 05:55:43 +00:00
086d146f6f Update ruff linter for PEP585 (#147540)
This turns on PEP585 enforcement in RUFF.

- Updates the target python version
- Stops ignoring UP006 warnings (PEP585)
- Fixes a few issues which crept into the tree in the last day

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147540
Approved by: https://github.com/justinchuby, https://github.com/Skylion007
2025-02-22 04:45:17 +00:00
77d2780657 Enable strobelight profiling specific compile frame ids using COMPILE_STROBELIGHT_FRAME_FILTER (#147549)
Running `python test/strobelight/examples/compile_time_profile_example.py` produces:
```
strobelight_compile_time_profiler, line 123, 2025-02-20 14:08:08,409, INFO: compile time strobelight profiling enabled
strobelight_compile_time_profiler, line 159, 2025-02-20 14:08:08,409, INFO: Unique sample tag for this run is: 2025-02-20-14:08:081656673devgpu005.nha1.facebook.com
strobelight_compile_time_profiler, line 160, 2025-02-20 14:08:09,124, INFO: URL to access the strobelight profile at the end of the run: https://fburl.com/scuba/pyperf_experimental/on_demand/9felqj0i

strobelight_compile_time_profiler, line 205, 2025-02-20 14:08:12,436, INFO: profiling frame 0/0 is skipped due to frame_id_filter 1/.*
strobelight_compile_time_profiler, line 205, 2025-02-20 14:08:15,553, INFO: profiling frame 0/0 is skipped due to frame_id_filter 1/.*
strobelight_compile_time_profiler, line 205, 2025-02-20 14:08:16,170, INFO: profiling frame 0/0 is skipped due to frame_id_filter 1/.*
strobelight_compile_time_profiler, line 214, 2025-02-20 14:08:16,877, INFO: profiling frame 1/0
strobelight_function_profiler, line 247, 2025-02-20 14:08:19,416, INFO: strobelight run id is: 4015948658689996
strobelight_function_profiler, line 249, 2025-02-20 14:08:21,546, INFO: strobelight profiling running
strobelight_function_profiler, line 289, 2025-02-20 14:08:25,964, INFO: work function took 4.417063233006047 seconds
strobelight_function_profiler, line 230, 2025-02-20 14:08:28,310, INFO: strobelight profiling stopped
strobelight_function_profiler, line 221, 2025-02-20 14:08:44,308, INFO: Total samples: 119
strobelight_function_profiler, line 221, 2025-02-20 14:08:44,308, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/73h2f7ur
strobelight_function_profiler, line 221, 2025-02-20 14:08:44,308, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/zs06fi9e
strobelight_compile_time_profiler, line 167, 2025-02-20 14:08:44,308, INFO: 1 strobelight success runs out of 1 non-recursive compilation events.

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147549
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #147547
2025-02-22 03:44:53 +00:00
fc095a885c move _strobelight/example to avoid graph breaks (#147547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147547
Approved by: https://github.com/bobrenjc93
2025-02-22 03:44:53 +00:00
fecd3f7ecb [ROCm] change is_hip_clang() to always return True (#147646)
hipify is replacing kernel launches (`<<< >>>`) with the hipLaunchKernelGGL() macro. This is a regression caused by /opt/rocm/hip/.hipinfo no longer existing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147646
Approved by: https://github.com/jeffdaily, https://github.com/petrex
2025-02-22 03:26:55 +00:00
b11d5cd584 [Inductor UT][Windows][XPU] Fix Inductor UT on XPU Windows. (#146481)
This PR fixes all the Inductor UT failures for the XPU backend on Windows that we found on a local machine. (Due to resource constraints, we have not yet set up a Windows CI pipeline online.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146481
Approved by: https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #147347
2025-02-22 02:53:16 +00:00
2d433cf1ad [Inductor UT][Windows][XPU] Enable Inductor UT on XPU Windows. (#147347)
This PR removes the restrictions on general cases for XPU on Windows, allowing us to run Inductor UT on Windows.
Additionally, this series of PRs has also fixed all XPU Inductor UT issues on Windows. However, due to resource constraints, we have not yet set up a Windows CI pipeline online.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147347
Approved by: https://github.com/jansel, https://github.com/EikanWang
2025-02-22 02:53:16 +00:00
84fcf1bb11 constexpr all the things in irange.h (#147633)
I got complaints while irangeifying some files in ExecuTorch
that irange could not be used in a constexpr function. This made the
complaints go away.

I added a constexpr function in irange_test that used to fail to build
with `error: variable of non-literal type 'iterator' (aka
'integer_iterator<int, true>') cannot be defined in a constexpr
function before C++23` and now builds fine.

Differential Revision: [D69959614](https://our.internmc.facebook.com/intern/diff/D69959614/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147633
Approved by: https://github.com/albanD
2025-02-22 01:51:51 +00:00
6e0b09728a [export] Remove report from draft-export output (#147558)
Summary: This matches the export API. To print the report, people can just do `print(ep._report)`. This information is also displayed in the terminal after the draft_export call.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D69689154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147558
Approved by: https://github.com/pianpwk
2025-02-22 00:54:29 +00:00
1c334893dc [CacheBench] Refactor code to prepare for mode benchmarks (#147641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147641
Approved by: https://github.com/huydhn
2025-02-22 00:20:54 +00:00
5d26b7108f [PP] Remove extra code and docs BE (#147636)
current docs:
<img width="746" alt="image" src="https://github.com/user-attachments/assets/4c4088fc-ee97-4a82-be28-e33eb35e76f5" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147636
Approved by: https://github.com/awgu
2025-02-22 00:10:31 +00:00
f95ab46797 [ROCm] OCP FP8 Support for new GPUs (#146632)
TLDR: Follow up/ Build on top of https://github.com/pytorch/pytorch/pull/144476. add OCP FP8 support for gfx950
refer to https://github.com/pytorch/ao/pull/1677

This pull request includes several changes to improve compatibility and support for new GPU architectures and data types, particularly for ROCm. The key updates involve adding support for new ROCm versions and GPU architectures, updating data type handling, and removing outdated checks.

### Improvements to GPU Architecture and ROCm Version Support:
* [`aten/src/ATen/Context.cpp`](diffhunk://#diff-33de472d304acbe57d693c8567370c638068bedc1aa0ce8e9dc115dad05a7810L323-R326): Added support for new GPU architectures `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks.
* [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199): Updated architecture support in multiple functions to include `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL865-R876)

### Updates to Data Type Handling:
* [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L81-L98): Enhanced data type conversion to include new float8 types for both CUDA and ROCm environments.
* [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fL29-R80): Updated `HipDataTypeFor` template to handle new float8 types and added hard-coded enum values for ROCm versions prior to 6.3.

### Removal of Outdated Checks:
* [`cmake/public/LoadHIP.cmake`](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197): Removed the check for `HIP_NEW_TYPE_ENUMS` as it is no longer necessary with the updated ROCm versions. [[1]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197) [[2]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L211-R182)

These changes ensure better compatibility and performance on newer hardware and software environments, particularly for users leveraging ROCm and CUDA for deep learning and scientific computing tasks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146632
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-02-21 23:44:08 +00:00
b1a81a4a65 Don't use '-e' when installing Triton (#147228)
Currently the install_triton.sh script uses "pip install -e ." to install Triton.
Using -e is sometimes appropriate for development work but is less appropriate for delivery.
To make matters worse, the behavior of -e seems to vary depending on the version of pip involved.

This PR removes the -e and installs Triton normally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147228
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-02-21 23:00:12 +00:00
995b125cdd [CI] Build sm89 with more procs experiment (#147487)
Add a build that uses 4 out of the 8 processes available on a linux.2xlarge/c5.2xlarge.  Currently it's set to 2 because it would oom, but I'm curious as to how often people's builds oom.  I can't test this on my own because of caching, so it has to run on pull request

This might result in a failing job on many people's PRs, and I'm not sure how to get around it.  I named it stable so that it is automatically sorted into the stable group for Dr. CI, but it will still show up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147487
Approved by: https://github.com/huydhn
2025-02-21 22:07:00 +00:00
7c8c82cd64 [trymerge] Post initial starting merge comment on stacked PRs (#147028)
Post a small comment stating if a PR is being merged as part of a stack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147028
Approved by: https://github.com/ZainRizvi
2025-02-21 22:05:00 +00:00
698f6f9fae specify only some dimensions in shapes collection (#147534)
Differential Revision: [D69936316](https://our.internmc.facebook.com/intern/diff/D69936316/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147534
Approved by: https://github.com/bobrenjc93
2025-02-21 22:02:42 +00:00
2fb9416e6f [inductor][cpu] Move VNNI weight packing into AMX GEMM kernel for contiguous BMM weights (#146843)
Currently, the bfloat16 microkernel that uses AMX vectorization requires that the weights are in an interleaved VNNI format. For GEMM code, this hasn't been an issue because GEMM currently only supports constant weights, so the VNNI weight packing is done during compile-time and saved as a constant tensor to the graph. But for BMM ops where weights are not required to be constant, current code does an expensive reshape/VNNI packing for all BMM weights.

This PR removes the need for the reshape/packing for non-constant inputs by moving VNNI packing inside the AMX microkernel. A new `K * block_n` buffer is used to store the temporary packed weights. Weight packing involves interleaving 2 rows of weights.
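
The row interleaving can be sketched in a few lines of plain Python. This shows only the index mapping for packing pairs of rows, assuming the usual VNNI layout where `packed[k // 2][2 * n + k % 2] == weights[k][n]`; the function name is hypothetical and the real kernel performs this in C++ on a `K * block_n` buffer.

```python
def vnni_pack_pairs(weights):
    """Interleave each pair of consecutive rows of a K x N matrix.

    Returns a (K / 2) x (2 * N) matrix where row r holds
    weights[2r][0], weights[2r+1][0], weights[2r][1], weights[2r+1][1], ...
    """
    k = len(weights)
    if k % 2 != 0:
        raise ValueError("K must be even to pack rows in pairs")
    packed = []
    for r in range(0, k, 2):
        top, bottom = weights[r], weights[r + 1]
        row = []
        for a, b in zip(top, bottom):
            row.extend((a, b))
        packed.append(row)
    return packed


weights = [
    [0, 1, 2],
    [10, 11, 12],
    [20, 21, 22],
    [30, 31, 32],
]
for row in vnni_pack_pairs(weights):
    print(row)
```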

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146843
Approved by: https://github.com/jgong5, https://github.com/sanchitintel, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-02-21 21:46:00 +00:00
d91be786cb [cutlass backend] clear_on_fresh_inductor_cache when generatings cutlass ops (#147586)
Differential Revision: [D69966732](https://our.internmc.facebook.com/intern/diff/D69966732/)

This is needed if we want to generate cutlass ops with different instantiation level in one session.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147586
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
2025-02-21 21:28:41 +00:00
ef6b16ea9d Revert "[trymerge] Post initial starting merge comment on stacked PRs (#147028)"
This reverts commit 0295aabf6071c7da62325e6a29e04ed09a3e34ef.

Reverted https://github.com/pytorch/pytorch/pull/147028 on behalf of https://github.com/clee2000 due to I think this broke merge for non ghstack prs ([comment](https://github.com/pytorch/pytorch/pull/147028#issuecomment-2675532017))
2025-02-21 21:02:19 +00:00
05e6f15966 Revert "[Inductor][Triton] Rework casting logic to avoid illegal bitcast (#147395)"
This reverts commit e758d8b4d1632ea765bf8bc8e87b6039ae708b9f.

Reverted https://github.com/pytorch/pytorch/pull/147395 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, see D69890757 - servicelab_benchmark_pyper_local_runner, @eellison please help the author get this change landed ([comment](https://github.com/pytorch/pytorch/pull/147395#issuecomment-2675521966))
2025-02-21 20:56:40 +00:00
6eb795c9e8 [associative_scan] compile backend change to "eager" (#146973)
This PR fixes some issues with torch export discussed here: https://github.com/pytorch/pytorch/pull/140043#discussion_r1941932960

However, this backend change still does not resolve the failure for specific shapes mentioned here: https://github.com/pytorch/pytorch/issues/137943#issuecomment-2649564994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146973
Approved by: https://github.com/ydwu4
2025-02-21 20:21:41 +00:00
5ed1e23e3a Fix type stubs for SymmetricMemory (#146310)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146310
Approved by: https://github.com/yifuwang
2025-02-21 19:59:43 +00:00
fd8ae1aa04 [ROCm] gfx940 and gfx941 cleanup (#147394)
Removing gfx architectures not supported by ROCm.

NOTE: For users wanting to build PyTorch for gfx archs that are *not* supported by the official wheels on download.pytorch.org, you can build PyTorch from source for your desired gfx arch [using the PYTORCH_ROCM_ARCH env var](https://github.com/pytorch/pytorch/blob/main/README.md#amd-rocm-support).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147394
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-02-21 19:42:12 +00:00
c0ee62573a [Easy][optim] Add LBFGS params optional desc (#147579)
The [LBFGS docs](https://pytorch.org/docs/stable/generated/torch.optim.LBFGS.html#torch.optim.LBFGS) are missing the `optional` description for params, in comparison with other optimizer docs such as [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html)

## Test Result

### Before

![image](https://github.com/user-attachments/assets/34877490-16b4-4c68-bf6c-405bae563352)

### After

![image](https://github.com/user-attachments/assets/7fba94c8-7091-47b8-bdf1-ca7d779a027f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147579
Approved by: https://github.com/janeyx99
2025-02-21 19:38:10 +00:00
b5c3bb6185 Add continuous run for cachebench (#147546)
This PR adds a continuous run for cache bench.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147546
Approved by: https://github.com/huydhn
ghstack dependencies: #147537
2025-02-21 19:02:17 +00:00
76ce194b8e For addmm and bmm, check if config.autotune_fallback_to_aten before using aten as a fallback. Also fix bmm cutlass backend (#147148)
This PR also fixes BMM, which was silently failing for a while.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147148
Approved by: https://github.com/eellison
2025-02-21 18:41:52 +00:00
0295aabf60 [trymerge] Post initial starting merge comment on stacked PRs (#147028)
Post a small comment stating if a PR is being merged as part of a stack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147028
Approved by: https://github.com/ZainRizvi
2025-02-21 18:05:05 +00:00
2190ca7f47 Use __qualname__ in add_safe_globals and update Unpickling error raised for Unsupported GLOBAL (#146815)
- Fixes #146814

Change
```python
for f in _marked_safe_globals_set:
    module, name = f.__module__, f.__name__
```
to

```python
for f in _marked_safe_globals_set:
    module, name = f.__module__, f.__qualname__
```
to avoid overwriting entries that share the same key string.
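
The collision is easy to reproduce with two nested classes that share a `__name__`. The sketch below uses a hypothetical `build_table` helper and made-up class names; it only mirrors the key construction shown above and is not the actual `torch.serialization` code.

```python
class Outer1:
    class Config:
        pass


class Outer2:
    class Config:
        pass


def build_table(objs, attr):
    """Map 'module.<attr value>' to each object, as a dict keyed by string."""
    table = {}
    for f in objs:
        table[f"{f.__module__}.{getattr(f, attr)}"] = f
    return table


safe = [Outer1.Config, Outer2.Config]

by_name = build_table(safe, "__name__")
by_qualname = build_table(safe, "__qualname__")

# Both classes have __name__ == "Config", so the second overwrites the first.
print(len(by_name))
# __qualname__ keeps the enclosing class, so both entries survive.
print(len(by_qualname))
```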

A test is also added.
```
python test/test_serialization.py TestSerialization.test_serialization_nested_class
```

- Fixes #146886
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146815
Approved by: https://github.com/mikaylagawarecki
2025-02-21 18:04:59 +00:00
f4e4cfcb91 [caffe2] Ignore compiler option when building using clang (#147556)
Summary:
Skip adding unrecognized option optimize("-fno-tree-loop-vectorize") when building using clang

This piece of code began to be compiled after armv9a was set as the default compilation profile

Test Plan: buck2 run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12 lego/scripts:lego_cli -- run-locally --model_entity_id ${MODEL} --config_version ${CONFIG_VERSION} --disable_generate_new_checkpoint --checkpoint_version 0 --publish_context OFFLINE_PUBLISH --lego_pipeline aiplatform.modelstore.model_generation.lego.lego_pipeline_builder.gmpp_lego_pipeline --gmpp_config '{"gmpp_pipeline_descriptor": "aiplatform.modelstore.model_generation.v1.ads_pipelines.aimp_pyper_pipeline.model_generation_pipeline", "worker_process_number":12, "worker_thread_per_process_number": 6, "use_work_assignment": true}' 2>&1 | tee aimp_697790515.log

Reviewed By: andrewjcg

Differential Revision: D69947027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147556
Approved by: https://github.com/janeyx99
2025-02-21 17:46:04 +00:00
a0c7d96028 [Easy] Add Delimeter To Show Where Allocation Addr Begins (#147461)
Summary: When we print the addr we prepend an "s" or a "b" to the addr. Since the addr is in hex, a user might be confused and think the "b" is part of the address. Added an apostrophe to clear this up

Test Plan: CI

Differential Revision: D69828538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147461
Approved by: https://github.com/zdevito
2025-02-21 17:19:53 +00:00
784f64bb05 [inductor] triton support port-#5512, update cpp wrapper for gpu (#146917)
In short, this pull request enhances `constexprs` expression filtering.

Note: I tested the changes on xpu backend.

Part of https://github.com/pytorch/pytorch/issues/144103

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146917
Approved by: https://github.com/EikanWang, https://github.com/etaf, https://github.com/davidberard98, https://github.com/YUNQIUGUO
2025-02-21 17:10:53 +00:00
6a6de0e09d better error message (#147532)
Differential Revision: [D69939736](https://our.internmc.facebook.com/intern/diff/D69939736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147532
Approved by: https://github.com/avikchaudhuri, https://github.com/zou3519
2025-02-21 17:08:47 +00:00
a8ce4d1846 Add cachebench (#147537)
This PR adds a new benchmark called cachebench in order to measure/demonstrate the prowess of PT2 caching.
```
python benchmarks/dynamo/cachebench.py --output="result.json"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147537
Approved by: https://github.com/jamesjwu
2025-02-21 17:06:45 +00:00
af1072ffb6 [Intel GPU] Enable BUILD_GRAPH for xpu_mkldnn (#147608)
For preparation of OneDNN based XPU SDPA enabling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147608
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-02-21 16:12:30 +00:00
d6bb1d7f0a Delete Mixed MM Special Casing (#147151)
Now that torchinductor supports prologue fusion we can delete all the mixed mm code. When I benchmarked int8 weight only mm in the new path compared to int8mm in the old path in the [following benchmark](https://gist.github.com/eellison/46e321709572c11c077d0612cb3492b7) I got a 1.244x geomean speedup comparing Huggingface linear shapes with bias. There's a couple reasons for the speedup:

- Prologue fusion is often unprofitable, even for int8 mm. Because the current mixed mm benchmarking only compares triton_int8_mm vs (dtype_conversion + cublas), we miss out on scenarios where the triton template is profitable but the prologue fusion is not.
- Similarly, we miss out on potential epilogue fusions like bias if we dispatch to the [fallback mixed mm](5006932cbc/torch/_inductor/kernel/mm.py (L750-L751)) that mixed_mm will dispatch to, instead of the deferred epilogue tuning in the current path.

It's possible some of the speedups would be smaller on larger models where the epilogue might get fused into a following kernel. Nonetheless, even if this is perf neutral it is worth landing for code deduplication.

The one kernel that is a little special and would not fall out of the prologue fusion is the uint4x2_mixed_mm kernel. It's still possible to generate with prologue fusion, but not currently exactly as the current [impl](bd370c138a/torch/_inductor/kernel/unpack_mixed_mm.py (L43-L49)). But the current impl does not compare to a cublas baseline, so I found that it is making things slower (35% slower on a not particularly big 1024, 1024, 1024 mm shape on h100). This should be fine to delete.

Future optimizations could include:

- cutlass prologue path
- making prologue fusion support the persistent tma based mm template. from @drisspg's experience this led to nice wins with fp8 but not as nice wins with bf16 mm. I think similarly, lower memory bandwidth int8 mm would benefit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147151
Approved by: https://github.com/drisspg, https://github.com/cpuhrsch
2025-02-21 16:02:40 +00:00
36c461af95 Support SymmetricMemory's signaling kernels on sm60 and sm70 (#146308)
By leveraging libcudacxx's utilities: https://nvidia.github.io/cccl/libcudacxx/extended_api/synchronization_primitives/atomic_ref.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146308
Approved by: https://github.com/yifuwang
2025-02-21 15:29:02 +00:00
7ce4974e50 Fix PEP585 update (#147536)
Summary: D69920347 causes a pyre failure due to changing a base object from typing.Iterable to abc.Iterable.  For now revert that change until it can be dealt with on its own.

Test Plan:
failures from D69920347 pass locally
unit tests pass

Reviewed By: oulgen

Differential Revision: D69936518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147536
Approved by: https://github.com/jeanschmidt
2025-02-21 14:37:03 +00:00
654f2666d9 Increase memory for linux binary builds (#147542)
Recently I detected that some linux manywheels builds are flaky ([ex](https://github.com/pytorch/pytorch/actions/runs/13438309056/job/37555475510)).

After investigating the runner logs, available disk space, network usage and CPU load, I could not detect any issues. Unfortunately, memory information is not available.

But given the symptoms, the likelihood of this being an OOM problem is high.

So, moving those build jobs from `linux.12xlarge.ephemeral` to `linux.12xlarge.memory.ephemeral`.

This change depends on https://github.com/pytorch/test-infra/pull/6316
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147542
Approved by: https://github.com/ZainRizvi, https://github.com/atalman
2025-02-21 14:15:40 +00:00
51748a5d1a Update OpenBLAS to 0.3.29 (#144857)
* Improvements for GEMM to GEMV kernels
 * Improvements for SVE kernels for SGEMV and DGEMV

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144857
Approved by: https://github.com/malfet
2025-02-21 10:07:06 +00:00
71d2827eeb Code Refactoring for getting start and stride from global ranks (#147230)
Summary: Code Refactoring for getting start and stride from global ranks, this function can be used in different collective backend.

Differential Revision: D69555405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147230
Approved by: https://github.com/kwen2501
2025-02-21 10:02:50 +00:00
e7bf490c43 [ROCm] Implemented dropout usage for RNN with MIOpen backend (#144572)
This PR fixes https://github.com/pytorch/pytorch/issues/107183 for ROCm.

Implemented the usage of new RNN descriptor for MIOpen backend that takes into account dropout rate value using dropout descriptor. This fixes associated test_RNN_dropout_state test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144572
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-02-21 10:01:27 +00:00
cffe7183f1 [cutlass backend] Fix standalone runner test after swizzle became a runtime parameter (#147554)
Differential Revision: [D69945114](https://our.internmc.facebook.com/intern/diff/D69945114/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147554
Approved by: https://github.com/mlazos
2025-02-21 09:27:44 +00:00
cyy
b61a556427 Turn onnx functions into static (#147598)
To avoid exposing ONNX symbols.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147598
Approved by: https://github.com/justinchuby
2025-02-21 07:40:28 +00:00
3395da7f7c Revert "Build a storage reader/writer to write checkpoints in HF format (#146352)"
This reverts commit c615b8c174c80936d365d19d8b8f4d9ad9a195f3.

Reverted https://github.com/pytorch/pytorch/pull/146352 on behalf of https://github.com/jeanschmidt due to Author ignored linting errors ([comment](https://github.com/pytorch/pytorch/pull/146352#issuecomment-2673789271))
2025-02-21 07:30:52 +00:00
e5da9df421 Revert "Increase memory for linux binary builds (#147542)"
This reverts commit 87e6e2924eb706b928cdfc4a11623b39259fa830.

Reverted https://github.com/pytorch/pytorch/pull/147542 on behalf of https://github.com/jeanschmidt due to seems that it is best to use another machine type ([comment](https://github.com/pytorch/pytorch/pull/147542#issuecomment-2673765724))
2025-02-21 07:14:57 +00:00
4986f0f52e [PT2]: allow empty dict to pass type check (#147167) (#147480)
Summary:

Seeing errors like the following when testing sigmoid for inline_cvr and perevent_cvr models:
```
terminate called after throwing an instance of 'c10::Error'
  what():  forward() Expected a value of type 'Dict[int, Tuple[Tensor, Tensor, Tensor]]' for argument 'event_based_features' but instead found type 'Dict[Any, Any]'.
```
Let empty dict pass type check.

Test Plan:
```
MODEL_ENTITY_ID=691508446
SNAPSHOT_ID=0
OTHER_MODEL_ENTITY_ID=649645886
OTHER_SNAPSHOT_ID=0
MODULE=local

buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- \
    --loadMode=BenchmarkAB \
    --inputNetFile=/data/users/${USER}/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${suffix} \
    --otherNetFile=/data/users/${USER}/models/${OTHER_MODEL_ENTITY_ID}/${OTHER_SNAPSHOT_ID}/${OTHER_MODEL_ENTITY_ID}_${OTHER_SNAPSHOT_ID}${suffix} \
    --moduleName=${MODULE} \
    --submodToDevice "" \
    --benchmarkDontRebatchSamples=true \
    --sampleInputFilePath=/data/users/${USER}/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/archive_.predictor.disagg.gpu.local/data/sample_inputs/local.pt
```

Reviewed By: yjhao

Differential Revision: D69871393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147480
Approved by: https://github.com/henryoier, https://github.com/jeanschmidt
2025-02-21 07:00:46 +00:00
c74b59fc1f [ROCm][TunableOp] resolve the rocBLAS version dynamically (#147363)
Dynamically gets rocBLAS version instead of relying on some preprocessing-time definitions which may be stale.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147363
Approved by: https://github.com/pruthvistony, https://github.com/naromero77amd, https://github.com/jeffdaily
2025-02-21 06:50:21 +00:00
86ae672b6a Use has_triton_package in _inductor.runtime.hints (#147442)
Use existing method for triton check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147442
Approved by: https://github.com/Skylion007
2025-02-21 05:52:00 +00:00
533b884870 [cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178)
Disabled by default for now behind `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1`

Just wanted to get this out before starting a series of SDPA cleanup PRs---the biggest thing is we don't need the boilerplate around all of the `build_graph_and_tensors*` functions anymore as we can now use the `UID`-style referencing of tensor nodes as was done for the Conv-V8 API backend.

CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141178
Approved by: https://github.com/jbschlosser
2025-02-21 05:22:19 +00:00
a2c3a2c5c4 Support serialization for uintx/intx in weights_only (#147500)
Summary:
Fixing the issue reported by huggingface

Test Plan:
python test/test_serialization.py -k test_serialization_uintx_intx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147500
Approved by: https://github.com/mikaylagawarecki
2025-02-21 04:38:44 +00:00
c615b8c174 Build a storage reader/writer to write checkpoints in HF format (#146352)
Summary: Title - we want to write checkpoints in HF format with DCP, this diff allows this for the non-distributed use case.

Test Plan:
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_hf_torchtune_storage

N6476188 --> able to save and load tensor in hf format

Differential Revision: D68444967

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146352
Approved by: https://github.com/saumishr
2025-02-21 03:31:21 +00:00
fe100c3c5b Add libtorch nightly build for CUDA 12.8 (#146265)
Try removing sm50 and sm60 to shrink binary size, and resolve the ld --relink error

"Architecture support for Maxwell, Pascal, and Volta is considered feature-complete and will be frozen in an upcoming release." from 12.8 release note.

Also updating the runner for cuda 12.8 test to g4dn (T4, sm75) due to the drop in sm50/60 support.

https://github.com/pytorch/pytorch/issues/145570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146265
Approved by: https://github.com/atalman
2025-02-21 03:04:06 +00:00
ba214ab56c TCPStore: soft fail bind when agent store active (#147465)
This makes it easier to roll out `TORCHELASTIC_USE_AGENT_STORE` by opportunistically swallowing bind errors when the agent store is enabled and the port matches `MASTER_PORT`.

This should be very safe as if the store is somehow not up and the envs are set, the TCPStore client connections will fail to connect so we end up with a slightly different error message but success/failure behavior is identical.
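
The soft-fail rule described above can be sketched with plain sockets. This is only an illustration, not the TCPStore implementation: `bind_or_defer` is a hypothetical helper, the environment is passed in as a dict, and treating `"1"` or `"true"` as "enabled" is an assumption.

```python
import socket


def bind_or_defer(port, env):
    """Try to bind `port` on localhost.

    Returns the bound socket, or None when the bind error is swallowed because
    an agent store is assumed to own the port already (hypothetical helper).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        return sock
    except OSError:
        sock.close()
        # Assumption: the flag counts as enabled when it is "1" or "true".
        agent_store = env.get("TORCHELASTIC_USE_AGENT_STORE", "").lower() in ("1", "true")
        if agent_store and env.get("MASTER_PORT") == str(port):
            # Swallow the error; the caller continues as a client only.
            return None
        # Any other bind failure is still reported to the caller.
        raise
```

If the store is not actually up, the later client connection fails instead, which matches the "slightly different error message" behavior described above.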

This also pybinds `c10d::SocketError` into Python so we can assert on the error type in tests.

https://docs.google.com/document/d/1CzOn_N53AiFxWGgbyMWSnd2elCJd4lZ-ajPg2lzcxoM/edit?tab=t.0#heading=h.2j2f5dimrdau

Test plan:

```
pytest test/distributed/test_store.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147465
Approved by: https://github.com/fduwjj
2025-02-21 03:02:26 +00:00
8a5265cb37 [Intel GPU] qlinear_pointwise.binary[_tensor] XPU support (#135337)
# Motivation
This PR intends to enable quantized fusion `qlinear+add` at Intel GPU backend.

At backend level, we register the op via the schemas `TORCH_SELECTIVE_NAME("onednn::qlinear_pointwise.binary")` and `TORCH_SELECTIVE_NAME("onednn::qlinear_pointwise.binary_tensor")`, which are the ones already defined in `x86InductorQuantizer`.

At Inductor level, we make a small modification in `torch/_inductor/fx_passes/quantization.py` to allow the signed int8 data type (s8) during op lowering. As for the pattern matching, we largely reuse the existing x86InductorQuantizer code.

# UT verification
```bash
python test/inductor/test_mkldnn_pattern_matcher.py -v \
    -k test_qlinear_add_xpu
```

# Runtime Verification
```bash
onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_f32::blocked:ab::f0_mask2 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:0.654408+sum:0.00511256+eltwise_relu,,4x4:4x4,0.0319824
```
The verbose output is collected from the UT. The attribute ` attr-post-ops:eltwise_linear:1:0.654408+sum:0.00511256+eltwise_relu` shows that the post add and ReLU are successfully fused into the GEMM computation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135337
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/liangan1, https://github.com/jerryzh168
ghstack dependencies: #133307, #135189

Co-authored-by: guangyey <guangye.yu@intel.com>
2025-02-21 02:09:28 +00:00
8b818ab58f Use float data type for Half sum in fallback implementation of batchnorm backward on CPU (#147353)
Fixes #147303.
Use float data type for Half sum in fallback implementation of batchnorm backward on CPU as the representation range of Half is small.
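
To illustrate why the accumulator type matters, here is a small stand-alone sketch. It is not the ATen kernel: it emulates Half rounding with Python's `struct` format `e`, and a Python float (a double) stands in for the float accumulator. The helper names are hypothetical.

```python
import math
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct rejects finite values beyond the Half range; model that as infinity.
        return math.copysign(math.inf, x)


def mean_with_half_accumulator(values):
    acc = 0.0
    for v in values:
        # Every partial sum is rounded back to Half, so it can overflow.
        acc = to_half(acc + to_half(v))
    return to_half(acc / len(values))


def mean_with_float_accumulator(values):
    acc = 0.0
    for v in values:
        # The partial sums stay in the wider type.
        acc += to_half(v)
    # Only the final, much smaller result is rounded to Half.
    return to_half(acc / len(values))


values = [1000.0] * 100
half_result = mean_with_half_accumulator(values)
float_result = mean_with_float_accumulator(values)
```

The running sum of these inputs exceeds the largest finite Half value (65504), so the Half accumulator saturates to infinity while the wider accumulator still yields a representable mean.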

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147353
Approved by: https://github.com/leslie-fang-intel, https://github.com/cpuhrsch
2025-02-21 01:33:33 +00:00
ac88a6c00d [fx] demote node prepend to self log from warning to debug (#147538)
FIXES https://github.com/pytorch/pytorch/issues/147175

This is harmless, not sure why this is a user warning. Writing reordering graph passes is more concise when we ignore this warning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147538
Approved by: https://github.com/yanboliang
2025-02-21 01:32:34 +00:00
4b35139a46 [ROCm][TunableOp] Fix TunableOp warmup environment variable. (#147412)
This PR corrects the behavior of the TunableOp warmup variables:
```
PYTORCH_TUNABLEOP_MAX_WARMUP_DURATION_MS
PYTORCH_TUNABLEOP_MAX_WARMUP_ITERATIONS
```

See the updated comments which describe how the environment variables are intended to work. Previously, if you only set one of the two environment variables the warmup iters would always be zero.

Manually tested the four possible combinations to make sure things still behave as intended.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147412
Approved by: https://github.com/jeffdaily
2025-02-21 00:29:58 +00:00
fdb1305ace reland "[sigmoid] Test OSS model runner with test_export.py" (#147535)
Summary: There are ~260 tests covering the corner cases of export in test_export.py. Utilizing them to test sigmoid in the OSS setting.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r _sigmoid

Differential Revision: D69937387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147535
Approved by: https://github.com/yiming0416
2025-02-20 23:45:13 +00:00
87e6e2924e Increase memory for linux binary builds (#147542)
Recently I detected that some linux manywheels builds are flaky ([ex](https://github.com/pytorch/pytorch/actions/runs/13438309056/job/37555475510)).

After investigating the runner logs, available disk space, network usage and CPU load, I could not detect any issues. Unfortunately, memory information is not available.

But given the symptoms, the likelihood of this being an OOM problem is high.

So, moving those build jobs from `linux.12xlarge.ephemeral` to `linux.24xlarge.ephemeral`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147542
Approved by: https://github.com/ZainRizvi, https://github.com/atalman
2025-02-20 23:02:45 +00:00
be0df96b50 Fix c++ implementation of strip_function_call (#147436)
#143063 was missing handling for a couple of UCS cases and also had some bugs in the way it dealt with errors.

- Fix all the UCS handling (and make some of the common code more common)
- Make sure all the error paths return `nullptr`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147436
Approved by: https://github.com/jansel
2025-02-20 20:41:21 +00:00
af31640391 [cutlass backend] enable mixed mm test (cutlass2x) for H100 (#147474)
I am okay with not landing this as well. The motivation is to make developing on H100 smoother.

The reason the current test works on A100 but not H100 is an alignment issue, which was caused by arch-specific filtering logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147474
Approved by: https://github.com/alexsamardzic, https://github.com/ColinPeppler
2025-02-20 20:28:44 +00:00
d068141c3b [cutlass backend] add subproc tests (#147173)
I want to separate subproc autotuning from the main tests. And I observed that for addmm, it can work without subproc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147173
Approved by: https://github.com/ColinPeppler
ghstack dependencies: #147169
2025-02-20 20:07:42 +00:00
2565951f8a [cutlass backend] remove triton from most tests and add an integration test (#147169)
Removing aten and triton from the list of backends for the tests that have it. Instead, add a small integration test to make sure autotuning works fine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147169
Approved by: https://github.com/ColinPeppler
2025-02-20 20:07:42 +00:00
fb1f7f6a09 [codemod] Fix unused-value issue in caffe2/aten/src/ATen/native/miopen/Conv_miopen.cpp +1 (#147496)
Summary:
LLVM has a warning `-Wunused-value` which we treat as an error because it's so often diagnostic of a code issue. Unused values often indicate a programming mistake, but can also just be unnecessary cruft that harms readability and performance.

For questions/comments, contact r-barnes.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Differential Revision: D69755123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147496
Approved by: https://github.com/Skylion007
2025-02-20 19:00:38 +00:00
6971b77510 [CPU Stream] Add noop for CPU stream record_event() and wait_event() (#145935)
Summary: Adds wait_event and record_event endpoints to CPU stream in order to facilitate device-agnostic code. Both methods are noops.

Test Plan: CI

Differential Revision: D68833927

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145935
Approved by: https://github.com/Skylion007
2025-02-20 18:50:55 +00:00
863ac20659 [CI] Do not overwrite return code of test file when fails for rerun disabled tests (#147484)
Do not overwrite the return code of a single file when it fails.  This will allow the log to be printed to stdout and the gha logs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147484
Approved by: https://github.com/ZainRizvi
2025-02-20 17:51:58 +00:00
83bb921a5a [ROCm] Update meta_registration for efficient attention (#146979)
Fixes a series of failing and skipped unit tests.

For nvidia hw, the logsumexp last dimension is required to be a multiple of 32.  This is not the case for rocm.
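
The shape rule can be sketched as follows. This is an illustration only: `logsumexp_last_dim` is a hypothetical helper, and returning the unpadded length on ROCm is an assumption based on the sentence above, not a copy of the meta registration.

```python
def logsumexp_last_dim(seq_len, is_rocm, alignment=32):
    """Expected size of the logsumexp last dimension (hypothetical helper)."""
    if is_rocm:
        # Assumption: no padding requirement on ROCm.
        return seq_len
    # Round up to the next multiple of `alignment`.
    return (seq_len + alignment - 1) // alignment * alignment
```

For a sequence length of 33 this gives 64 with padding and 33 without, which is the kind of mismatch a meta function has to account for.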

A related issue: https://github.com/pytorch/pytorch/issues/146848

The unit tests in question:
```bash
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_prev_13_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_prev_14_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_prev_15_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_11_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_14_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_15_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_17_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_1_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_1_freezing
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_2_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_3_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_4_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_6_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_prev_13_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_prev_14_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_prev_15_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_11_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_14_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_15_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_17_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_1_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_1_freezing
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_2_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_3_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_4_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_6_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146979
Approved by: https://github.com/shunting314
2025-02-20 15:05:13 +00:00
382fbcc1e4 add the torch.float8_e8m0fnu dtype to PyTorch (#147466)
Summary:

Continuing the work from https://github.com/pytorch/pytorch/pull/146427

Adds the `torch.float8_e8m0fnu` dtype to PyTorch, as detailed in
https://github.com/pytorch/pytorch/issues/146414 . Please see the issue for a detailed definition of the format.  Example of basic functionality:

```python
import torch

# round trip
x0 = torch.randn(4, 4, dtype=torch.float32)
x1 = x0.to(torch.float8_e8m0fnu)  # RNE rounding
x2 = x1.to(torch.float32)  # 2 ** exponent

# creation with empty
x0 = torch.empty(4, 4, dtype=torch.float8_e8m0fnu)

# printing
print(x0)
```
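
For readers unfamiliar with the format, the decode direction can be sketched in plain Python. This assumes the OCP Microscaling definition of E8M0 (8 exponent bits, no sign or mantissa bits, bias 127, `0xFF` reserved for NaN); `decode_e8m0` is a hypothetical helper, not a PyTorch API.

```python
import math

E8M0_BIAS = 127  # assumed exponent bias
E8M0_NAN = 0xFF  # assumed NaN encoding


def decode_e8m0(bits):
    """Decode one e8m0 byte into a Python float (hypothetical helper)."""
    if not 0 <= bits <= 0xFF:
        raise ValueError("e8m0 values are a single byte")
    if bits == E8M0_NAN:
        return math.nan
    # The value is a pure power of two: 2 ** (bits - bias).
    return math.ldexp(1.0, bits - E8M0_BIAS)
```

Because there is no mantissa, every finite value is a power of two, which is what the `2 ** exponent` comment in the round trip above refers to.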

Done in this PR:
* numerical correctness
* op coverage (except for `torch._scaled_mm`): create tensor, cast to/from float32
* printing a tensor works

For future PRs:
* performance optimizations for casting
* torch._scaled_mm
* PT2
* various cleanups (detailed in comments with issue numbers)

Test Plan:

```
pytest test/quantization/core/experimental/test_float8.py -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147466
Approved by: https://github.com/drisspg
2025-02-20 13:55:42 +00:00
574371d828 Add current cuda device index to FXGraphCache key (#147464)
This PR intends to fix the cache related issues from https://github.com/pytorch/pytorch/issues/147405.
It does *not* handle the dynamo recompile case in process, because it does not introduce any extra guards. For FXGraphCache and AOTAutogradCache, we simply have to have the device context in the cache key.

Note that for any function that accepts tensor inputs, the device context is naturally already included in the cache key by the metadata of example inputs. However, for functions that return constants or have no arguments, the device context still needs to be in the cache key.
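
A minimal sketch of the idea, not the actual FXGraphCache code: `cache_key` and its parameters are hypothetical, and the key is a plain SHA-256 over a JSON payload.

```python
import hashlib
import json


def cache_key(graph_source, input_devices, current_device_index):
    """Build a cache key that also covers the ambient device (hypothetical helper)."""
    payload = {
        "graph": graph_source,
        # Tensor inputs already carry their device in their metadata.
        "input_devices": list(input_devices),
        # Needed for graphs with no tensor inputs, e.g. ones that return constants.
        "current_device_index": current_device_index,
    }
    encoded = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


# A graph with no inputs: only the current device index tells the two keys apart.
key_dev0 = cache_key("return ones(2)", [], 0)
key_dev1 = cache_key("return ones(2)", [], 1)
```

Without the `current_device_index` field both calls would hash to the same key, so an artifact compiled for one device could be served for another.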

A more robust fix for this would be to have inductor generate device guards that are dynamic, instead of specialized. This would also help us share more cache artifacts.

I've added unit tests for FXGraphCache and AOTAutogradCache, both of which would fail without this change.

Differential Revision: [D69875939](https://our.internmc.facebook.com/intern/diff/D69875939)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147464
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
2025-02-20 12:38:21 +00:00
ead970c8d0 Revert "Add cifllow/riscv64 label"
This reverts commit 5116b27792d37c38039459c922a466581e219fc2.
(I've pushed to the wrong branch by accident)
2025-02-20 11:55:52 +01:00
5116b27792 Add cifllow/riscv64 label 2025-02-20 11:09:06 +01:00
6beba8dcce Optimize graph.py typing (#147099)
Optimize `graph.py` methods type annotation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147099
Approved by: https://github.com/cyyever, https://github.com/aorenste
2025-02-20 09:32:30 +00:00
f9b8121350 Make Inductor scheduler aware of _scaled_mm (#146992)
This is used for example to estimate runtime when doing comms overlap

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146992
Approved by: https://github.com/drisspg, https://github.com/eellison, https://github.com/shunting314
2025-02-20 09:02:31 +00:00
9da250aada type fully_shard so that the return value can be chained with typing enabled (#147489)
This allows for

```
fsdped = fully_shard(model)
fsdped.set_xyz()
```

same applies if `model` is actually a list of modules

Differential Revision: [D69888119](https://our.internmc.facebook.com/intern/diff/D69888119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147489
Approved by: https://github.com/Skylion007
ghstack dependencies: #147488
2025-02-20 08:43:16 +00:00
6a72aaadae Fix torch.max optional args dim, keepdim description (#147177)
[`torch.max`](https://pytorch.org/docs/stable/generated/torch.max.html#torch.max) takes the optional args `dim` and `keepdim`, but the documentation does not describe them as optional, even though users can omit them.

```python
>>> import torch
>>> a = torch.randn(3,1,3)
>>> a.max()
tensor(1.9145)
>>> a.max(dim=1)
torch.return_types.max(
values=tensor([[ 1.1436, -0.0728,  1.3312],
        [-0.4049,  0.1792, -1.2247],
        [ 0.8767, -0.7888,  1.9145]]),
indices=tensor([[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]))

```

## Changes

- Add `optional` description for `dim`, `keepdim`
- Add example of using `dim`, `keepdim`

## Test Result

### Before

![image](https://github.com/user-attachments/assets/3391bc45-b636-4e64-9406-04d80af0c087)

### After

![image](https://github.com/user-attachments/assets/1d70e282-409c-4573-b276-b8219fd6ef0a)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147177
Approved by: https://github.com/colesbury
2025-02-20 08:18:09 +00:00
452315c84f Fix RuntimeError: value cannot be converted to type int64_t without overflow (#147492)
The exact call is coming from here:

78a94c9114/torch/_inductor/memory.py (L161)

I have no idea why this error is being thrown and what mode/modes might be failing for this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147492
Approved by: https://github.com/eellison
2025-02-20 08:00:26 +00:00
a000c7e6d2 Add hint message for pack_padded_sequence (#146747)
Fixes #144207

Add truncate hint message in docs [torch.nn.utils.rnn.pack_padded_sequence](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_padded_sequence.html)

## Test Result

![image](https://github.com/user-attachments/assets/46258f36-f6c7-4f11-9213-8513e52a9001)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146747
Approved by: https://github.com/mikaylagawarecki
2025-02-20 06:27:07 +00:00
db4ce78d46 PEP585: More UP006 fixes (#146392)
This should be the final PR before we can enable RUFF UP006.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146392
Approved by: https://github.com/justinchuby, https://github.com/albanD, https://github.com/Skylion007
2025-02-20 06:18:13 +00:00
76ad19a549 [dynamo][codegen] Implement CSE for pre-graph graph-arg bytecode reconstruction (#147425)
This reduces fixed overhead seen in a few internal models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147425
Approved by: https://github.com/jansel, https://github.com/StrongerXi
2025-02-20 05:42:52 +00:00
8f6b9403c1 [audio hash update] update the pinned audio hash (#147423)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147423
Approved by: https://github.com/pytorchbot
2025-02-20 05:39:46 +00:00
77aa602871 [torchbind] Differentiate ScriptModule and ScriptObject with qualified name (#147399)
Summary:
This PR adds a _is_script_object method to differentiate ScriptModule and ScriptObject. The former inherits from ScriptObject in C++, so both pass the isinstance(obj, torch.ScriptObject) check.

The qualified name of a ScriptObject (i.e. custom class) starts with "__torch__.torch.classes"; this has been a widely used assumption for dealing with custom classes across our code base.
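
The name-based check can be sketched without PyTorch. `is_custom_class_name` is a hypothetical stand-in that works on the qualified name string only; it is not the `_is_script_object` implementation.

```python
CUSTOM_CLASS_PREFIX = "__torch__.torch.classes"


def is_custom_class_name(qualified_name):
    """True if the qualified name looks like a torchbind custom class (hypothetical helper)."""
    return qualified_name.startswith(CUSTOM_CLASS_PREFIX)


# The first name is a custom class; the second is a made-up module name.
foo_is_custom = is_custom_class_name("__torch__.torch.classes._TorchScriptTesting._Foo")
module_is_custom = is_custom_class_name("__torch__.MyModule")
```

An isinstance check cannot separate the two cases, which is why the prefix is used as the distinguishing signal.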

Test Plan: Add new test.

Differential Revision: D69685316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147399
Approved by: https://github.com/yushangdi
2025-02-20 04:57:57 +00:00
7185ca8348 [Cutlass] Add test verifying number of precompiles (#147477)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147477
Approved by: https://github.com/henrylhtsang
2025-02-20 04:47:57 +00:00
5f5b44f6bf [ROCm] Update inductor-periodic.yml to use the correct label (#147473)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147473
Approved by: https://github.com/jeffdaily
2025-02-20 04:44:18 +00:00
0d56b7e665 Support size oblivious max equation (#147344)
Addresses https://github.com/pytorch/pytorch/issues/125914 by detecting when we have a sym_max between {0, 1} and a summation of size-like unbacked symints.

The basic idea is max(1, u0 + u1) can be simplified to u0 + u1 if both u0 and u1 are size-like since their value ranges are [2, inf].
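
The reasoning can be checked with a small sketch. `can_drop_max` is a hypothetical helper that only looks at lower bounds; the real simplification operates on symbolic expressions.

```python
def can_drop_max(constant, lower_bounds):
    """True if max(constant, sum of terms) always equals the sum (hypothetical helper).

    `lower_bounds` holds the minimum value each term of the sum can take.
    """
    return sum(lower_bounds) >= constant


# Size-like symbols are treated as having range [2, inf], so the sum is at least 4.
size_like_ok = can_drop_max(1, [2, 2])
# With ordinary non-negative symbols the sum could be 0, so the max must stay.
plain_ok = can_drop_max(1, [0, 0])

# Brute-force confirmation over a small range of size-like values.
brute_force_ok = all(
    max(1, u0 + u1) == u0 + u1 for u0 in range(2, 50) for u1 in range(2, 50)
)
```

The same argument covers a constant of 0, since any lower bound that clears 1 also clears 0.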

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147344
Approved by: https://github.com/angelayi
2025-02-20 04:33:19 +00:00
0b0da81021 Support static method of torchbind attributes in torch.compile with inductor backend (#146927)
As title.

Many changes adapted from https://github.com/pytorch/pytorch/pull/129537.

Also, this diff is only for *static* methods of torchbind *attributes*. Some cases that are not supported/tested:
- dynamic torchbind objects
-  torchbind objects as an input to the module.

Note that in JIT Inductor, the attributes are lifted as inputs. So even if we just have torchbind objects as attributes, they will show up as inputs in the graph.

Example generated python code in torch.compile with inductor backend for the test case in `inductor/test_torchbind.py` (P1730554370):

```python
async_compile.wait(globals())
del async_compile

def call(args):
    arg1_1, arg2_1, arg3_1 = args
    args.clear()
    assert_size_stride(arg1_1, (2, 3), (3, 1))
    assert_size_stride(arg2_1, (2, 3), (3, 1))
    buf2 = empty_strided_cpu((2, 3), (3, 1), torch.float32)
    cpp_fused_add_0(arg1_1, arg2_1, buf2)
    del arg1_1
    del arg2_1
    # Topologically Sorted Source Nodes: [x, takes_foo_tuple_return], Original ATen: [aten.add]
    buf3 = torch.ops._TorchScriptTesting.takes_foo_tuple_return.default(arg3_1, buf2)
    buf4 = buf3[0]
    assert_size_stride(buf4, (2, 3), (3, 1))
    buf5 = buf3[1]
    assert_size_stride(buf5, (2, 3), (3, 1))
    buf6 = buf4; del buf4  # reuse
    cpp_fused_add_1(buf6, buf5)
    del buf5
    # Topologically Sorted Source Nodes: [y, b], Original ATen: [aten.add]
    buf7 = torch.ops._TorchScriptTesting.takes_foo.default(arg3_1, buf6)
    del buf3
    del buf6
    buf8 = buf7
    assert_size_stride(buf8, (2, 3), (3, 1))
    # Topologically Sorted Source Nodes: [c], Original ATen: []
    buf9 = torch.ops.higher_order.call_torchbind(arg3_1, 'add_tensor', buf2)
    del arg3_1
    del buf7
    buf10 = buf9
    assert_size_stride(buf10, (2, 3), (3, 1))
    del buf9
    buf11 = buf2; del buf2  # reuse
    cpp_fused_add_2(buf11, buf8, buf10)
    return (buf11, )

def benchmark_compiled_module(times=10, repeat=10):
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg1_1 = rand_strided((2, 3), (3, 1), device='cpu', dtype=torch.float32)
    arg2_1 = rand_strided((2, 3), (3, 1), device='cpu', dtype=torch.float32)
    import pickle
    global arg3_1
    arg3_1 = pickle.loads(b'\x80\x04\x95[\x00\x00\x00\x00\x00\x00\x00\x8c\x05torch\x94\x8c\x0cScriptObject\x94\x93\x94)\x81\x94]\x94(K\nK\x14e\x8c0__torch__.torch.classes._TorchScriptTesting._Foo\x94\x86\x94b.')
    fn = lambda: call([arg1_1, arg2_1, arg3_1])
    return print_performance(fn, times=times, repeat=repeat)

if __name__ == "__main__":
    from torch._inductor.wrapper_benchmark import compiled_module_main
    compiled_module_main('None', benchmark_compiled_module)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146927
Approved by: https://github.com/angelayi
2025-02-20 03:33:19 +00:00
de1cb0f351 capture the return value in the contract typing (#147488)
----

* the existing typing makes the return type `Optional[nn.Module]`
* this doesn't seem to be what the decorator actually does as it does
  not alter the original return type
* This PR aims to fix the typing
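
One way to express "the decorator does not alter the return type" is a signature-preserving type variable. This is a sketch under assumptions, not the actual `contract` code: `contract_sketch`, `Module`, and `shard` are hypothetical stand-ins.

```python
import functools
from typing import Any, Callable, TypeVar, cast

# F stands for the exact callable type being decorated, return type included.
F = TypeVar("F", bound=Callable[..., Any])


def contract_sketch(func: F) -> F:
    """Wrap `func` while keeping its signature visible to type checkers."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        # Any bookkeeping the decorator performs would go here.
        return func(*args, **kwargs)

    return cast(F, wrapper)


class Module:
    def set_xyz(self) -> str:
        return "configured"


@contract_sketch
def shard(module: Module) -> Module:
    return module


# The type checker sees `Module`, not `Optional[Module]`, so chaining works.
result = shard(Module()).set_xyz()
```

With an `Optional[...]` return annotation, the chained call would need a `None` check before a type checker accepts it.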

Differential Revision: [D69888120](https://our.internmc.facebook.com/intern/diff/D69888120)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147488
Approved by: https://github.com/Skylion007
2025-02-20 03:32:34 +00:00
fea718f062 [BaseHOP] change hop(subgraph, operands) to hop(subgraph, *operands) (#146730)
Our three main users are OK with this, with two of them (foreach_map,
invoke_quant) preferring it like this.

I was originally worried about BC issues (this now means you cannot add
any positional args) but I think that's not a concern -- one can always
add kwonly args.
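
A toy illustration of the calling convention, assuming nothing about the real BaseHOP class: `hop`, `add`, and `scale` are hypothetical. Once operands are variadic, any extra option has to be keyword-only, because a further positional value is simply treated as another operand.

```python
def hop(subgraph, *operands, scale=1):
    """Hypothetical HOP-style call: variadic operands, keyword-only options."""
    return scale * subgraph(*operands)


def add(a, b):
    return a + b


# Operands are passed positionally; the option must be named.
result = hop(add, 2, 3, scale=10)

# A trailing positional value becomes a third operand, which `add` rejects.
try:
    hop(add, 2, 3, 10)
    extra_positional_rejected = False
except TypeError:
    extra_positional_rejected = True
```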

Test Plan
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146730
Approved by: https://github.com/ydwu4, https://github.com/mlazos
2025-02-20 02:30:36 +00:00
f79b352f5a [Intel GPU] qconv_pointwise.binary XPU support (#135189)
# Motivation
This PR enables the quantized fusions `qconv+add` and `qconv+add+relu` on the Intel GPU backend.

At the backend level, we register the op via the schema `TORCH_SELECTIVE_NAME("onednn::qconv2d_pointwise.binary")`, which is the one already defined in `x86InductorQuantizer`.

At the Inductor level, we make a small modification in `torch/_inductor/fx_passes/quantization.py` to allow the signed int8 data type (s8) during op lowering. For pattern matching, we largely reuse the existing code from x86InductorQuantizer.

# UT verification
```bash
python test/inductor/test_mkldnn_pattern_matcher.py -v \
   -k test_qconv2d_add_xpu \
   -k test_qconv2d_add_relu_xpu 2>&1
```

# Runtime exemplification
Following is the oneDNN verbose collected from UT
```bash
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_s8::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32+dst:0:s32 attr-post-ops:eltwise_linear:1:0.337704+sum:0.0241217+eltwise_relu,alg:convolution_direct,mb1_ic3oc6_ih8oh6kh3sh1dh0ph0_iw8ow6kw3sw1dw0pw0,0.151123
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135189
Approved by: https://github.com/liangan1, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/jerryzh168
ghstack dependencies: #133307

Co-authored-by: guangyey <guangye.yu@intel.com>
2025-02-20 02:02:54 +00:00
93316cfe94 Move ir_pre_fusion.txt and ir_post_fusion.txt to TORCH_LOGS (#147248)
Fixes #147002

Moves ir_{pre, post}_fusion.txt to be controlled by TORCH_LOGS instead of TORCH_COMPILE_DEBUG.
Updated tests of these logs as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147248
Approved by: https://github.com/eellison
2025-02-20 00:26:17 +00:00
16e202a38e [dynamo] improved graph break messages for some common graph break sites [1/N] (#146525)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146525
Approved by: https://github.com/jansel
2025-02-20 00:08:13 +00:00
1e94c7aaa4 [draft_export] only clear pending unbacked symbols for overwritten kernels (#147427)
This was wrong, we were doing this in all cases
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147427
Approved by: https://github.com/angelayi
2025-02-20 00:07:54 +00:00
3986c3e4a6 [reland][cutlass backend] Do not change dtype of GEMM template for cutlass 3x (#147434)
Reland of https://github.com/pytorch/pytorch/pull/146877
incorporate forward fix (didn't land): https://github.com/pytorch/pytorch/pull/147185

Summary:
I think this is a change in the right direction.

Right now, when we try to find a cutlass gemm, we generate a bunch of gemm templates and filter out those that don't fit. For example, if we are doing bf16 x bf16 matmul, the gemm template for fp32 x fp32 is generated and filtered out.

However, for the dtype of bias, we would attempt to modify the dtype of the gemm template. I think this is a bad idea, since (1) the usable template is also being generated, and (2) this messes with the configuration name of the template.

I tested this offline. There isn't much difference in performance. However, with instantiation level 2222, I noticed way less "C++ compile error". This is probably due to using the right template?

Follow-ups are needed:
1. benchmark and dashboard
2. check our logic for setting alignment

with my change
https://www.internalfb.com/intern/paste/P1729604119/

without my change
https://www.internalfb.com/intern/paste/P1729624806/

Differential Revision: D69825865

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147434
Approved by: https://github.com/ColinPeppler
2025-02-20 00:07:07 +00:00
a88d7d4268 [util] fetch logical count cpu (#147413)
To match the vCPU count reported by AWS:

after (96), before (48)
Instance Ref: https://instances.vantage.sh/aws/ec2/g4dn.metal
before: https://hud.pytorch.org/utilization/13377376406/37360984234/1
after: https://hud.pytorch.org/utilization/13401543806/37435031356/1
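
For reference, a standard-library sketch of reading the logical CPU count. The helper name is hypothetical and this is not necessarily how the utilization script obtains the value.

```python
import os


def logical_cpu_count() -> int:
    """Return the number of logical CPUs, falling back to 1 if unknown."""
    # os.cpu_count() reports logical CPUs and may return None when undetermined.
    return os.cpu_count() or 1


count = logical_cpu_count()
```

On a machine with two hardware threads per core, the logical count is twice the physical core count, which is consistent with the 96 vs 48 figures above.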
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147413
Approved by: https://github.com/clee2000
2025-02-19 23:44:54 +00:00
004d65aeb0 Add type hints to cuda kernel (#147471)
Missed this in a previous PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147471
Approved by: https://github.com/eellison
2025-02-19 23:35:10 +00:00
48203bec63 [BE] remove sysconfig.get_config_var("LIBDIR") from cuda lib paths (#147409)
Summary: I think the path is not needed anymore. It was added in https://github.com/pytorch/pytorch/pull/126408, but it has been a while since then. See if CI complains.

Differential Revision: D69573185

See also  https://github.com/pytorch/pytorch/pull/147158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147409
Approved by: https://github.com/chenyang78
2025-02-19 23:04:22 +00:00
f63db6255f Re-land exclude upsample_bilinear2d.vec and nearest2d.vec from default export decomposition table (#147153)
Note: This is a re-land of https://github.com/pytorch/pytorch/pull/141791, which I reverted due to breaking some Meta-internal tests - an internal ET delegate did not handle the non-decomposed upsample_nearest2d, and it was not caught in CI. I've resolved that issue and should be ready to safely re-land.

Summary:
As upsample_bilinear2d.vec and upsample_nearest2d.vec are core ATen ops, they should not be decomposed by default in the export path. Because the operators have CompositeImplicitAutograd dispatch, their decomposition is registered by default. This change adds an override list for CIA decompositions being registered in the default decomp table.

In the long-term, we likely will want to exclude decompositions for all core-tagged CIA ops, but this will require all consumers to be ready to handle the remaining two ops, avg_pool1d, and adaptive_avg_pool1d. Until they are ready, I believe an explicit override list is the safest option.

Additionally, I've also removed the ExecuTorch XNNPACK delegate ConvertToUpsampleBilinear2d pass, as the pass breaks (and is not needed), given that the op is not decomposed. The purpose of this pass was originally to pattern match the decomposition and recompose it, but this is no longer necessary.

Test Plan:
Added a new test (`test_default_decomposition_core_cia_ops`) in test_export.py to verify that upsample_bilinear2d.vec (and in the future, other core-tagged CIA ops) are not decomposed by default. Also, I manually validated end to end with ExecuTorch that the op is not decomposed in to_edge (see N6238522).

```
buck test //caffe2/test:test_export -- test_default_decomposition_core_cia_ops
```

Differential Revision: D69625112

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147153
Approved by: https://github.com/manuelcandales
2025-02-19 23:03:29 +00:00
fb55bac3de [fr][fix] Split MatchState and dynamic info for fr analysis downstream (#147439)
The original MatchState type was declared as a Python Enum. Although we made it callable, we consumed the result right away. There are downstream cases that need it to be a regular Python class, which a Python Enum does not support. So we did a small refactoring to keep both the enum state and the dynamic info (culprit) for the fr analysis script.
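
One way to picture the split, as a sketch only: the enum holds the fixed state and a small class carries the dynamic info. The member names and the `MatchInfo` class are assumptions for illustration and may not match the actual code.

```python
from dataclasses import dataclass
from enum import Enum, auto


class MatchState(Enum):
    # Hypothetical members; the real enum may define different states.
    FULLY_MATCHED = auto()
    SIZE_OR_SYNTAX_MISMATCH = auto()


@dataclass
class MatchInfo:
    """Pairs the static enum state with per-instance dynamic info."""

    state: MatchState
    culprit: str = ""

    def __str__(self) -> str:
        if self.culprit:
            return f"{self.state.name}: {self.culprit}"
        return self.state.name


mismatch = MatchInfo(MatchState.SIZE_OR_SYNTAX_MISMATCH, "rank 3 differs")
```

Because `MatchInfo` is an ordinary class, downstream code can subclass it or attach more fields, which is not possible on the enum members themselves.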

Differential Revision: [D69830994](https://our.internmc.facebook.com/intern/diff/D69830994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147439
Approved by: https://github.com/fegin
2025-02-19 22:09:16 +00:00
41ae15faa3 [ONNX] Add scaffolding for onnx decomp and logic for op tests (#147392)
Create scaffold for onnx op test data and common logic. This PR creates the scaffolding for new onnx decomp functions described in https://github.com/pytorch/pytorch/issues/139301. It adds two ops: abs and add, and enables the related tests.

https://github.com/pytorch/pytorch/issues/139301
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147392
Approved by: https://github.com/titaiwangms
ghstack dependencies: #147396
2025-02-19 21:55:12 +00:00
24738768a8 more dist ops in non strict (#147417)
Summary: Previously we added support for `all_reduce` to non strict. This PR extends this support to other non-functional collectives that are remapped in Dynamo: `all_gather`, `all_gather_into_tensor`, `all_to_all_single`, `reduce_scatter_tensor`.

Test Plan: added unit tests

Differential Revision: D69813991

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147417
Approved by: https://github.com/angelayi
2025-02-19 21:29:16 +00:00
394676759d ci: Add h100 nightly perf testing (#146868)
This infrastructure has been up for a while so add a workflow to actually run things on it.

> [!IMPORTANT]
> We only have **14** linux.aws.h100 runners so it might be beneficial for us to actually pare this list down.
> Will leave it up to the compiler team to comment on this PR on which tests are actually important vs. what is not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146868
Approved by: https://github.com/eellison, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-02-19 21:13:17 +00:00
8bea08e5bc [BE] Fix tensor stub (#147384)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147384
Approved by: https://github.com/albanD, https://github.com/janeyx99, https://github.com/atalman
2025-02-19 19:47:03 +00:00
e758d8b4d1 [Inductor][Triton] Rework casting logic to avoid illegal bitcast (#147395)
Triton introduced checks for bitcasts where the casted value does not fit into the casted type (e.g. https://github.com/triton-lang/triton/pull/5926, though in this instance I think the issue is related to the type for the broadcast). Some routines in Inductor now perform illegal bitcasts. I reworked the compare and swap w/ index routine used in sort to remove the illegal bitcast (~~I left the bitcast for now, but I think it could probably be removed assuming the reshape does not change the type~~). The explicit cast is correct, and I don't think there are performance issues, but because the cast on the sum is not a bitcast I suppose there could be.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147395
Approved by: https://github.com/eellison
2025-02-19 19:45:01 +00:00
279c7f262e [ONNX] Refactor dispatcher and registry (#147396)
This PR sets up the registry to accept onnx decomp functions to be moved into PyTorch (https://github.com/pytorch/pytorch/issues/139301).

The ops from onnx script are currently appended to the registry. When the ops are moved into PyTorch, the moved ops take precedence because they appear first in the registry list.

After the migration, hooks for loading ops from onnx script will be removed.
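
The precedence rule can be sketched with a toy registry, assuming a dispatcher that returns the first registered candidate. `ToyRegistry` and the function names are hypothetical and do not reflect the real registry or dispatcher classes.

```python
class ToyRegistry:
    """Keeps implementations per op in registration order."""

    def __init__(self) -> None:
        self._impls: dict[str, list] = {}

    def register(self, op_name, fn) -> None:
        self._impls.setdefault(op_name, []).append(fn)

    def dispatch(self, op_name):
        candidates = self._impls.get(op_name)
        if not candidates:
            raise KeyError(f"no implementation registered for {op_name}")
        # The earliest registration wins.
        return candidates[0]


def in_tree_abs(x):
    return abs(x)


def external_abs(x):
    return -x if x < 0 else x


def external_neg(x):
    return -x


registry = ToyRegistry()
registry.register("abs", in_tree_abs)    # in-tree ops are registered first
registry.register("abs", external_abs)   # external ops are appended afterwards
registry.register("neg", external_neg)   # not yet migrated: only the external op exists
```

An op that has not been migrated yet still resolves to the appended external implementation.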

1. Use a private field `_pt_onnx_signature` to store function signatures to avoid conflicts
2. Update the registry to record the signature in OnnxDecompMeta and update the dispatcher to leverage the data structure
3. Update registry to prepare for onnx op registration, and update the onnx_impl decorator to support a no_compile option

Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147396
Approved by: https://github.com/titaiwangms
2025-02-19 19:38:28 +00:00
4f3c070b25 [inductor] GraphLowering code movement (#147335)
moved these methods under __init__ to be more idiomatic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147335
Approved by: https://github.com/eellison
ghstack dependencies: #147331
2025-02-19 19:32:30 +00:00
5a3a50c791 Update Arm Compute Library (ACL) to v25.02 (#147454)
Among many things, this version of ACL fixes the redundant declaration warning that we're blocked on in (#145942, #146620, #147337) and introduces better scheduling heuristics for GEMMs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147454
Approved by: https://github.com/malfet
2025-02-19 18:51:08 +00:00
9fee408daa [caffe2] disable warning for unused arguments (#147411)
Summary: Disable warnings on unused command line arguments for ukernels_asm.

Test Plan:
On top of D69602077:
```
$ buck2 build --flagfile fbsource//xplat/mode/arstudio/auto.py fbsource//xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack:ukernels_asmAppleMac
```

Differential Revision: D69807977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147411
Approved by: https://github.com/kimishpatel
2025-02-19 17:54:31 +00:00
5220d402b5 [ROCm] TopK optimizations for AMD GPUs (#146387)
TopK on ROCm performs better on the test suite with the default config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146387
Approved by: https://github.com/malfet, https://github.com/ngimel
2025-02-19 17:10:59 +00:00
e6c86952c6 Add CUDA 12.8 windows nightly build (#147037)
https://github.com/pytorch/pytorch/issues/145570

windows AMI is deployed to prod today, prepping the windows cuda 12.8 build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147037
Approved by: https://github.com/atalman
2025-02-19 16:59:32 +00:00
8cbf7d0d6e [Inductor UT][XPU] Skip fft_c2c case since it's not implemented on XPU. (#147351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147351
Approved by: https://github.com/jansel
2025-02-19 16:03:03 +00:00
ed83b0b70b [ddp] decouple python reducer from compilation mode (#147123)
Current implementation reads as: we will only actually use the "python_reducer" config if the DDP forward is compiled. Otherwise, we will silently fall back to C++ reducer + no DDPOptimizer.
I'm changing this behavior to always use the python reducer if the config is specified.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147123
Approved by: https://github.com/fegin
2025-02-19 15:51:40 +00:00
303ad1916f [FlexAttention] Fix weird generate stride call in flex decode (#147435)
# Summary
Seems like we had a redundant tuple unpack and that doesn't appear to be supported in new triton

Fixes https://github.com/pytorch/pytorch/issues/147373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147435
Approved by: https://github.com/BoyuanFeng
2025-02-19 12:12:27 +00:00
77dbd28535 [Cutlass] Restore search space for swizzle (#147224)
This restores the previous search space, since swizzle is now a runtime parameter, there shouldn't be extra compile-time overhead from searching this now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147224
Approved by: https://github.com/eellison
ghstack dependencies: #147222, #147223
2025-02-19 09:22:51 +00:00
e9b3ff0570 [Cutlass] Add support for runtime param choices, starting with swizzle (#147223)
This PR adds support for swizzle as a runtime parameter choice. Future runtime parameter choices can be added to the [get_runtime_arg_info](2d40f9fb52/torch/_inductor/codegen/cuda/cuda_template.py (L282)) list method and then possible choices can be [looped over similarly to swizzle](933f921b36/torch/_inductor/codegen/cuda/gemm_template.py (L532)). For precompile, we now filter choices by hash to only compile each distinct kernel source once.
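
The precompile filtering idea can be sketched as follows, assuming each choice is a `(name, source)` pair. The helper, the choice names, and the use of SHA-256 are illustrative assumptions, not the Inductor implementation.

```python
import hashlib


def dedupe_by_source(choices):
    """Keep the first choice for each distinct kernel source."""
    seen = set()
    unique = []
    for name, source in choices:
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((name, source))
    return unique


# Two choices differ only in a runtime parameter, so their source is identical.
choices = [
    ("gemm_swizzle_1", "kernel source A"),
    ("gemm_swizzle_2", "kernel source A"),
    ("gemm_other", "kernel source B"),
]
to_compile = dedupe_by_source(choices)
```

Only two of the three choices need compiling here, since the swizzle variants share one source.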

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147223
Approved by: https://github.com/Chillee, https://github.com/eellison
ghstack dependencies: #147222
2025-02-19 09:22:51 +00:00
81eb2a78ad [Inductor] Add autotuning artifact logging (#147222)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147222
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-02-19 09:22:42 +00:00
655b061ef0 [inductor] Freeze runtime asserts after shape prop but before codegen (#147331)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147331
Approved by: https://github.com/eellison
2025-02-19 06:29:13 +00:00
454fbd5bbe realize stride symbols in estimate_runtime (#146752)
Unfortunately, I could not create a local repro or a unit test.
fix https://github.com/pytorch/pytorch/issues/146686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146752
Approved by: https://github.com/bobrenjc93, https://github.com/bdhirsh
2025-02-19 06:02:49 +00:00
2c3680ce38 [apf] Fix input adapter (#147238)
Summary: Add support for inputs that no longer exist in `input_fields` but are not actually used by the original program. In this case, we just give them a dummy input based on the node's metadata.

Test Plan: Verified for S488841

Differential Revision: D69328093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147238
Approved by: https://github.com/pianpwk
2025-02-19 04:49:58 +00:00
465930ee81 Revert "[ROCm] ROCm-specific gemm tuning parameters" (#147388)
Summary:
This diff reverts D69573225 / https://github.com/pytorch/pytorch/pull/143286

15% cold compile time regression, see https://fb.workplace.com/groups/1075192433118967/permalink/1608559059782299/

Test Plan: NA

Differential Revision: D69790102

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147388
Approved by: https://github.com/yanboliang
2025-02-19 04:47:35 +00:00
4ece056791 Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)
Should resolve: https://github.com/pytorch/pytorch/issues/144768
We use one common nccl version for cuda builds 12.4-12.8 : ``NCCL_VERSION=v2.25.1-1``
For CUDA 11.8 we use legacy ``NCCL_VERSION=v2.21.1-1``
We use a pinned version of NCCL rather than a submodule.
Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj
2025-02-19 03:52:26 +00:00
bd370c138a fix pt2e block wise quantization unit test (#147406)
Differential Revision: D69806596

https://github.com/pytorch/pytorch/pull/146946 breaks the unit test, because the quant nodes are folded by default now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147406
Approved by: https://github.com/andrewor14, https://github.com/jerryzh168
2025-02-19 02:40:27 +00:00
5006932cbc [cutlass backend] forward fix of standalone runner for fbcode (#147158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147158
Approved by: https://github.com/chenyang78
2025-02-19 02:02:10 +00:00
f16d30137c [OSS] Update FileSystem methods to properly handle a string argument (#145751)
Summary: When testing, I tried to pass in a string argument to the FileSystem class' methods, which is a valid input, but the cast() that casted the string to a path wasn't working as was likely expected and was leading all the methods to fail with a string arg. Instead of a cast, a proper constructor should be used.
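
The underlying pitfall is that `typing.cast` only informs the type checker and performs no conversion at runtime. A small sketch with hypothetical helper names, not the FileSystem code itself:

```python
from pathlib import Path
from typing import Union, cast


def with_cast(path: Union[str, Path]) -> Path:
    # cast() returns its argument unchanged; a str stays a str at runtime.
    return cast(Path, path)


def with_constructor(path: Union[str, Path]) -> Path:
    # Path(...) actually builds a Path, whether given a str or a Path.
    return Path(path)


still_a_string = with_cast("checkpoints/model.pt")
real_path = with_constructor("checkpoints/model.pt")
```

Calling a `Path` method such as `.exists()` on `still_a_string` would raise `AttributeError`, which matches the failure mode described above.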

Test Plan: N6475361 methods don't throw an error with a string arg like they were previously

Differential Revision: D68713937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145751
Approved by: https://github.com/pradeepfn
2025-02-19 01:50:24 +00:00
953f7834cc [ONNX] Pick up missing types in dynamic shapes renaming (#147407)
Found in `_check_dynamic_shapes` that int and None type are valid inputs of dynamic_shapes.
This PR adds support for these two types and adds tests to guard the sync between the ONNX flatten logic and the one in export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147407
Approved by: https://github.com/justinchuby
2025-02-19 01:49:53 +00:00
757d7f28d1 [CD] Increase timeout for windows binary builds (#147390)
Mitigates https://github.com/pytorch/pytorch/issues/147376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147390
Approved by: https://github.com/huydhn, https://github.com/jeanschmidt, https://github.com/malfet
2025-02-19 01:15:04 +00:00
959d79f85f [ONNX] Move and improve error reproduction logic in test (#147391)
https://github.com/pytorch/pytorch/issues/139301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147391
Approved by: https://github.com/titaiwangms
2025-02-19 00:00:11 +00:00
babb2dc2af Revert "Add torch._scaled_mm for CPU (#139975)"
This reverts commit 6f7e67c43c13b5675b4ff60cbaa71e5083a22481.

Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/wdvr due to failing inductor mkldnn_pattern_matcher_cpu tests ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2667186865))
2025-02-18 23:58:31 +00:00
525ca80f53 add unbacked strict mode (#147333)
fixes #145775

This is the first step in introducing a "strict" mode where we don't silently specialize and don't silently graph break. At a high level, when we do mark_unbacked(... strict=True), any time we specialize an unbacked symint we will explicitly error and tell the user their unbacked dimension was specialized to a single value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147333
Approved by: https://github.com/laithsakka
2025-02-18 23:33:55 +00:00
5d547d82e6 Add no_data_dependent_graph_break mode (#147342)
This adds a strict mode `TORCHDYNAMO_UNBACKED_STRICT` to prevent graph breaking when we guard on data-dependent values. This is a better UX for those who are actively trying to make their model more dynamic, but aren't close enough to full graph to use that flag directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147342
Approved by: https://github.com/laithsakka
2025-02-18 23:33:47 +00:00
bae049b439 Update addr doc (#146482)
Fixes https://github.com/pytorch/pytorch/issues/146399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146482
Approved by: https://github.com/janeyx99
2025-02-18 23:25:38 +00:00
ca397d82a6 [Sigmoid] Fix issues with constant folding and fba_ops (#146948)
Summary:
There are 2 issues:

- `skip_folding_node_fn` isn't considered when propagating constant values. Given a skipped node with constant inputs, it outputs a constant, so its users can output constant values and then be included in the constant graph. However, the skipped node itself is not included when the constant graph is extracted. This issue is fixed by checking for skipped nodes when propagating constant values and making a skipped node output an unknown (non-constant) value, so that its users cannot output constants.

- `fba_linear` op can be included in the constant graph but it is not implemented for CPU so constant graph cannot be executed. This issue is fixed by converting `fba_linear` to `aten.addmm`.

- A refactor to allow more fba_ops to be included in the constant graph (via mapping fba_ops to aten ops).
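
The first fix can be sketched on a toy graph, assuming nodes are given in topological order as `(name, input_names)` pairs. The function and node names are hypothetical and this is not the actual constant folder.

```python
def find_foldable_nodes(nodes, constant_inputs, skip_fn=lambda name: False):
    """Return the nodes whose outputs are known constants."""
    constant = set(constant_inputs)
    for name, inputs in nodes:
        if skip_fn(name):
            # A skipped node is treated as producing an unknown value,
            # so nothing downstream of it is marked constant either.
            continue
        if inputs and all(i in constant for i in inputs):
            constant.add(name)
    return constant - set(constant_inputs)


# w is a constant weight; a -> b is a chain, c reads w directly.
nodes = [("a", ["w"]), ("b", ["a"]), ("c", ["w"])]

folded_all = find_foldable_nodes(nodes, {"w"})
folded_with_skip = find_foldable_nodes(nodes, {"w"}, skip_fn=lambda n: n == "a")
```

With `a` skipped, its user `b` is no longer treated as constant, so the extracted constant graph never references a node it does not contain.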

Reviewed By: StellarrZ

Differential Revision: D68716393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146948
Approved by: https://github.com/zhxchen17
2025-02-18 23:17:47 +00:00
c9a15d980f [FSDP2] Simplify shard_placement_fn in test (#146847)
Summary: Found this while checking `shard_placement_fn` for Shampoo shard independent implementation.

Test Plan: OSS CI & tests

Differential Revision: D69412878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146847
Approved by: https://github.com/awgu
2025-02-18 23:01:26 +00:00
c8433c2c6c [BE] correct docs for clock_rate to MHz, fixes #147098 (#147393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147393
Approved by: https://github.com/andrewor14
2025-02-18 22:59:58 +00:00
a21a123fd5 Add fqn_modifier at loading_state_dict and unit test (#146557)
In a Fusion model, users might change the state_dict keys with a state_dict_hook.
The load_state_dict APIs here don't call model.state_dict(), so the hooks are never invoked to change the keys, which causes a mismatch between the FQNs and the state_dict keys.

This PR suggests that users declare how they change the state_dict key prefix (the name is up to them; by default we call it "fqn_modifiers").
While loading the state_dict, we apply the same prefix change when computing the FQNs, so that the keys are processed the same way as through the state_dict hook.

For example:
There's a state_dict_hook:

```
def _state_dict_hook(self, destination, prefix, keep_vars):
    """Remove "embedding" from the original embedding in the state_dict
    name. This keeps the original state dict name for the embedding
    from before fusing with the FusionEmbedding.

    [!Note] This update changes the order of the OrderedDict
    """
    key = prefix + "embedding.weight"
    new_key = prefix + "weight"
    destination[new_key] = destination[key]
    del destination[key]
```

In the dsd after this PR, we skip "embedding." before "weight" if we find the "fqn_modifiers" attribute on that module
```
def fqn_modifiers(self) -> Dict[str, str]:
    return {
        "weight": "embedding",
    }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146557
Approved by: https://github.com/fegin
2025-02-18 22:54:41 +00:00
7622e29a37 Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)"
This reverts commit eecee5863e698d19458b33df7bfecbda0a04557a.

Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks Locally building benchmarks ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2667054179))
2025-02-18 22:23:35 +00:00
3f35664ee8 More precise check for shared storage check in inductor/reinplace pass (#147050)
Currently, if two tensors share storage, we have some logic to avoid re-inplacing. Before this PR, two tensors were considered to share storage if they used the same underlying storage, even if they did not overlap. This diff enhances the check to exclude cases where we can easily tell that the tensors do not overlap.
Mitigates https://github.com/pytorch/pytorch/issues/139628 but does not fix the inductor issue in it.
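
As an illustration of an easy non-overlap case, consider two contiguous views into one storage, each described by a storage offset and an element count. This is a simplified sketch with a hypothetical helper; the actual check in the reinplace pass may differ.

```python
def contiguous_ranges_overlap(offset_a, numel_a, offset_b, numel_b):
    """Check overlap of half-open ranges [offset, offset + numel) in one storage."""
    if numel_a == 0 or numel_b == 0:
        return False  # an empty view cannot overlap anything
    return offset_a < offset_b + numel_b and offset_b < offset_a + numel_a


# Two halves of an 8-element buffer share storage but never touch.
halves_overlap = contiguous_ranges_overlap(0, 4, 4, 4)

# Extending the first view by one element makes them collide at index 4.
extended_overlap = contiguous_ranges_overlap(0, 5, 4, 4)
```

In the first case the views share storage yet are disjoint, which is the kind of situation where re-inplacing need not be blocked.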

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147050
Approved by: https://github.com/zou3519
2025-02-18 21:55:34 +00:00
63e8ad49b8 [dynamo] replace hardcoded eval frame control flags skip_code_recursive_flag/cache_limit_hit_flag (#146355)
This PR and the previous:
- Moves parts of `eval_frame.c` to C++.
- Reduces code duplication in `dynamo__custom_eval_frame` and makes the control flow more clear.
- Enables `convert_frame` to signal to `eval_frame.cpp` in a general manner how to evaluate this frame, recursive frames, and future frames with the same code object (default/compile, skip, run-only). e.g. this will allow us to change skipping/cache limit hit eval_frame behavior directly from convert_frame without requiring changes to C/C++.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146355
Approved by: https://github.com/jansel
ghstack dependencies: #145603
2025-02-18 21:37:12 +00:00
75db0fd8a0 [dynamo] refactor dynamo__custom_eval_frame to C++, refactor SKIP_CODE[_RECURSIVE] (#145603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145603
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-02-18 21:37:12 +00:00
eb892cd768 [codegen] enable SORT and TUPLE_REDUCTION for AMD Triton (#147340)
Looks like Triton's AMD backend supports multiple inputs already.
Let's enable SORT and TUPLE_REDUCTION for it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147340
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/eellison
2025-02-18 21:15:23 +00:00
1b047d5d7a Add link to non_blocking/pinmem tutorial in Tensor.to docstrings (#145651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145651
Approved by: https://github.com/svekars
2025-02-18 20:38:01 +00:00
166419b9c1 dynamo: Don't crash when encountering a object with no __name__ (#147246)
This was triggering on ScriptFunctions. Note that other than badly implemented C functions, this seems to be almost impossible to trigger, so I wrote a smaller unit test rather than a full repro. Let me know if people feel strongly and want a full reproduction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147246
Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/Skylion007
2025-02-18 20:35:49 +00:00
74682e8595 Fix typo (#147330)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147330
Approved by: https://github.com/srinivasreddy, https://github.com/Skylion007
2025-02-18 20:20:34 +00:00
d9b3d76b85 Fix linter warnings (#147386)
https://github.com/pytorch/pytorch/pull/145866 accidentally introduced a warning about const casts and also comparison of unsigned long int with signed long int.

This PR fixes both of those warnings.

Tested by running:

```
/usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. 
-I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda 
-DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/SoftMax.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o
```

And I got no warnings or errors. The same holds for `python setup.py develop`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147386
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-02-18 20:03:16 +00:00
302f56a1f2 Revert "Fix non-bitwise type annotations for Tensor operators (see #145838) (#146845)"
This reverts commit 59b7e52ad8f6146b4364515a7f3e54d6f3edd6da.

Reverted https://github.com/pytorch/pytorch/pull/146845 on behalf of https://github.com/jeanschmidt due to Seems to break a few code dependencies in multiple places ([comment](https://github.com/pytorch/pytorch/pull/146845#issuecomment-2666656834))
2025-02-18 19:01:27 +00:00
57060bebf3 [symbolic shapes] Add replacement for backed symints (#147240)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147240
Approved by: https://github.com/pianpwk
ghstack dependencies: #146939
2025-02-18 18:49:51 +00:00
84abeaad5c [export] Log evaluate_expr (#146939)
We want to log each symnode created so that we can do provenance tracking in the tlparse report generated for draft export. To do this, we want to assign a unique id to every symnode, which Python's `id` function already does, and then for every expression created, we can find the provenance by tracing back through its arguments' ids. This logging only happens when dtrace_structured is enabled, which is only the case when running draft export.

An example output is as follows:

<img width="799" alt="image" src="https://github.com/user-attachments/assets/88bb31b4-8c31-43fb-aa88-08b573b9f71d" />

For the increase in the compile_time_instruction_count benchmark, this seems unavoidable because I need to call `id` to get the unique identifier for each symnode. But I believe `id` is an inexpensive operation, so hopefully it should be ok?  I tried doing the following:
* Originally I was passing around `self`, which is a SymNode, which caused the compile time to be ~6.36M
* I changed it to pass around `id(self)` instead, which reduced the compile time to ~6.33M
* Then I changed it to be passed as a positional arg instead of a kwarg, which reduced the compile time to ~6.22M, but this doesn't seem to be a super worthwhile fix?
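
The id-based provenance idea described above can be illustrated with a small, self-contained sketch. Everything in it (`Node`, `log_expr`, `provenance`, `LOG`) is hypothetical and is not PyTorch's logging code; it only shows how recording `id(...)` for each created expression and for its arguments lets one walk back to the inputs.

```python
# Hypothetical sketch: none of these names come from PyTorch.
from dataclasses import dataclass


@dataclass(eq=False)  # identity-based equality, so id() is the natural key
class Node:
    expr: str
    args: tuple = ()


# Log keyed by id(node); each record stores argument ids, not the objects.
# Note: ids are only unique while the objects are alive, which holds here
# because every node stays bound to a variable.
LOG = {}


def log_expr(node):
    LOG[id(node)] = {"expr": node.expr, "arg_ids": [id(a) for a in node.args]}
    return node


def provenance(node_id):
    """Return the expressions reachable by following argument ids."""
    record = LOG[node_id]
    chain = [record["expr"]]
    for arg_id in record["arg_ids"]:
        chain.extend(provenance(arg_id))
    return chain


s0 = log_expr(Node("s0"))
s1 = log_expr(Node("s1"))
total = log_expr(Node("s0 + s1", (s0, s1)))
check = log_expr(Node("s0 + s1 > 4", (total,)))

print(provenance(id(check)))
```

Passing only the integer id around, rather than the object itself, mirrors the second bullet above; the sketch makes no claim about the resulting instruction counts.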

#suppress-bc-linter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146939
Approved by: https://github.com/oulgen
2025-02-18 18:49:51 +00:00
c6b331f7d9 Deprecate skip_code_recursive_on_cache_limit_hit config flag (#136970)
Fixes one of #136862

Make `skip_code_recursive_on_cache_limit_hit` flag deprecated.

Affected logic is in here:
6931c1644a/torch/_dynamo/convert_frame.py (L866-L876)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136970
Approved by: https://github.com/williamwen42
2025-02-18 18:48:23 +00:00
6f7e67c43c Add torch._scaled_mm for CPU (#139975)
This PR adds `torch._scaled_mm` for the CPU backend.

`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function for platforms that cannot run FP8 matmul using oneDNN. This PR also updates the various FP8-related UTs to support CPU tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2025-02-18 18:44:26 +00:00
dd2a943e14 Fix the AOTI compile failure with ARM CPU for Meta internal (#147204)
Summary: Fix the AOTI compile failure with ARM CPU for Meta internal

Differential Revision: D69642211

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147204
Approved by: https://github.com/houseroad
2025-02-18 17:54:34 +00:00
5d675de754 Update ck (#144799)
Updates the CK version and re-implements kernel generation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144799
Approved by: https://github.com/jianyuh
2025-02-18 17:00:27 +00:00
a00d2b5144 s390x: add cleanup for cancelled docker image builds (#147110)
When podman image build is cancelled,
a couple of processes are left behind,
and their existence prevents
proper shutdown of runner container.

Add cleanup step at the end of workflow
using new option recently introduced in podman:
https://github.com/containers/podman/pull/25102

Example of job preventing s390x worker cleaning up and restarting properly:
https://github.com/pytorch/pytorch/actions/runs/13289159296/job/37105230728
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147110
Approved by: https://github.com/huydhn
2025-02-18 16:26:46 +00:00
6edc419d69 Update torch-xpu-ops commit pin (#147358)
Update the torch-xpu-ops commit to [a14d1eaa834a616705068103dc8129319087e864](a14d1eaa83), includes:

- SparseCSR XPU support
- Refine build system

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147358
Approved by: https://github.com/EikanWang
2025-02-18 16:05:25 +00:00
0c8028e877 [export] Loosen symint input serialization (#147237)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147237
Approved by: https://github.com/avikchaudhuri
2025-02-18 13:03:47 +00:00
b10ba0a46c Unify all sympy versions to avoid conflicts within PyTorch (#147197)
As the title states.

There are some tiny differences between 1.13.1 and 1.13.3:
1.13.1:
2e489cf4b1/sympy/core/numbers.py (L1591)

1.13.3:
b4ce69ad5d/sympy/core/numbers.py (L1591)

**Previous PR:**
https://github.com/pytorch/pytorch/pull/143908

**ISSUE Related:**
https://github.com/pytorch/pytorch/issues/147144
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147197
Approved by: https://github.com/malfet
2025-02-18 10:51:43 +00:00
d9cf1debf9 [ROCm][Windows] Fix clang-cl error related to -Wmissing prototypes enabled (#146981)
Some of the Windows files (fused_kernels.cpp or temp_file.h) contain code that fails to compile with clang-cl when this flag is enabled.

This PR resolves the issue by ensuring that even if we build with clang-cl, it doesn't include those flags on windows.

Alternatively if needed, I can fix the files mentioned to pass under this flag.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146981
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-02-18 07:41:12 +00:00
49e8f9c965 Revert "Add torch._scaled_mm for CPU (#139975)"
This reverts commit 22fae4c5f94eb43f71a2eebc1904880740cb1d60.

Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/huydhn due to third time is the charm ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2664622598))
2025-02-18 05:11:32 +00:00
59a08138c5 [executorch hash update] update the pinned executorch hash (#147345)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147345
Approved by: https://github.com/pytorchbot
2025-02-18 05:08:06 +00:00
6a2bb629ec Update torch-xpu-ops commit pin (#147302)
Update the torch-xpu-ops commit to [b421032c8fed40df5eaee395c2e7f5f8a7bcc815](b421032c8f), includes:

- Correct int4 weight pack implementation
- Enhance build system: only build one shared library for the user

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147302
Approved by: https://github.com/EikanWang
2025-02-18 05:04:15 +00:00
59915b8dec [Intel GPU] qlinear at XPU backend (#133307)
# Motivation
The PR is intended to enable `onednn.qlinear` and `onednn.qlinear_unary` on Intel GPU.

We register the qlinear ops in the C++ backend via `TORCH_LIBRARY_IMPL`. The ops this PR registers are `onednn::qlinear_pointwise`, `onednn::qlinear_pointwise.tensor`, and `onednn::qlinear_prepack`. The prepack op transposes the weight to fit oneDNN's weight requirement and achieve higher performance.

Also, we remove the limitation of the corresponding annotation method in the `XPUInductorQuantizer` (`torch/ao/quantization/quantizer/xpu_inductor_quantizer.py`) to allow GPU linear conversion.

We add the kChar (`torch.int8`) dtype in `torch/_inductor/fx_passes/quantization` and `torch/_inductor/mkldnn_ir.py`, as signed int8 is the default INT8 data type on the GPU side.

We verified the op through UTs and e2e testing with models such as ResNet18 and ResNet50.

# UT verification
```
 DNNL_VERBOSE=0 TORCH_COMPILE_DEBUG=0 python test/inductor/test_mkldnn_pattern_matcher.py -v  \
     -k test_qlinear_xpu \
     -k test_qlinear_relu_xpu \
     -k test_qlinear_gelu_xpu
```

# Runtime exemplification
Here is the oneDNN verbose output collected by running the above UTs:
```
//pure int8 gemm
onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 dst_s8::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32+dst:0:s32,,2x4:4x3,0.187988
// post-relu fusion
onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_f32::blocked:ab::f0_mask2 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_relu,,2x4:4x4,0.115234
// post-gelu fusion
onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_gelu_tanh,,2x4:4x4,0.170898

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133307
Approved by: https://github.com/liangan1, https://github.com/guangyey, https://github.com/EikanWang, https://github.com/jerryzh168

Co-authored-by: guangyey <guangye.yu@intel.com>
2025-02-18 04:02:42 +00:00
bb8c4ecc6d Allow XPU device for validating the arguments to sparse compressed tensor factory functions (#147306)
During Sparse tensor conversion, a validity check is performed. We need to allow XPU to pass this check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147306
Approved by: https://github.com/EikanWang, https://github.com/Skylion007, https://github.com/guangyey
2025-02-18 03:55:54 +00:00
71484a2106 [pt2-benchmarks] Compiler reset on every run (#147313)
Internal benchmarks call `run` in a loop. Compiler reset gives a clean env

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147313
Approved by: https://github.com/jansel
2025-02-18 02:09:19 +00:00
708428704e patch for block-wise quantization + pt2e (#146946)
Summary: https://github.com/pytorch/pytorch/pull/144492 was reverted due to duplicate kernel registration. This PR will re-introduce the patch

Differential Revision: D69488779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146946
Approved by: https://github.com/jerryzh168, https://github.com/andrewor14
2025-02-18 01:15:26 +00:00
59b7e52ad8 Fix non-bitwise type annotations for Tensor operators (see #145838) (#146845)
Fix https://github.com/pytorch/pytorch/issues/145838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146845
Approved by: https://github.com/Skylion007
2025-02-17 22:42:16 +00:00
1393f9a76c [ROCm] Update inductor-perf-test-nightly-rocm.yml to use the correct labels & frequency (#147221)
This workflow takes around 75-80hrs on ROCm, so scaling down the frequency to once per week until we get more CI capacity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147221
Approved by: https://github.com/pruthvistony, https://github.com/huydhn
2025-02-17 19:29:27 +00:00
6c0e7463af Fix test_device_memory_allocated (#147311)
Fixes #147310

The tensor allocated by `torch.ones` is released immediately because nothing holds a reference to it, so the assertion that follows fails.
This PR stores it in a temporary variable to keep it alive.
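
The lifetime issue behind this fix can be reproduced without any device. The sketch below is a generic CPython illustration, not the actual test; `Buffer` is a hypothetical stand-in for an object that owns memory.

```python
import weakref


class Buffer:
    """Stand-in for an object that owns memory (hypothetical, not a tensor)."""

    def __init__(self, nbytes):
        self.nbytes = nbytes


# No name is bound to the object, so CPython's reference counting frees it
# as soon as the expression finishes; the weak reference is already dead.
dead_ref = weakref.ref(Buffer(1024))

# Binding the object to a temporary variable keeps it alive for later checks.
tmp = Buffer(1024)
live_ref = weakref.ref(tmp)

print(dead_ref() is None)  # True on CPython
print(live_ref() is tmp)   # True
```

The same reasoning applies to the test: memory held by an unnamed temporary may already be released by the time an assertion inspects the allocation counters.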

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147311
Approved by: https://github.com/guangyey, https://github.com/Skylion007
2025-02-17 19:00:53 +00:00
516133ddb0 Fix arvr macOS buck pytorch builds (#147292)
Summary:
X-link: https://github.com/ctrl-labs/src2/pull/42453

buck arvr macOS builds had a few issues that needed fixing.

Test Plan: build with buck

Differential Revision: D69722372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147292
Approved by: https://github.com/Skylion007
2025-02-17 18:47:24 +00:00
22fae4c5f9 Add torch._scaled_mm for CPU (#139975)
This PR adds `torch._scaled_mm` for the CPU backend.

`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function for platforms that cannot run FP8 matmul using oneDNN. This PR also updates the various FP8-related UTs to support CPU tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2025-02-17 18:39:10 +00:00
1b29de5c05 Add NEON implementation for 8 bit quantized embedding bag on aarch64 (#147322)
This improves performance by ~5.5x on NeoverseV1 cores using the following benchmarking script:
```
import torch
import torch.nn as nn
import numpy as np
import torch.autograd.profiler as profiler

np.random.seed(0)
torch.manual_seed(0)

class SimpleEmbeddingBagModel(nn.Module):
    def __init__(self, num_embeddings, embedding_dim):
        super(SimpleEmbeddingBagModel, self).__init__()

        weights = torch.from_numpy((np.random.random_sample((num_embeddings, embedding_dim)) + 1).astype(np.float32))
        obs = torch.ao.quantization.PerChannelMinMaxObserver(dtype=torch.quint8, qscheme=torch.per_channel_affine_float_qparams, ch_axis=0)
        obs(weights)
        qparams = obs.calculate_qparams()
        qweight = torch.quantize_per_channel(weights, qparams[0], qparams[1], axis=0, dtype=torch.quint8)

        # Defining the EmbeddingBag layer
        self.qembedding_bag = torch.ao.nn.quantized.EmbeddingBag(num_embeddings, embedding_dim, _weight=qweight,
                                                                 mode='sum', include_last_offset=True, dtype=torch.quint8)

    def forward(self, input, offsets):
        # Forward pass through the EmbeddingBag layer
        result = self.qembedding_bag(input, offsets, per_sample_weights=None)
        return result

num_embeddings = 40000000
embedding_dim = 128

model = SimpleEmbeddingBagModel(num_embeddings=num_embeddings, embedding_dim=embedding_dim)
model.eval()

multi_hot = 100
batch_size = 400

input_tensor = torch.randint(0, num_embeddings, (batch_size * multi_hot,), dtype=torch.long)

offsets = torch.tensor(range(0, batch_size * multi_hot + 1, multi_hot))

with torch.no_grad():
    # warm up
    _ = model(input_tensor, offsets)

    with profiler.profile(with_stack=True, profile_memory=False, record_shapes=True) as prof:
        for i in range(100):
            _ = model(input_tensor, offsets)
    print(prof.key_averages(group_by_input_shape=True).table(sort_by='self_cpu_time_total', row_limit=50))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147322
Approved by: https://github.com/malfet
2025-02-17 17:10:47 +00:00
71855a1cad Update slow tests (#147308)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147308
Approved by: https://github.com/pytorchbot
2025-02-17 12:03:40 +00:00
e8b20f6ef3 [MPS][BE] Turn exec_unary_kernel as MetalShaderLibrary method (#147299)
And delete duplicate implementations from SpecialOps and UnaryKernel.
Change input and output arguments order for SpecialOps kernels to match those of UnaryOps

Fixes https://github.com/pytorch/pytorch/issues/146770
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147299
Approved by: https://github.com/dcci
ghstack dependencies: #147296, #147297
2025-02-17 08:31:24 +00:00
ae5f7fec82 [Intel GPU] Enable fp64 GEMM (#140677)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140677
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/desertfire
2025-02-17 08:15:55 +00:00
2b30e94fc0 [BE] Make exec_unary_kernel take TensorIterator as argument (#147297)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147297
Approved by: https://github.com/dcci
ghstack dependencies: #147296
2025-02-17 07:34:35 +00:00
3d251e6512 [BE] Switch all structured funcs to stubs (#147296)
No need to have a separate foobar_out_mps when registering a dispatch to foobar_stub will do

And this makes `exec_unary_kernel` defined in UnaryKernel.mm and
SpecialOps.mm look very similar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147296
Approved by: https://github.com/dcci
2025-02-17 07:34:34 +00:00
424c1b82e0 [Inductor][CPP] Add the legalize low fp support for index expr (#147298)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/147279. The test case produced a low-precision floating-point value using `ops.index_expr`, but the CPP backend did not handle its legalization. This PR adds support for it.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_low_fp_index_expr_issue_147279
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147298
Approved by: https://github.com/jgong5
2025-02-17 07:11:20 +00:00
359165734b [executorch hash update] update the pinned executorch hash (#147294)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147294
Approved by: https://github.com/pytorchbot
2025-02-17 05:03:05 +00:00
ae351d4d0e [Intel GPU] allow_tf32 for oneDNN backend - XPU part (#137570)
# Motivation
Add the context variable `torch.backends.mkldnn.allow_tf32` to control tf32 computation in convolution kernels on the XPU side. The tf32 data type helps improve the performance of deep learning workloads during training/inference. The current PR uses the [oneDNN API fpmath_mode](https://oneapi-src.github.io/oneDNN/dev_guide_attributes_fpmath_mode.html#the-floating-point-math-mode-attribute) to trigger tf32 acceleration in convolution kernels.

# Validation
* UT to test the context variable
`python test/xpu/test_conv.py -k test_mkldnn_allow_tf32_get_set`

* Runtime exemplification
```
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic16oc33_ih50oh24kh3sh2dh0ph0_iw100ow49kw3sw2dw0pw0,0.649902
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.151855
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_data,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_undef::undef::: dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.167969
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_weights,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.26709
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_weights,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic16oc33_ih50oh24kh3sh2dh0ph0_iw100ow49kw3sw2dw0pw0,0.219971

```
The `fpmath:tf32` field in the verbose output shows that the context setting successfully triggers tf32 computation in the conv forward/backward_data/backward_weights kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137570
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
2025-02-17 01:46:43 +00:00
198ffbdf11 [MPS] Implement and test round.decimals (#147266)
If inductor can do it, why not eager
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147266
Approved by: https://github.com/Skylion007
ghstack dependencies: #147286
2025-02-16 23:17:13 +00:00
e738f7ba23 [BE]: Enable ruff rule SIM113 (#147290)
Lint rule that tells the user to avoid keeping track of their own counter and to use the builtin `enumerate` when possible.
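
As a minimal illustration of what the rule targets (the list contents here are made up for the example), the first loop below maintains its own counter, while the second produces the same result with `enumerate`:

```python
names = ["conv", "relu", "pool"]

# Pattern flagged by SIM113: a counter maintained by hand alongside the loop.
manual = []
idx = 0
for name in names:
    manual.append((idx, name))
    idx += 1

# Preferred form: let enumerate() produce the index.
with_enumerate = []
for idx, name in enumerate(names):
    with_enumerate.append((idx, name))

print(manual == with_enumerate)
```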

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147290
Approved by: https://github.com/jansel
2025-02-16 22:41:16 +00:00
a8fa4bcfd2 [StaticRuntime] Support a new pattern (aten::to with 5 inputs) for ClipRangesToGatherToOffsets (#147189)
Summary:
Support the following new pattern for ClipRangesToGatherToOffsets:

Before optimization:
```
%11175 : Tensor, %11176 : Tensor = fb::clip_ranges_gather(%int_66.1, %getitem_1784.1, %347)
%getattr_256.1 : int = prim::dtype(%11175)
%to_298.1 : Tensor = aten::to(%11176, %getattr_256.1, %13, %13, %12)
%lengths_to_offsets_333.1 : Tensor = fb::lengths_to_offsets(%to_298.1, %8)
```

After optimization:
```
%11199 : int = prim::dtype(%int_66.1)
%11200 : Tensor, %11201 : Tensor = fb::clip_ranges_gather_to_offsets(%int_66.1, %getitem_1784.1, %347, %8, %11199)
```

It is similar to https://github.com/pytorch/pytorch/pull/146931, but aten::to has 5 inputs instead of 4.

Differential Revision: D69627793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147189
Approved by: https://github.com/hanyilou123
2025-02-16 22:16:02 +00:00
5c0c99f658 [MPS][BE] Use stubs for floor/ceil/round/trunc (#147286)
To avoid duplicating logic that those ops are no-ops for integral dtypes
(And in preparation of adding `round_decimals` that calls round_stub if decimals are 0)

Tested for the corner cases by manually invoking `round`, `trunc`, `floor` and `ceil` for int dtypes
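
The dispatch idea in this message can be modelled in plain Python. This is only a hedged sketch: the real code works on tensors through MPS stubs, and the function bodies below are assumptions written for illustration, with scalar `int`/`float` values standing in for integral and floating dtypes.

```python
# Hypothetical pure-Python model of the dispatch described above.
def round_stub(value):
    # Rounding is a no-op for integral inputs.
    if isinstance(value, int):
        return value
    return float(round(value))


def round_decimals(value, decimals):
    if isinstance(value, int):
        return value
    if decimals == 0:
        # Reuse the plain rounding path instead of duplicating its logic.
        return round_stub(value)
    scale = 10.0 ** decimals
    return round(value * scale) / scale


print(round_decimals(7, 3))
print(round_decimals(2.6, 0))
print(round_decimals(1.2345, 2))
```

Keeping the integral no-op check in one shared place is the point of routing through a stub; the scaling step for nonzero `decimals` is one plausible way to express it, not necessarily what the Metal kernel does.
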
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147286
Approved by: https://github.com/Skylion007
2025-02-16 17:22:49 +00:00
d27ecf85db xpu: support sycl with torch.utils.cpp_extension APIs (#132945)
This patch adds support for building sycl kernels via the `torch.utils.cpp_extension.load`, `torch.utils.cpp_extension.load_inline` and (new) `class SyclExtension` APIs. Files with the `.sycl` extension are considered to contain sycl kernels and are compiled with `icpx` (the dpc++ sycl compiler from Intel). Files with other extensions, `.cpp`, `.cu`, are handled as before. The API supports building sycl along with other file types into a single extension.

Note that the `.sycl` file extension is a PyTorch convention for files containing sycl code, which I propose to adopt. We did follow up with the compiler team about introducing such a file extension in the compiler, but they are opposed to this. At the same time, discussion around a sycl file extension and adding sycl language support to tools such as cmake is ongoing. Eventually cmake is also considering introducing a file extension convention for sycl. I hope we can further influence the cmake and compiler communities to adopt the `.sycl` file extension more broadly.

By default, SYCL kernels are compiled for all Intel GPU devices for which the pytorch native aten SYCL kernels are compiled; at the moment these are `pvc,xe-lpg`. This behavior can be overridden by setting the `TORCH_XPU_ARCH_LIST` environment variable to a comma-separated list of the devices to compile for.
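
The extension-based routing and the `TORCH_XPU_ARCH_LIST` override described above can be sketched as follows. This is not the `torch.utils.cpp_extension` implementation: `pick_compiler` and `xpu_arch_list` are hypothetical helpers, and the compiler labels for non-`.sycl` files are assumptions made for the example.

```python
import os
from pathlib import Path

DEFAULT_XPU_ARCHS = "pvc,xe-lpg"  # default quoted in the commit message


def pick_compiler(source):
    """Hypothetical helper: choose a compiler label from the file extension."""
    suffix = Path(source).suffix
    if suffix == ".sycl":
        return "icpx"      # SYCL kernels, as described above
    if suffix == ".cu":
        return "nvcc"      # assumption: CUDA sources keep their usual compiler
    return "host-cxx"      # assumption: placeholder for the host C++ compiler


def xpu_arch_list(env=None):
    """Read the comma-separated device list, falling back to the default."""
    env = os.environ if env is None else env
    raw = env.get("TORCH_XPU_ARCH_LIST", DEFAULT_XPU_ARCHS)
    return [arch.strip() for arch in raw.split(",") if arch.strip()]


sources = ["ops.cpp", "kernel.sycl", "legacy.cu"]
print({src: pick_compiler(src) for src in sources})
print(xpu_arch_list({"TORCH_XPU_ARCH_LIST": "pvc"}))
```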

Fixes: #132944

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132945
Approved by: https://github.com/albanD, https://github.com/guangyey, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-16 16:50:59 +00:00
dd5d0ea6bb Revert "xpu: support sycl with torch.utils.cpp_extension APIs (#132945)"
This reverts commit 607379960bc5093a1fe51ff72c3e0fd39ac126ab.

Reverted https://github.com/pytorch/pytorch/pull/132945 on behalf of https://github.com/malfet due to It just broke all the tests, see b16ae97ad0/1 ([comment](https://github.com/pytorch/pytorch/pull/132945#issuecomment-2661498747))
2025-02-16 16:03:42 +00:00
b16ae97ad0 Generalize mixed precision in DDP (#146808)
**Motivation:**

1. Generalize mixed precision in DDP.
2. Enable `SyncBatchNorm` for XPU device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146808
Approved by: https://github.com/guangyey, https://github.com/gujinghui, https://github.com/wconstab
2025-02-16 11:59:40 +00:00
ee38a32c55 [Dynamo] support isinstance(...) check for type tuple (#146984)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146984
Approved by: https://github.com/jansel
2025-02-16 10:41:49 +00:00
607379960b xpu: support sycl with torch.utils.cpp_extension APIs (#132945)
This patch adds support for building sycl kernels via the `torch.utils.cpp_extension.load`, `torch.utils.cpp_extension.load_inline` and (new) `class SyclExtension` APIs. Files with the `.sycl` extension are considered to contain sycl kernels and are compiled with `icpx` (the dpc++ sycl compiler from Intel). Files with other extensions, `.cpp`, `.cu`, are handled as before. The API supports building sycl along with other file types into a single extension.

Note that the `.sycl` file extension is a PyTorch convention for files containing sycl code, which I propose to adopt. We did follow up with the compiler team about introducing such a file extension in the compiler, but they are opposed to this. At the same time, discussion around a sycl file extension and adding sycl language support to tools such as cmake is ongoing. Eventually cmake is also considering introducing a file extension convention for sycl. I hope we can further influence the cmake and compiler communities to adopt the `.sycl` file extension more broadly.

By default, SYCL kernels are compiled for all Intel GPU devices for which the pytorch native aten SYCL kernels are compiled; at the moment these are `pvc,xe-lpg`. This behavior can be overridden by setting the `TORCH_XPU_ARCH_LIST` environment variable to a comma-separated list of the devices to compile for.

Fixes: #132944

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132945
Approved by: https://github.com/albanD, https://github.com/guangyey
2025-02-16 10:16:09 +00:00
ed3b119c40 Skip unsupported types by MPS in test_torchinductor.py (#147211)
- Skip unsupported dtypes in `test_split_cumsum` (and manually skip int64 for MacOS13)
- Adapt `test_cat` to use `torch.half` instead of `torch.double` on MPS
- Skip `test_adaptive_avg_pool1d_argmax` as avgpool is not implemented for all sizes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147211
Approved by: https://github.com/jansel, https://github.com/Skylion007, https://github.com/dcci
2025-02-16 10:15:53 +00:00
0fb5b224b7 [DCP] Cache save plans: planner helpers and interface updates (#147116)
Summary:
This PR updates the planner interface and introduces the class variables to cache the local and global plans.
Two new helpers are also introduced which will be used to compare if the plans have changed across save attempts and merge the delta plans.
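
To make the role of the two helpers concrete, here is a rough sketch of the compare-and-merge idea. It is not the DCP planner code: `plan_changed` and `merge_delta_plans` are hypothetical names, plain tuples stand in for plan objects, and the convention that `None` marks an unchanged plan is an assumption made for this example.

```python
# Hypothetical sketch: plain tuples stand in for the planner's plan objects.
def plan_changed(cached_plan, new_plan):
    """A plan counts as changed when there is no cached copy or it differs."""
    return cached_plan is None or cached_plan != new_plan


def merge_delta_plans(cached_plans, delta_plans):
    """Keep the cached plan wherever the delta entry is None (unchanged)."""
    return [
        cached if delta is None else delta
        for cached, delta in zip(cached_plans, delta_plans)
    ]


cached = [("w", 0, 4), ("b", 0, 2)]  # plans from the previous save attempt
new = [("w", 0, 4), ("b", 0, 3)]     # plans computed for the current attempt
delta = [p if plan_changed(c, p) else None for c, p in zip(cached, new)]
merged = merge_delta_plans(cached, delta)

print(delta)
print(merged)
```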

Test Plan: UTs

Differential Revision: D69224488

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147116
Approved by: https://github.com/MeetVadakkanchery, https://github.com/huydhn
2025-02-16 07:18:26 +00:00
4bacd13c92 [executorch hash update] update the pinned executorch hash (#147273)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147273
Approved by: https://github.com/pytorchbot
2025-02-16 05:11:33 +00:00
8f20026bcb [Intel GPU] Support SparseCsrXPU codegen (#144722)
Adds a new dispatch key, `SparseCsrXPU`, to enable Intel GPU support for SparseCsr tensors.

Similar PR: https://github.com/pytorch/pytorch/pull/139267
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144722
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/albanD

Co-authored-by: Kanya-Mo <kanya.mo@intel.com>
2025-02-16 03:16:12 +00:00
1677a31019 [Inductor] Fix 3D tiling with permute (#147249)
This PR adds a test case and tiny fix for 3D tiling. Before this PR, tiling would crash because one of the candidates lacked a `"y"` dimension. Now, when we're calculating 3D tiling candidates, we assume the y size is 1 if it's missing.

The test case implements a 3D permute using block pointers.

```
@triton.jit
def triton_poi_fused_add_0(in_ptr0, out_ptr0, znumel, ynumel, xnumel, ZBLOCK : tl.constexpr, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    znumel = 51
    ynumel = 51
    xnumel = 51
    zoffset = tl.program_id(2) * ZBLOCK
    zindex = zoffset + tl.arange(0, ZBLOCK)[None, None, :]
    zmask = zindex < znumel
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :, None]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None]
    xmask = xindex < xnumel
    x2 = xindex
    y1 = yindex
    z0 = zindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[51, 51, 51], strides=[1, 51, 2601], block_shape=[XBLOCK, YBLOCK, ZBLOCK], order=[2, 1, 0], offsets=[xoffset, yoffset, zoffset]), boundary_check=[0, 1, 2])
    tmp1 = tl.load(tl.make_block_ptr(in_ptr0, shape=[51, 51, 51], strides=[51, 1, 2601], block_shape=[XBLOCK, YBLOCK, ZBLOCK], order=[2, 1, 0], offsets=[xoffset, yoffset, zoffset]), boundary_check=[0, 1, 2])
    tmp2 = tmp0 + tmp1
    tmp3 = tmp0 + tmp0
    tmp4 = tmp2 + tmp3
    tl.store(tl.make_block_ptr(out_ptr0, shape=[51, 51, 51], strides=[1, 51, 2601], block_shape=[XBLOCK, YBLOCK, ZBLOCK], order=[2, 1, 0], offsets=[xoffset, yoffset, zoffset]), tl.broadcast_to(tmp4, [XBLOCK, YBLOCK, ZBLOCK]).to(tl.float32), boundary_check=[0, 1, 2])
```
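
The fix described above (assuming a y size of 1 when a candidate has no `"y"` dimension) amounts to a defaulted lookup. The sketch below is an illustration only; representing tiling candidates as dictionaries and the helper name `tiling_sizes` are assumptions, not Inductor's data structures.

```python
# Hypothetical sketch: tiling candidates as dicts from dimension name to size.
def tiling_sizes(candidate):
    """Return (y, x) sizes, assuming y == 1 when there is no "y" entry."""
    return candidate.get("y", 1), candidate["x"]


candidates = [{"y": 51, "x": 2601}, {"x": 132651}]
print([tiling_sizes(c) for c in candidates])
```

A plain `candidate["y"]` would raise `KeyError` on the second candidate, which is the kind of crash the message describes.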

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147249
Approved by: https://github.com/jansel
2025-02-15 23:28:36 +00:00
44ee9ca593 [inductor] Add type annotations to _inductor/utils.py (#144108)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144108
Approved by: https://github.com/eellison
2025-02-15 23:13:41 +00:00
4ab967c44d all reduce non strict (#147133)
Summary:
Some distributed collectives like `all_reduce` have special handling in Dynamo, where they are mapped to functional collectives. Non-strict was previously blind to such mappings, which means using them would fail to trace. Here we show how intercepting them in non-strict's torch function mode can mimic this remapping logic. More ops to follow.

Side note: a recently added distributed test was in the wrong place, making the expected failures for non-strict not fire because we weren't actually generating those tests to begin with! Now fixed.

Test Plan: moved and updated test

Differential Revision: D69607140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147133
Approved by: https://github.com/tugsbayasgalan
2025-02-15 19:37:08 +00:00
75a4b73816 utils: Update md5 call to be fips compliant (#147252)
Updates md5 call to be fips compliant according to this issue:
* https://github.com/pytorch/pytorch/issues/147236

Not going to add a conditional here because the minimum Python version
that we support is already 3.9
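
As a rough illustration of what a FIPS-friendly md5 call can look like (a minimal sketch, not necessarily the exact change in this PR): `hashlib.md5` accepts a `usedforsecurity` keyword from Python 3.9 onward. The helper name `stable_digest` is hypothetical.

```python
import hashlib


def stable_digest(data: bytes) -> str:
    # usedforsecurity=False (Python 3.9+) declares that the digest is only
    # used for non-security purposes such as cache keys, which lets
    # FIPS-restricted builds permit the md5 call.
    return hashlib.md5(data, usedforsecurity=False).hexdigest()


digest = stable_digest(b"")
```

Because the keyword exists on every supported Python version, no version check is needed around the call.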

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147252
Approved by: https://github.com/huydhn, https://github.com/Skylion007, https://github.com/malfet
2025-02-15 15:19:08 +00:00
6ca5c22e31 Revert "Enable fp16 linear layers in PyTorch via ACL (#144992)"
This reverts commit 5b37249259ad50d9b4b32a78a5b5178a1eb3d110.

Reverted https://github.com/pytorch/pytorch/pull/144992 on behalf of https://github.com/nikhil-arm due to Accuracy Test failures ([comment](https://github.com/pytorch/pytorch/pull/144992#issuecomment-2660902238))
2025-02-15 12:40:59 +00:00
86be5d4421 remove unnecessary xpu availability check when retrieving aot flags (#146966)
As title

Retrieving xpu aot flags that the pytorch binary was compiled against is not the same as running the binary itself. Thus it doesn't seem necessary to check whether an xpu environment is available.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146966
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/dvrogozh, https://github.com/albanD
2025-02-15 09:15:49 +00:00
9e0b3e9b6c [Inductor] Fix Inplace Buffer inner name conflict (#147199)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/146975. When creating the `InplacedBuffer` inner name, we only count the number of unique `InplacedBuffer` or `RemovedArg` entries, so the names may conflict, as in the example reported in this issue:

```
---- make inplace create, input_name is: buf22; output_name is: buf27; buf.inner_name is: in_out_ptr2
dict_values([
InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf6', 'buf11']),
InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf6', 'buf11']),
InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf24', 'buf26']),
InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf24', 'buf26'])])

---- make inplace create, input_name is: buf0; output_name is: buf3; buf.inner_name is: in_out_ptr2
dict_values([
<torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>,
<torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>,
<torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>,
<torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>,
InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33']),
InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33'])
<torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>,
InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33']),
InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33'])
])
```

- The first time `in_out_ptr2` is created, there are 2 unique `InplacedBuffer`s.

- The second time `in_out_ptr2` is created, there is 1 `RemovedArg` and 1 unique `InplacedBuffer`.

They are 2 different `InplacedBuffer`s but share the same name, `in_out_ptr2`. In this PR, we fix this regression by counting the number of `RemovedArg` entries.
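
The collision can be reproduced with a small standalone model. This is only a sketch under assumptions: `RemovedArg` and `InplacedBuffer` here are simplified stand-ins, `naive_inner_name` and `fixed_inner_name` are hypothetical helpers, and the "fixed" variant shows one plausible way of counting removed entries individually rather than the PR's exact code.

```python
from dataclasses import dataclass, field


class RemovedArg:
    """Stand-in for the placeholder left behind when an argument is removed."""


REMOVED = RemovedArg()


@dataclass(eq=False)
class InplacedBuffer:
    inner_name: str
    other_names: list = field(default_factory=list)


def count_unique(values):
    # Distinct objects by identity; repeated references count once.
    return len({id(v) for v in values})


def naive_inner_name(inplace_buffers):
    # All REMOVED entries collapse into a single "unique" value.
    return f"in_out_ptr{count_unique(inplace_buffers.values())}"


def fixed_inner_name(inplace_buffers):
    alive = [v for v in inplace_buffers.values() if not isinstance(v, RemovedArg)]
    removed = [v for v in inplace_buffers.values() if isinstance(v, RemovedArg)]
    # Each removed entry is counted on its own, so the index keeps growing.
    return f"in_out_ptr{count_unique(alive) + len(removed)}"


# State before the first creation: two live buffers.
b0 = InplacedBuffer("in_out_ptr0", ["buf6", "buf11"])
b1 = InplacedBuffer("in_out_ptr1", ["buf24", "buf26"])
first = {"buf6": b0, "buf11": b0, "buf24": b1, "buf26": b1}

# State before the second creation: five removed entries plus in_out_ptr2.
# The r0..r4 keys are made-up names for the removed slots.
b2 = InplacedBuffer("in_out_ptr2", ["buf22", "buf27", "buf31", "buf33"])
second = {
    "r0": REMOVED, "r1": REMOVED, "r2": REMOVED, "r3": REMOVED,
    "buf22": b2, "buf27": b2, "r4": REMOVED, "buf31": b2, "buf33": b2,
}

first_name = naive_inner_name(first)
clashing_name = naive_inner_name(second)
safe_name = fixed_inner_name(second)
```

With the naive count, both states produce the same name even though `in_out_ptr2` is still alive in the second one; counting every removed entry yields a fresh index.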

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147199
Approved by: https://github.com/jansel
2025-02-15 08:31:06 +00:00
a30f145101 [inductor] Don't leak pointers to cpp_wrapper with lru_cache (#147233)
Putting lru_cache on methods will keep pointers to the `self` objects
alive forever and leak memory.
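
The effect is easy to observe in isolation. The following is a minimal sketch with hypothetical classes `Leaky` and `Safe` (not the Inductor code): the class-level `lru_cache` stores `self` as part of each cache key, so the instance stays reachable after its last external reference is dropped, while a per-instance dict cache is freed together with its owner.

```python
import functools
import gc
import weakref


class Leaky:
    @functools.lru_cache(maxsize=None)
    def compute(self, x):
        # The cache lives on the function object and keys on (self, x).
        return x * 2


class Safe:
    def __init__(self):
        # Per-instance cache: it is released together with the instance.
        self._cache = {}

    def compute(self, x):
        if x not in self._cache:
            self._cache[x] = x * 2
        return self._cache[x]


def survives_deletion(cls):
    obj = cls()
    obj.compute(1)
    ref = weakref.ref(obj)
    del obj
    gc.collect()
    # True means something still holds a strong reference to the instance.
    return ref() is not None


leaky_survives = survives_deletion(Leaky)
safe_survives = survives_deletion(Safe)
```

Other remedies exist as well, such as caching a module-level function that takes only hashable values instead of `self`.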

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147233
Approved by: https://github.com/yanboliang
2025-02-15 08:25:41 +00:00
9dc702875d [dynamo][mappingproxy][inspect] Support existing types.MappingProxyType (#147217)
Fixes https://github.com/pytorch/pytorch/issues/147162

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147217
Approved by: https://github.com/williamwen42
2025-02-15 07:59:33 +00:00
cyy
8daa742e8b Remove code for Python < 3.9 (#147181)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147181
Approved by: https://github.com/albanD
2025-02-15 06:43:26 +00:00
9919375cf1 [executorch hash update] update the pinned executorch hash (#147241)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147241
Approved by: https://github.com/pytorchbot
2025-02-15 05:02:22 +00:00
cyy
8f291e8c00 Fix clang-tidy warnings in torch/jit (#146963)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146963
Approved by: https://github.com/davidberard98
2025-02-15 03:36:59 +00:00
4233a77960 update kineto submodule to include fix for windows build (#147195)
Fixes an issue causing windows builds to fail
https://github.com/pytorch/kineto/pull/1039
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147195
Approved by: https://github.com/cyyever, https://github.com/davidberard98, https://github.com/sraikund16
2025-02-15 01:53:16 +00:00
c1fcba3648 [Inductor] Fix the lowering of squeeze when input is not contiguous (#146746)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/143498. The issue happens when we lower `select = torch.ops.aten.select.int(cat, 1, 0)`.

For example, when `cat` is contiguous with size[2, 2] stride[2,1]

- for eager, it returns a view of size[2,] stride[2,]
- for Inductor lowering, it returns the wrong stride 1 instead of 2
```
TensorBox(
  ReinterpretView(
    StorageBox(
      ConcatKernel(name='buf10', layout=FixedLayout('cpu', torch.int64, size=[u0, 2], stride=[2, 1]), inputs=[ComputedBuffer(name='buf8', layout=NonOwningLayout('cpu', torch.int64, size=[u0, 1], stride=[2, 1]), data=Pointwise(device=device(type='cpu'), dtype=torch.int64, inner_fn=<function ReinterpretView.make_loader.<locals>.loader at 0x7f6b856449d0>, ranges=[u0, 1])), ComputedBuffer(name='buf9', layout=NonOwningLayout('cpu', torch.int64, size=[u0, 1], stride=[2, 1]), data=Pointwise(device=device(type='cpu'), dtype=torch.int64, inner_fn=<function ReinterpretView.make_loader.<locals>.loader at 0x7f6b85644790>, ranges=[u0, 1]))])
    ),
    FixedLayout('cpu', torch.int64, size=[u0], stride=[**1**]),
    origins=OrderedSet([select])
  )
)
```

To fix this issue, we compute the correct stride when lowering `squeeze`.
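
The expected eager result can be checked with plain stride arithmetic. This is a sketch only: `select_view` is a hypothetical pure-Python helper modelling how selecting an index along one dimension produces a view, and it is not the Inductor lowering itself.

```python
def select_view(size, stride, dim, index):
    """Return (size, stride, storage_offset) of the view x.select(dim, index)."""
    # Selecting an index moves the start of the view by index * stride[dim]...
    offset = index * stride[dim]
    # ...and drops that dimension, keeping the other strides unchanged.
    new_size = size[:dim] + size[dim + 1:]
    new_stride = stride[:dim] + stride[dim + 1:]
    return new_size, new_stride, offset


# `cat` is contiguous with size [2, 2] and stride [2, 1]; select along dim 1.
view_size, view_stride, view_offset = select_view([2, 2], [2, 1], dim=1, index=0)
```

The remaining dimension keeps its original stride of 2, which matches the eager result described above; a stride of 1 would read adjacent elements of the same row instead.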

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_unbacked_symints.py -k test_issue_143498
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146746
Approved by: https://github.com/jgong5, https://github.com/sanchitintel, https://github.com/eellison
2025-02-15 01:33:04 +00:00
bf0c89a72f [dynamo] fix error message when logging graph that contains hops (#147227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147227
Approved by: https://github.com/zou3519
2025-02-15 00:53:44 +00:00
933f921b36 [PT][FSDP] support custom all reduce hook across FSDP units (#147114)
This change adds an API `set_all_reduce_hook` to the `FSDPModule` to
support customized all reduce either in native HSDP (2d mesh) setup or custom HSDP (1d FSDP + custom AR across replicas)
* For native HSDP, the original AR would still run as is and this hook allows for additional gradient modification post all reduce.
* For custom HSDP, the original AR will be skipped and all the logic is instead expected to be executed in the hook.

The custom hook is expected to perform operations in place (no return value).

Example basic usage:
```
model = ...
fully_shard(model, mesh=...)
model.set_all_reduce_hook(my_hook)
```

By default, the hook will run in the default all reduce stream post reduce scatter.
When native HSDP is NOT enabled, the custom hook can be specified to run in a custom stream. This custom stream will also be synchronized post reduce scatter similarly. See tests for examples.

Test Plan: CI

Differential Revision: D68255583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147114
Approved by: https://github.com/awgu
2025-02-15 00:38:00 +00:00
a9ae3340ca Fix triton masked loading for non-block tl.loads (#144782)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144782
Approved by: https://github.com/eellison
2025-02-15 00:07:33 +00:00
49727bbc9d Turn on prologue fusion (#147008)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147008
Approved by: https://github.com/masnesral
2025-02-14 23:36:21 +00:00
76f57e184a [dynamo] Make SliceVariable a subclass of VariableTracker (#147046)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147046
Approved by: https://github.com/StrongerXi
ghstack dependencies: #146819, #146995
2025-02-14 23:22:27 +00:00
a5c0dab900 [AOTInductor] Guard RAII_cpuMalloc with macro (#147150)
Summary: Silence RAII_cpuMalloc(size_t) defined but not used [-Wunused-function]

Test Plan: Existing tests

Differential Revision: D69623481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147150
Approved by: https://github.com/henrylhtsang
2025-02-14 23:21:35 +00:00
1224765286 [cond] make cond call fake kernel in dynamo (#147045)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147045
Approved by: https://github.com/zou3519
ghstack dependencies: #146954
2025-02-14 23:13:15 +00:00
85a82c5bc8 [cond] make cond re-dispatch in proxy mode (#146954)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146954
Approved by: https://github.com/zou3519
2025-02-14 23:13:14 +00:00
eecee5863e Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)
Should resolve: https://github.com/pytorch/pytorch/issues/144768
We use one common nccl version for cuda builds 12.4-12.8 : ``NCCL_VERSION=v2.25.1-1``
For CUDA 11.8 we use legacy ``NCCL_VERSION=v2.21.1-1``
We use a pinned version of NCCL rather than a submodule.
Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj
2025-02-14 21:23:19 +00:00
d38db94689 [inductor][refactor] Move _compile_file to cpp_builder (#147202)
Summary: To further consolidate cpp build logic into cpp_builder

Test Plan: CI

Differential Revision: D69595327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147202
Approved by: https://github.com/yushangdi
2025-02-14 21:02:30 +00:00
dd86491b35 [cutlass backend][BE] refactor tests to remove duplicate logic (#146743)
Doing many things here:
* remove duplicate hip checking logic
* check for CUDA in setup
* remove CUTLASS_DIR setting. That is not needed when building from source and fbcode anymore
* fix some typing errors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146743
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
2025-02-14 20:50:27 +00:00
6f035d8462 [torch] Make amdsmi cdll hook private (#147207)
Summary: https://github.com/pytorch/pytorch/actions/runs/13314282597/job/37186177974 yelled at me for landing a seemingly public API that's not exported. It's a private API, so let's prepend `_` to make that clear

Test Plan: CI

Differential Revision: D69665234

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147207
Approved by: https://github.com/PaulZhang12
2025-02-14 20:30:48 +00:00
272ead7b5e Make fx.node.map_arg() and .map_aggregate() generic (#146248)
## What's the problem?

The popular `fx.node.map_arg()` and `fx.node.map_aggregate()` apply operations recursively on `dict`s, `tuple`s, `list`s, etc, and return a new collection of the same type.

Unfortunately, their base input type is `Argument`, which is [very unspecific indeed](5d55a6585d/torch/fx/node.py (L48-L58)): most type information is just thrown away at the call site of either of these functions, as far as the type checker goes.

As `torch` moves to a more typed code base, this would force innocent, unsuspecting developers to add logically unnecessary casts or `# type: ignore` statements.

## What's the solution?

Making these two `node.map_*` functions generic on the first argument and return type means that type information is preserved for the type checker. (The signature of the other parameter, the function that visits the nodes and subnodes, has not changed, nor should it.)
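
The idea can be illustrated with a simplified, standalone function. This is a sketch, not the actual `torch.fx` implementation: it handles only `tuple`, `list` and `dict`, and the `cast` is an assumption about how one might keep the type checker satisfied.

```python
from typing import Any, Callable, TypeVar, cast

T = TypeVar("T")


def map_aggregate(a: T, fn: Callable[[Any], Any]) -> T:
    """Apply fn to every leaf of a nested tuple/list/dict, keeping container types."""
    result: Any
    if isinstance(a, tuple):
        result = tuple(map_aggregate(elem, fn) for elem in a)
    elif isinstance(a, list):
        result = [map_aggregate(elem, fn) for elem in a]
    elif isinstance(a, dict):
        result = {key: map_aggregate(value, fn) for key, value in a.items()}
    else:
        result = fn(a)
    # Tell the type checker the output has the same static type as the input.
    return cast(T, result)


args: tuple[int, list[int]] = (1, [2, 3])
# A type checker infers tuple[int, list[int]] for `doubled`, with no cast at the call site.
doubled = map_aggregate(args, lambda leaf: leaf * 2)
```

With a non-generic signature the call site would only see a broad union type, which is what forces the casts and `# type: ignore` comments mentioned above.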

## Won't it break everything?

It doesn't break the type checker - one place needed an extra hint.

There have been code breakages, resolved one, at least one new one... we'll see!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146248
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007
2025-02-14 19:25:32 +00:00
58f654b5ad [ONNX] Consolidate constants to a single location (#147166)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147166
Approved by: https://github.com/titaiwangms
ghstack dependencies: #147164, #147165
2025-02-14 19:08:19 +00:00
765bc30ab9 [ONNX] Set warning stacklevel so it appears at the torch.onnx call site (#147165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147165
Approved by: https://github.com/Skylion007
ghstack dependencies: #147164
2025-02-14 19:04:43 +00:00
9a1eac6704 [ONNX] Handle number of outputs in builder (#147164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147164
Approved by: https://github.com/titaiwangms
2025-02-14 19:03:51 +00:00
5517eb4398 Revert "[cutlass backend] Do not change dtype of GEMM template (#146877)"
This reverts commit 260b21b8bca6edd3e0b89b800d6efa8243f0d122.

Reverted https://github.com/pytorch/pytorch/pull/146877 on behalf of https://github.com/henrylhtsang due to let me resubmit  ([comment](https://github.com/pytorch/pytorch/pull/146877#issuecomment-2660053270))
2025-02-14 18:58:18 +00:00
aac5d1a289 Revert "Add torch._scaled_mm for CPU (#139975)"
This reverts commit f0bdc27f74f8b1d4ab6789156691ee0fd5cbb30f.

Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it looks like internal ideep version is too old to support this ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2660008996))
2025-02-14 18:31:54 +00:00
20a9938069 try print stacktrace for error (#147061)
Differential Revision: D69573525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147061
Approved by: https://github.com/Skylion007
2025-02-14 18:28:03 +00:00
8b5ee275fb [MPS] Fix cholesky_ex for empty inputs (#147159)
By making sure that `info` is actually initialized if the input is empty (nothing needs to be done about `out`, as it's guaranteed to be an empty tensor)

Also move output resizing logic before `input.numel()` check

Fixes https://github.com/pytorch/pytorch/issues/147128

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147159
Approved by: https://github.com/albanD
2025-02-14 17:44:08 +00:00
0d16188c06 [CI] Use job name to index into test times json (#147154)
When the test times are generated, it doesn't know what the build environment is because it's an environment variable.  But when we index into the test times, we (previously) didn't know what the job name is.  These are usually the same but sometimes they're different and when they're different it ends up using default, which can have unbalanced sharding

I think job name was added at some point to most of the CI environments but I didn't realize, so we can now update this code to use the job name instead so the generation and the indexing match

also upload stats workflow for mps

Checked that inductor_amx doesn't use default

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147154
Approved by: https://github.com/huydhn
2025-02-14 17:06:56 +00:00
e8fbc86de0 Make torch.cuda.gds APIs public (#147120)
Follow up to https://github.com/pytorch/pytorch/pull/145748 that turned USE_CUFILE on for CUDA 12.6 and 12.8 binaries

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147120
Approved by: https://github.com/albanD
2025-02-14 17:06:50 +00:00
c3853d924f Introduce new template heuristic for triton autotune configs (#144985)
Initial PR to refactor bulkiness of mm_common to allow for better device-specific specialisation e.g. in https://github.com/pytorch/pytorch/pull/143286 we require large conditionalisation to get ROCm specific optimisations in.

This PR introduces a new file `torch/_inductor/template_heuristics.py` which implements device specific subclasses for autotune configs:
- CPUConfigHeuristic()
- CUDAConfigHeuristic()
- ROCmConfigHeuristic()
- XPUConfigHeuristic()

These subclasses are integrated as part of the `InductorChoices` class, which will be the interface for the kernel files to access the configs.

The mm_common, mm_plus_mm and conv configurations are implemented in this class. In the future we plan to bring in flex attention configurations as well, so that all of the tuning config logic for templated triton kernels is handled in this file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144985
Approved by: https://github.com/jansel
2025-02-14 17:01:06 +00:00
e06ee4aa9f Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)"
This reverts commit 06f4a5c0e578d7da10ebdf14edcd24e5dcef78d6.

Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks macos builds: ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a package ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2659802389))
2025-02-14 16:44:46 +00:00
059dfe2081 Revert "update kineto submodule (#147015)"
This reverts commit d1997b610f5b974af7ebad6b9903d2d8f751d927.

Reverted https://github.com/pytorch/pytorch/pull/147015 on behalf of https://github.com/atalman due to broke windows builds ([comment](https://github.com/pytorch/pytorch/pull/147015#issuecomment-2659730304))
2025-02-14 16:11:08 +00:00
06f4a5c0e5 Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)
Should resolve: https://github.com/pytorch/pytorch/issues/144768
We use one common nccl version for cuda builds 12.4-12.8 : ``NCCL_VERSION=v2.25.1-1``
For CUDA 11.8 we use legacy ``NCCL_VERSION=v2.21.1-1``
We use a pinned version of NCCL rather than a submodule.
Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj
2025-02-14 15:29:59 +00:00
cefd9805de Add RAISE_VARARGS 0 (#146493)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146493
Approved by: https://github.com/zou3519
ghstack dependencies: #146498, #146492
2025-02-14 13:37:23 +00:00
134723ee1c Add WITH_EXCEPT_START opcode (#146492)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146492
Approved by: https://github.com/anijain2305, https://github.com/zou3519
ghstack dependencies: #146498
2025-02-14 13:37:23 +00:00
dbb86b78ad Add sys.exc_info and sys.exception (#146498)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146498
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-02-14 13:37:14 +00:00
ea188ac0c7 [export] Add meta for aten.bincount (#147129)
Fixes https://github.com/pytorch/pytorch/issues/147094
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147129
Approved by: https://github.com/pianpwk
2025-02-14 10:33:54 +00:00
de26ddfbdc Update torch-xpu-ops commit pin (#146671)
Update the torch-xpu-ops commit to [80c375570e2b6b2989a8610da1871f8a50dfddc7](80c375570e), includes:

- Aten operator coverage improvement
- SYCL kernel optimization
- Nested Tensor OPs support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146671
Approved by: https://github.com/EikanWang
2025-02-14 09:30:36 +00:00
bd019c0bb4 [Inductor][CPP] Fix node name for wgt delete (#147056)
**Summary**
This is a regression issue caused by a change in the FX node name. In commit 71010bf0972834e35a155e6a187e5c6649a5a36b, both the node name and target for the `get_attr` node in `V.graph.graph.nodes` were `_frozen_param2`. However, in the latest main, the node name has changed to `_reorder_linear_weight`. This PR fixes the regression by using the node's target instead of its name.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_cpp_weight_prune
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147056
Approved by: https://github.com/jgong5
2025-02-14 06:27:41 +00:00
10bc8f25b2 [MPS][BE] Migrate polar to use functor (#147184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147184
Approved by: https://github.com/dcci
ghstack dependencies: #147182, #147183
2025-02-14 06:25:36 +00:00
278ffd84fc [MPS][BE] Add copysign integral flavors as functor (#147183)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147183
Approved by: https://github.com/dcci
ghstack dependencies: #147182
2025-02-14 06:25:36 +00:00
2ef51cfb9d [BE][MPS] Infer results of functor (#147182)
Do not assume that functor will return the same results as its arguments, but rather dynamically infer it using `decltype` and `::metal::declval`
This is a no-op that prepares for migration of `copysign` of integral arguments, that would return a float
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147182
Approved by: https://github.com/dcci
2025-02-14 06:25:27 +00:00
331d5cf560 [inductor] [cpp] Support vectorization for score and mask in FlexAttention CPU (#143638)
## Description
We generate vectorized kernel for score and mask in FlexAttention with this PR.

## Modification
The main changes include:
- For the input and output buffers of the mask and score functions, we pass tensors instead of scalars.
- For the mask function, the original function, which works on a scalar, only includes the logic for calculating the mask value. This PR adds the logic for applying the mask to the qk_data tensor into the graph and then leverages the CPP backend to generate vectorized kernels.
  The original mask graph:
  ```python
  def mask_fn(b, h, q_idx, kv_idx):
      mask = q_idx >= kv_idx
      return mask
  ```
  The converted_mask_graph should be:
  ```python
  def converted_mask_fn(qk_data, b, h, q_idx, kv_idx):
      mask = q_idx >= kv_idx
      qk_data = torch.where(mask, qk_data, torch.full_like(qk_data, -float("inf")))
      return qk_data
  ```

## Benchmark
For q, k, v of shape `[1, 32, 1024, 128]`, using 40 CPU cores, we observe over 20x speedup compared with the non-vectorized version for both `is_causal` = `False` and `True`.

## Test plan
The existing FlexAttention UTs (`test/inductor/test_flex_attention.py`, `test/inductor/test_flex_decoding.py`) can cover the change in this PR.

## Output code

**Code before this PR is in scalar version:**

```cpp
// apply score mod function
for (int64_t row = 0; row < cur_qSplitSize; ++row) {
    for (int64_t col = 0; col < cur_kvSplitSize; col++) {
    std::vector<int64_t> b_idx = {i};
    std::vector<int64_t> h_idx = {j};
    std::vector<int64_t> q_idx = {m+row};
    int64_t phisical_kv_idx = n+col;
    if (use_kv_indice) {
        phisical_kv_idx= *kv_logical_data * kvBlockSize + col;
    }
    std::vector<int64_t> kv_idx = {phisical_kv_idx};
    accum_t* in_ptr0 = qk_data + row * cur_kvSplitSize + col;
    auto in_ptr1 = b_idx.data();
    auto in_ptr2 = h_idx.data();
    auto in_ptr3 = q_idx.data();
    auto in_ptr4 = kv_idx.data();

    accum_t* out_ptr0 = in_ptr0;
    {
        {
            {
                auto tmp0 = in_ptr0[static_cast<int64_t>(0L)];
                out_ptr0[static_cast<int64_t>(0L)] = tmp0;
            }
        }
    }

    }
}
// Apply block mask, fill unused with -inf
for (int64_t row = 0; row < cur_qSplitSize; ++row) {
    for (int64_t col = 0; col < cur_kvSplitSize; col++) {
    std::vector<int64_t> b_idx = {i};
    std::vector<int64_t> h_idx = {j};
    std::vector<int64_t> q_idx = {m+row};
    int64_t phisical_kv_idx = n+col;
    if (use_kv_indice) {
        phisical_kv_idx= *kv_logical_data * kvBlockSize + col;
    }
    std::vector<int64_t> kv_idx = {phisical_kv_idx};
    accum_t* qk_block = qk_data + row * cur_kvSplitSize + col;
    auto in_ptr1 = b_idx.data();
    auto in_ptr2 = h_idx.data();
    auto in_ptr3 = q_idx.data();
    auto in_ptr4 = kv_idx.data();

    std::vector<int64_t> temp = {0};
    int64_t* out_ptr1 = temp.data();
    {
        {
            {
                auto tmp0 = static_cast<bool>(true);
                out_ptr1[static_cast<int64_t>(0L)] = tmp0;
            }
        }
    }

    *qk_block = *out_ptr1 != 0
                    ? *qk_block
                    : -std::numeric_limits<accum_t>::infinity();
    }
}
```

**Code after this PR will be vectorized:**
```cpp
accum_t* in_ptr0 = qk_data;

auto in_ptr1 = b_idx.data();
auto in_ptr2 = h_idx.data();
auto in_ptr3 = q_idx.data();
auto in_ptr4 = kv_idx.data();

// apply score mod function
{

    accum_t* out_ptr0 = in_ptr0;
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(cur_qSplitSize); x0+=static_cast<int64_t>(1L))
        {
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(cur_kvSplitSize); x1+=static_cast<int64_t>(16L))
            {
                {
                    if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L))))))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0), static_cast<int64_t>(16));
                        tmp0.store(out_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0));
                    }
                    if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L)))) && x1 < static_cast<int64_t>(cur_kvSplitSize)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0), static_cast<int64_t>(cur_kvSplitSize + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L))))));
                        tmp0.store(out_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0), static_cast<int64_t>(cur_kvSplitSize + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L))))));
                    }
                }
            }
        }
    }

}

// Apply block mask, fill unused with -inf
{

    accum_t* out_ptr1 = in_ptr0;
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(cur_qSplitSize); x0+=static_cast<int64_t>(1L))
        {
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(cur_kvSplitSize); x1+=static_cast<int64_t>(16L))
            {
                {
                    if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L))))))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0), static_cast<int64_t>(16));
                        auto tmp1 = static_cast<bool>(true);
                        auto tmp2 = -std::numeric_limits<float>::infinity();
                        auto tmp3 = at::vec::VecMask<float,1>::from(tmp1);
                        auto tmp4 = at::vec::Vectorized<float>(tmp2);
                        auto tmp5 = decltype(tmp0)::blendv(tmp4, tmp0, tmp3.template cast<float,1>());
                        tmp5.store(out_ptr1 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0));
                    }
                    if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L)))) && x1 < static_cast<int64_t>(cur_kvSplitSize)))
                    {
                        for (int64_t x1_tail = static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L))));x1_tail < static_cast<int64_t>(cur_kvSplitSize); x1_tail++)
                        {
                            auto tmp0 = in_ptr0[static_cast<int64_t>(x1_tail + cur_kvSplitSize*x0)];
                            auto tmp1 = static_cast<bool>(true);
                            auto tmp2 = -std::numeric_limits<float>::infinity();
                            auto tmp3 = tmp1 ? tmp0 : tmp2;
                            out_ptr1[static_cast<int64_t>(x1_tail + cur_kvSplitSize*x0)] = tmp3;
                        }
                    }
                }
            }
        }
    }

}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143638
Approved by: https://github.com/jgong5, https://github.com/drisspg, https://github.com/leslie-fang-intel
2025-02-14 05:26:18 +00:00
ce38bfd299 [executorch hash update] update the pinned executorch hash (#147157)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147157
Approved by: https://github.com/pytorchbot
2025-02-14 05:04:17 +00:00
92f669e39c [BE] Use c10::multiply_integers in cholesky_impl (#147163)
That replaces explicit for loop

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147163
Approved by: https://github.com/huydhn
2025-02-14 03:59:17 +00:00
2d089a5697 [dynamo] Remove unintended lru_cache (#147147)
I forgot to remove it while adding the frozenset `__contains__` method in this PR
- https://github.com/pytorch/pytorch/pull/146062?fbclid=IwZXh0bgNhZW0CMTEAAR3S_qq8bYxO7pDuHqpr2X-vqkXQrY0KtT14z46bfuRDYikjJBet3uKF2dE_aem_o1c7I4eawKyaEsfiWhnTmw

This is causing a memory leak

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147147
Approved by: https://github.com/williamwen42
2025-02-14 03:55:39 +00:00
6344ca1dd4 [BE][Ez]: Apply FURB188: use str remove(pre|suf)fix (#146997)
Since we are on 3.9, we can use this nice str builtin which is more readable and more efficient.
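
For reference, a short sketch of the builtins in question next to the slicing idiom they replace; the file name used here is made up for illustration.

```python
name = "test_inductor_utils.py"

# Python 3.9+: each call strips the affix only if it is actually present.
stem = name.removeprefix("test_").removesuffix(".py")

# Older idiom that FURB188 flags: a manual startswith check plus slicing.
legacy = name[len("test_"):] if name.startswith("test_") else name

# A prefix that does not match leaves the string untouched.
unchanged = name.removeprefix("bench_")
```

Unlike `str.lstrip` and `str.rstrip`, which strip any of the given characters, these methods remove exactly one occurrence of the full affix.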

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146997
Approved by: https://github.com/XuehaiPan, https://github.com/cyyever, https://github.com/jansel
2025-02-14 03:38:07 +00:00
cyy
d473c212fd Remove code for Python < 3.9 (#147097)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147097
Approved by: https://github.com/albanD
2025-02-14 03:22:49 +00:00
880e176544 [inductor] Fix for pattern file contains 'getitem' fails during impor… (#144980)
…t of the pattern module

  For example, any pattern module that has the following pattern generated fails to import because
  the name getitem is undefined.

  native_dropout_default = CallFunction(aten.native_dropout.default, div_Tensor_1, KeywordArg('dropout_p'), True, _users=2)
  getitem = CallFunction(getitem, native_dropout_default, 0)

  This fix resolves the error.

Fixes #144674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144980
Approved by: https://github.com/eellison
2025-02-14 02:30:24 +00:00
0b84311842 [export] Generate printers/parsers for serialization enum values. (#147126)
Summary:
Generate two helper functions for enum classes in generated_serialization_types.h

printEnum: will convert enum values into strings.
parseEnum: will convert strings into enum values.

Test Plan: CI

Differential Revision: D69604850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147126
Approved by: https://github.com/yiming0416
2025-02-14 02:14:35 +00:00
05001f0459 Add Structured Tracing for Traced Graph Edge Details for AC Debugging (#146634)
Summary:
Updating the structured trace infrastructure so that we are able to output to Zoomer and have an E2E solution.

Context Doc: https://docs.google.com/document/d/1T6omIBEWVhbOiwDLSLffgQwjxiT2rQv8QvvQwXkw4fY/edit?usp=sharing

Test Plan:
### Testing Structured Log + tlparse locally

Command:
```
TORCH_TRACE=/data/users/basilwong/fbsource/fbcode/log_torch_trace buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=local_fb_fm_v4 launcher.num_workers=2
```

Torch Trace Logs (local then sent to paste): P1686419449
```
cat log_torch_trace/dedicated_log_torch_trace_rank_0_2lg012xo.log | pastry
P1686419449
```

tlparse output: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpyiv5wj/rank_1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

tlparse graph edge details output: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpyiv5wj/rank_1/9_0_0/joint_graph_information_397.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

Differential Revision: D61557220

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146634
Approved by: https://github.com/jansel, https://github.com/Yuzhen11
2025-02-14 02:04:26 +00:00
486fc12d7e torch: Log a unified waitcounter for torch.compile and triton.autotune (#146723)
Summary: Add a second more generic waitcounter to torch.compile. We'll keep expanding this as new generic pytorch compilation sites show up.

Test Plan: Waitcounter only change, relying on existing tests.

Differential Revision: D69215401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146723
Approved by: https://github.com/davidberard98
2025-02-14 02:04:13 +00:00
f0bdc27f74 Add torch._scaled_mm for CPU (#139975)
This PR is to add `torch._scaled_mm` for CPU backend.

`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various UTs related to FP8 to support CPU tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2025-02-14 02:03:53 +00:00
c5a9e4a6a0 [Inductor][CPP] Fix a CPP GEMM Template output data type issue (#146958)
**Summary**
Issue found when fixing https://github.com/pytorch/ao/issues/1662. A FP32 GEMM with an epilogue node `to_fp16` resulted in [generated code](https://gist.github.com/leslie-fang-intel/464fb112abdb105818ae09b057350e84), which failed to compile. The root cause is that we used the slice of global buffer `Y` as the output of micro GEMM instead of a `local buffer`. However, due to the `to_fp16` epilogue node, the global buffer `Y` has a float16 data type, leading to the failure. This fix will ensure the use of a local buffer in such cases.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_linear_to_lowp_fp
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146958
Approved by: https://github.com/jgong5
2025-02-14 01:40:08 +00:00
d3524ecdd6 [Break XPU] Align meta calculation for fft_r2c with _fft_r2c_mkl (#146763)
Fix #146761
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146763
Approved by: https://github.com/jansel
ghstack dependencies: #146762, #145248, #146880
2025-02-14 01:39:18 +00:00
ade5af9430 [XPU] Align XPU convolution_backward output layout between fake tensor and real output tensor. (#146880)
Fix #146879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146880
Approved by: https://github.com/EikanWang, https://github.com/jansel
ghstack dependencies: #146762, #145248
2025-02-14 01:39:18 +00:00
9befdf565a [Break XPU][Inductor UT] Set input tensors to corresponding device for test case in test_aot_indutor.py (#145248)
Fix #145247

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145248
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #146762
2025-02-14 01:39:11 +00:00
972e927134 [Break XPU][Inductor UT] Fix XPU Inductor UT failures introduced from community. (#146762)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146762
Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/jansel
2025-02-14 01:38:50 +00:00
6419076db9 [torch][amdsmi] Look for amdsmi in ROCM_HOME/ROCM_PATH before using rpath (#147117)
Summary: ROCm uses ROCM_HOME/ROCM_PATH to specify which version of rocm the user wants to use. This is especially important in multi-version setups. Let's respect that behavior when loading amdsmi.
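
A rough sketch of the lookup order described above, in Python for illustration only. The function name and the choice to consult `ROCM_HOME` before `ROCM_PATH` are assumptions; the commit message does not state the precedence between the two variables.

```python
import os
from typing import List, Mapping, Optional


def rocm_search_roots(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return user-selected ROCm roots to try before any rpath-based lookup."""
    env = os.environ if env is None else env
    roots: List[str] = []
    # Assumed precedence: ROCM_HOME first, then ROCM_PATH.
    for var in ("ROCM_HOME", "ROCM_PATH"):
        value = env.get(var)
        if value and value not in roots:
            roots.append(value)
    # An empty list means neither variable is set, so the caller falls back to rpath.
    return roots


print(rocm_search_roots({"ROCM_PATH": "/opt/rocm-6.2.1"}))
```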

Test Plan:
CI
```
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL MSCCL_ALGO_DIR=~/2fbsource/third-party/rccl/develop/tools/msccl-algorithms RCCL_MSCCLPP_THRESHOLD=(math '128*1024*1024')  RCCL_MSCCLPP_ENABLE=1 ENABLE_MSCCLPP=1 buck2 run fbcode//mode/opt-amd-gpu -m rocm621 fbcode//accelerators/workloads/microbench:bench_comm -- --shape moe_17b --comm_algo nccl_allreduce
```

Differential Revision: D69597647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147117
Approved by: https://github.com/malfet
2025-02-14 01:11:59 +00:00
20a369aa3a [Intel GPU] Avoid copy when the input of Matmul is broadcasted (#143784)
Avoid a copy when the input of Matmul is 3D and broadcasted on the batch dim. oneDNN supports implicit broadcast semantics, i.e., src can be broadcasted into weight if the corresponding dimension in src is 1 (and vice versa). On Max 1100, timm resmlp_12_224 amp_fp16 inference with bs=128 improves from 42 ms to 13.7 ms with torch.compile and from 57.5 ms to 32 ms in eager mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143784
Approved by: https://github.com/EikanWang
2025-02-14 00:48:07 +00:00
057bcd3a45 [ca] eliminate duplicate getitem graph nodes for shape inputs (#146875)
should reuse existing proxies instead of creating new ones

before: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpL7hmHe/0_-_-_0/compiled_autograd_graph_3.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100
```python
class CompiledAutograd0(torch.nn.Module):
    def forward(self, inputs, sizes, scalars, hooks):
        # No stacktrace found for following nodes
        getitem = inputs[0]
        getitem_1 = inputs[1]
        getitem_2 = inputs[2];  inputs = None
        getitem_3 = sizes[0];  getitem_3 = None
        getitem_4 = sizes[1];  getitem_4 = None
        getitem_5 = sizes[2];  getitem_5 = None
        getitem_6 = sizes[3];  getitem_6 = None
        getitem_7 = sizes[4];  getitem_7 = None
        getitem_8 = sizes[5];  getitem_8 = None
        getitem_9 = sizes[6];  getitem_9 = None
        getitem_10 = sizes[7];  getitem_10 = None
        getitem_11 = sizes[8];  getitem_11 = None
        getitem_12 = sizes[9];  getitem_12 = None
        getitem_13 = sizes[10];  getitem_13 = None
        getitem_14 = sizes[11];  getitem_14 = None
        getitem_15 = sizes[12];  getitem_15 = None
        getitem_16 = sizes[13];  getitem_16 = None
        getitem_17 = sizes[14];  getitem_17 = None
        getitem_18 = sizes[15];  getitem_18 = None
        getitem_19 = sizes[0]
        getitem_20 = sizes[1]
        getitem_21 = sizes[2]
        getitem_22 = sizes[3]
        getitem_23 = sizes[4]
        getitem_24 = sizes[5]
        getitem_25 = sizes[6]
        getitem_26 = sizes[7]
        getitem_27 = sizes[8]
        getitem_28 = sizes[9]
        getitem_29 = sizes[10]
        getitem_30 = sizes[11]
        getitem_31 = sizes[12]
        getitem_32 = sizes[13]
        getitem_33 = sizes[14]
        getitem_34 = sizes[15];  sizes = None
```

after: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpCo5T6B/0_-_-_0/compiled_autograd_graph_1.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100
```python
class CompiledAutograd0(torch.nn.Module):
    def forward(self, inputs, sizes, scalars, hooks):
        # No stacktrace found for following nodes
        getitem = inputs[0]
        getitem_1 = inputs[1]
        getitem_2 = inputs[2];  inputs = None
        getitem_3 = sizes[0]
        getitem_4 = sizes[1]
        getitem_5 = sizes[2]
        getitem_6 = sizes[3]
        getitem_7 = sizes[4]
        getitem_8 = sizes[5]
        getitem_9 = sizes[6]
        getitem_10 = sizes[7]
        getitem_11 = sizes[8]
        getitem_12 = sizes[9]
        getitem_13 = sizes[10]
        getitem_14 = sizes[11]
        getitem_15 = sizes[12]
        getitem_16 = sizes[13]
        getitem_17 = sizes[14]
        getitem_18 = sizes[15];  sizes = None
```
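
The idea of reusing existing proxies can be sketched with a small cache keyed by `(source, index)`. This is an illustrative toy, not the compiled autograd implementation; `GetitemCache` and its attributes are hypothetical names.

```python
class GetitemCache:
    """Hypothetical helper: hands out one node name per (source, index) pair."""

    def __init__(self):
        self._nodes = {}   # (source, index) -> node name
        self.emitted = []  # graph lines produced so far

    def getitem(self, source: str, index: int) -> str:
        key = (source, index)
        if key not in self._nodes:
            # First request for this element: emit a new graph line.
            suffix = f"_{len(self.emitted)}" if self.emitted else ""
            name = "getitem" + suffix
            self._nodes[key] = name
            self.emitted.append(f"{name} = {source}[{index}]")
        # Later requests reuse the existing node instead of emitting a duplicate.
        return self._nodes[key]


cache = GetitemCache()
first = [cache.getitem("sizes", i) for i in range(3)]
second = [cache.getitem("sizes", i) for i in range(3)]
print(first == second, len(cache.emitted))
```

Requesting the same `sizes` elements twice yields the same node names and only three emitted lines, which mirrors the shorter graph in the "after" listing.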

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146875
Approved by: https://github.com/jansel
ghstack dependencies: #146720, #146735
2025-02-13 21:41:33 +00:00
76dacd5fc7 [ca] log graph before reodering passes (#146735)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146735
Approved by: https://github.com/jansel
ghstack dependencies: #146720
2025-02-13 21:41:33 +00:00
cdbf677cdd Remove outdated comment in ATen/mkl/Sparse.h about lack of Windows support (#147125)
Fixes #147124.

* #102604 added support for Intel oneMKL Sparse BLAS APIs so there was an outdated comment left around in the codebase that can now be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147125
Approved by: https://github.com/janeyx99
2025-02-13 21:34:05 +00:00
1f41ceb713 [BE][Ez]: Enable ruff rule banning print in assert (#146615)
Enables a few ruff rules
* Ban print statements within asserts (likely bugs)
* ~Use string for Decimal literal to prevent loss of precision~
* ~Do not use default args for __post__init__ in dataclasses, they likely were meant to go into the factory method, the __init__, or somewhere else. The default values are useless here.~

Wait until ruff upgrade for the last 2
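
A short illustration of why a print inside an assert is likely a bug: `print()` returns `None`, so the assertion message ends up empty. The function names here are made up for the example.

```python
def check_positive_buggy(x):
    # print() runs and returns None, so the AssertionError carries no message.
    assert x > 0, print("x must be positive")


def check_positive_fixed(x):
    # The string itself becomes the assertion message.
    assert x > 0, "x must be positive"


def message_of(check, x):
    # Return the args of the AssertionError raised by check(x), or None if it passes.
    try:
        check(x)
    except AssertionError as err:
        return err.args
    return None


print(message_of(check_positive_buggy, -1))
print(message_of(check_positive_fixed, -1))
```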

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146615
Approved by: https://github.com/jansel
2025-02-13 21:14:00 +00:00
5469e5c556 [export] Minor fix to locals (#146955)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146955
Approved by: https://github.com/bobrenjc93
2025-02-13 20:29:15 +00:00
7b4efb492b [inductor][refactor] Make _compile_file only used for fbcode (#147106)
Summary: _compile_file in codecache.py only handles specific cpp compilation in fbcode. The next step is to consolidate it with cpp_builder.

Test Plan: CI

Differential Revision: D69592025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147106
Approved by: https://github.com/yushangdi
2025-02-13 20:22:31 +00:00
2d3db4509a fix pt2e block wise quantization test (#147035)
Differential Revision: D69559217

https://github.com/pytorch/pytorch/pull/145941 breaks the unit test added for prepare pt2e + block wise quantization. Fixing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147035
Approved by: https://github.com/andrewor14
2025-02-13 19:44:56 +00:00
b0553cee6b [Utilization] post-test-process workflow (#145310)
# Overview
Add reusable workflow to trigger the post-test right after each test job is complete.

Companion PRs to set up the runner permissions:
Add m fleet instances: https://github.com/pytorch-labs/pytorch-gha-infra/pull/595/files
Add to lix fleet: https://github.com/pytorch/ci-infra/pull/322/files

Currently I turn on the debug flag for testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145310
Approved by: https://github.com/huydhn
2025-02-13 18:51:19 +00:00
260b21b8bc [cutlass backend] Do not change dtype of GEMM template (#146877)
I think this is a change in the right direction.

Right now, when we try to find a cutlass gemm, we generate a bunch of gemm templates and filter out those that don't fit. For example, if we are doing a bf16 x bf16 matmul, the gemm template for fp32 x fp32 is generated and filtered out.

However, for the dtype of bias, we would attempt to modify the dtype of the gemm template. I think this is a bad idea, since (1) the usable template is also being generated, and (2) this messes with the configuration name of the template.

I tested this offline. There isn't much difference in performance. However, with instantiation level 2222, I noticed far fewer "C++ compile error" messages. This is probably due to using the right template?

Follow-ups are needed:
1. benchmark and dashboard
2. check our logic for setting alignment

with my change
https://www.internalfb.com/intern/paste/P1729604119/

without my change
https://www.internalfb.com/intern/paste/P1729624806/

Differential Revision: D69085556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146877
Approved by: https://github.com/ColinPeppler
2025-02-13 18:36:16 +00:00
92d448ff62 Add self to CODEOWNERS for fx/proxy.py; warn against adding new node arg types (#147031)
Not sure if there's a better way

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147031
Approved by: https://github.com/StrongerXi
ghstack dependencies: #147016, #147012, #147013
2025-02-13 18:21:21 +00:00
9a883007a2 Revert "Implement cuda graphs implementation of torch.cond and torch.while_loop (#140979)"
This reverts commit c7515da7b00de40942c83dc5856b6daec727e280.

Reverted https://github.com/pytorch/pytorch/pull/140979 on behalf of https://github.com/huydhn due to This change has been reported to break internal code ([comment](https://github.com/pytorch/pytorch/pull/140979#issuecomment-2657361940))
2025-02-13 18:04:26 +00:00
65e8862b9a Revert "[cond] make cond re-dispatch in proxy mode (#146954)"
This reverts commit 2ce6de2415fb6592dd4447ebea334fd12b8c31ea.

Reverted https://github.com/pytorch/pytorch/pull/146954 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I need to revert it to cleanly revert 140979 ([comment](https://github.com/pytorch/pytorch/pull/146954#issuecomment-2657357742))
2025-02-13 18:02:33 +00:00
1f8ff6812d [Fix]: Disable KleidiAI if unsupported gcc/clang compiler is detected (#146836)
Fixes: https://github.com/pytorch/pytorch/issues/146740

Description:
1. KleidiAI officially supports GCC>=11 and Clang>=11. Certain hardware features are tied to compiler version and KleidiAI compilation will fail in such cases.

Change-Id: Ib43d6b5bf66ef5ea48c481a2774801c573ec205c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146836
Approved by: https://github.com/malfet
2025-02-13 17:49:26 +00:00
447a142de2 support input mutations on tangents in compile (#141131)
Fixes https://github.com/pytorch/pytorch/issues/141111. We previously supported mutations on saved activations that happened in the backward. This PR extends the support to tangents

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141131
Approved by: https://github.com/zou3519
2025-02-13 17:48:56 +00:00
7077d0ac8c [DCP] Introduce modules metadata in the storage_meta (#146654)
Summary: Introduce the list of modules in the storage_meta which is shared between the planner and the storage writer. We will use it to let the storage writer know about the modules in the state dict and create module directories in the checkpoint.

Test Plan: UTs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146654
Approved by: https://github.com/MeetVadakkanchery
2025-02-13 17:44:30 +00:00
938209fb6f Revert "Use 2022 as default VC_YEAR for windows builds (#147053)"
This reverts commit 858bc0cea50614d1e190e6991d974ddb0f53fc88.

Reverted https://github.com/pytorch/pytorch/pull/147053 on behalf of https://github.com/atalman due to Broke windows tests ([comment](https://github.com/pytorch/pytorch/pull/147053#issuecomment-2657239501))
2025-02-13 17:09:37 +00:00
683178fabc [cuda] fix printing of num_gpus (#146838)
Previously on machines with less than 8 gpus, the device==7 case would
trigger the assert inside getDeviceProperties, and print `num_gpus=BEL`
which is ascii for 7.
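
For illustration, the difference between printing the index as a number and as a character can be reproduced in Python. This is only an analogy: it assumes the index was emitted as a single character, which the message above implies but does not spell out.

```python
device_index = 7

as_number = str(device_index)  # the intended output: the digit 7
as_char = chr(device_index)    # the character with code point 7

# Code point 7 is the ASCII bell (BEL) control character, written "\a".
print(repr(as_number), repr(as_char), as_char == "\a")
```
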
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146838
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-02-13 15:23:35 +00:00
020232ec9f [Submodule]: Update KleidiAI submodule to v1.3.0 (#146480)
Change-Id: I687255982c72ee7daca438a15b718f07298963cc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146480
Approved by: https://github.com/digantdesai, https://github.com/malfet
2025-02-13 15:23:04 +00:00
df776d64f7 chore: fix typos in error messages in FSDP (#146805)
Fixes two small typos in FSDP error messages

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146805
Approved by: https://github.com/awgu, https://github.com/Skylion007
2025-02-13 15:22:13 +00:00
345f556628 Fix DispatchStub.cpp compilation for gcc 14 (#146512)
Otherwise I get the following error:

```bash

.../intel-xpu-backend-for-triton/pytorch/aten/src/ATen/native/DispatchStub.cpp:152:18: error: no matching function for call to ‘find(std::array<c10::DeviceType, 7>::const_iterator, std::array<c10::DeviceType, 7>::const_iterator, const c10::DeviceType&)’
  152 |     if (std::find(supported_devices.begin(), supported_devices.end(), device_type) == supported_devices.end()) {
      |         ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/include/c++/14/bits/locale_facets.h:48,
                 from /usr/include/c++/14/bits/basic_ios.h:37,
                 from /usr/include/c++/14/ios:46,
                 from /usr/include/c++/14/ostream:40,
                 from .../intel-xpu-backend-for-triton/pytorch/c10/core/DeviceType.h:13,
                 from .../intel-xpu-backend-for-triton/pytorch/aten/src/ATen/native/DispatchStub.h:3,
                 from .../intel-xpu-backend-for-triton/pytorch/aten/src/ATen/native/DispatchStub.cpp:2:
/usr/include/c++/14/bits/streambuf_iterator.h:435:5: note: candidate: ‘template<class _CharT2> typename __gnu_cxx::__enable_if<std::__is_char<_CharT2>::__value, std::istreambuf_iterator<_CharT, std::char_traits<_CharT> > >::__type std::find(istreambuf_iterator<_CharT, char_traits<_CharT> >, istreambuf_iterator<_CharT, char_traits<_CharT> >, const _CharT2&)’
  435 |     find(istreambuf_iterator<_CharT> __first,
      |     ^~~~
/usr/include/c++/14/bits/streambuf_iterator.h:435:5: note:   template argument deduction/substitution failed:
.../intel-xpu-backend-for-triton/pytorch/aten/src/ATen/native/DispatchStub.cpp:152:18: note:   mismatched types ‘std::istreambuf_iterator<_CharT, std::char_traits<_CharT> >’ and ‘const std::array<c10::DeviceType, 7>::value_type*’ {aka ‘const c10::DeviceType*’}
  152 |     if (std::find(supported_devices.begin(), supported_devices.end(), device_type) == supported_devices.end()) {
      |         ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146512
Approved by: https://github.com/Skylion007
2025-02-13 15:21:59 +00:00
7c3b2a29ec [subclass] testing WrapperSubclass respect outer_size, outer_stride (#146897)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146897
Approved by: https://github.com/bdhirsh
2025-02-13 15:21:19 +00:00
e2479d7809 Update slow tests (#146822)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146822
Approved by: https://github.com/pytorchbot
2025-02-13 15:20:58 +00:00
aeabbffe15 Disable test with dynamo for schema gen (#146865)
Fixes https://github.com/pytorch/pytorch/issues/141202.

1. We skip the schema gen tests under dynamo. https://github.com/pytorch/pytorch/issues/141202 fails in a weird way: it claims node is an integer, even though we run isinstance checks [here](https://github.com/pytorch/pytorch/blob/main/torch/_library/utils.py#L234-L241). This is probably dynamo messing with the stacks, and checking fx.Node isn't really what dynamo is designed for.
2. We move some of the legitimate cond tests out of schema gen and put them back into the control flow tests. We also rename _test_export to a lengthier name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146865
Approved by: https://github.com/zou3519
2025-02-13 15:20:52 +00:00
67c4c39b4f [docs] Minor fixes to export and aoti docs (#144513)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144513
Approved by: https://github.com/yushangdi, https://github.com/desertfire
2025-02-13 15:19:35 +00:00
d1997b610f update kineto submodule (#147015)
Fix https://github.com/pytorch/kineto/issues/1032
See https://github.com/pytorch/kineto/pull/1035 for testplan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147015
Approved by: https://github.com/sraikund16, https://github.com/Skylion007
2025-02-13 15:13:18 +00:00
8d94eb1e3b [BE]: Make OrderedSet reversible (#146904)
It's rather trivial to make OrderedSet reversible, so let's do it and unlock that additional functionality for downstream users.
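
A minimal sketch of how a dict-backed ordered set can support `reversed()`. `TinyOrderedSet` is a hypothetical stand-in and is not the OrderedSet class from the PR; it relies on `dict` preserving insertion order and being reversible (Python 3.8+).

```python
from typing import Dict, Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")


class TinyOrderedSet(Generic[T]):
    """Hypothetical ordered set that stores its items as dict keys."""

    def __init__(self, items: Iterable[T] = ()):
        # dict.fromkeys drops duplicates and keeps first-seen order.
        self._dict: Dict[T, None] = dict.fromkeys(items)

    def __iter__(self) -> Iterator[T]:
        return iter(self._dict)

    def __reversed__(self) -> Iterator[T]:
        # Delegating to the dict is all that is needed for reversed() to work.
        return reversed(self._dict)

    def __len__(self) -> int:
        return len(self._dict)

    def __contains__(self, item: object) -> bool:
        return item in self._dict


print(list(reversed(TinyOrderedSet([3, 1, 3, 2]))))
```
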
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146904
Approved by: https://github.com/eellison
2025-02-13 15:11:48 +00:00
858bc0cea5 Use 2022 as default VC_YEAR for windows builds (#147053)
New Windows AMI does not have Visual Studio 2019. Hence use 2022 as default.
See: https://github.com/pytorch/test-infra/pull/6226
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147053
Approved by: https://github.com/huydhn
2025-02-13 14:37:55 +00:00
f95bdf5e6c Make GetCPUAllocatorMaybePinned to be Device-Agnostic (#146687)
----

- Keep cuda first to preserve BC
- Remove cuda first if it is possible to have only one accelerator at a time in the future
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146687
Approved by: https://github.com/ngimel
2025-02-13 13:09:48 +00:00
e21181642f [AOTInductor] Align behavior between CPU and GPU (#145459)
Summary:
(1) Make sure CPU and GPU don't have different implementations or behavior when called from the same path and API. After this PR, the only difference between CPU and GPU should be the hardware they run on.
(2) This PR fixes the issue of memory access when it==constants_map.end()
(3) This PR resolves T179437596

Test Plan: buck2 run mode/dev sigmoid/inference/test:e2e_test_cpu

Differential Revision: D68540744

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145459
Approved by: https://github.com/desertfire, https://github.com/hl475
2025-02-13 09:50:18 +00:00
ca3aabc8e6 [Inductor][CPU] Add a lowering pass for _weight_int4pack_mm_for_cpu (#145250)
**Summary**
It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU.

This PR adds a lowering pass for `torch.ops.aten._weight_int4pack_mm_for_cpu`. This op is used for WoQ int4 in Torchao. The lowering pass is a prerequisite for max-autotune, which is planned to be enabled for this op in subsequent PRs.

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_woq_int4
python test/inductor/test_cpu_cpp_wrapper.py -k test_woq_int4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145250
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
ghstack dependencies: #145245
2025-02-13 08:40:12 +00:00
17d3a69c32 [Intel GPU] fix memory leak in deconv backward (#144385)
Fixes #143807

We need to manage the onednn scratchpad in pytorch; otherwise onednn will always allocate scratchpad memory during primitive execution, which causes a memory leak.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144385
Approved by: https://github.com/liangan1, https://github.com/EikanWang
2025-02-13 07:41:34 +00:00
43496e9b90 [NJT] fix flop counter for SDPA & test (#147032)
Fixes 3 issues:
1. The test wasn't actually testing SDPA: both were checking cuda, and the inputs to SDPA were not transposed.
2. FlopCounterMode has been renamed _FlopCounterMode (and a wrapper named FlopCounterMode has been added)
3. offsets_to_list also needs to ignore the actual offset values if offsets is a meta tensor.

Differential Revision: [D69558785](https://our.internmc.facebook.com/intern/diff/D69558785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147032
Approved by: https://github.com/jbschlosser
2025-02-13 07:14:58 +00:00
tim
b9a22b3f37 bug fix: ensure 4d input in _scaled_dot_product_attention_math_mps (#146623)
This PR addresses an issue in the MPS backend for `_scaled_dot_product_attention_math_mps`: a 3d input like (num_heads, seq_len, query_dim) is not automatically treated as (1, num_heads, seq_len, query_dim), whereas that shape is inferred on cpu and cuda. The fix adds a util function that ensures a 4d shape.

The issue was found in https://github.com/hiyouga/LLaMA-Factory/issues/6835, in [transformers qwen2_vl](1590c66430/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py (L373C14-L373C93)), 3d q/k/v were passed into the sdpa function, which led to an error.

Considering consistency, since this pattern might pop up elsewhere in the transformers codebase, I think it makes more sense to maintain the same intuition across all platforms.
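
The shape handling described above can be sketched on plain shape tuples. This is a hedged illustration only: `ensure_4d_shape` and `restore_shape` are hypothetical names, and the actual util in the PR operates on tensors inside the MPS backend rather than on Python tuples.

```python
from typing import Tuple


def ensure_4d_shape(shape: Tuple[int, ...]) -> Tuple[Tuple[int, ...], bool]:
    """Return a 4d shape plus a flag saying whether a batch dim was added."""
    if len(shape) == 3:
        # (num_heads, seq_len, dim) -> (1, num_heads, seq_len, dim)
        return (1, *shape), True
    if len(shape) == 4:
        return tuple(shape), False
    raise ValueError(f"expected a 3d or 4d shape, got {len(shape)}d")


def restore_shape(shape4d: Tuple[int, ...], was_3d: bool) -> Tuple[int, ...]:
    # Drop the synthetic batch dim so the output rank matches the input rank.
    return tuple(shape4d[1:]) if was_3d else tuple(shape4d)


shape4d, was_3d = ensure_4d_shape((16, 16, 80))
print(shape4d, restore_shape(shape4d, was_3d))
```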

---
reproduce code:
```
import torch
import torch.nn.functional as F

head_num, seq_len, embed_dim = 16, 16, 80
bsz = 1

q = torch.randn(head_num, seq_len, embed_dim)
k = torch.randn(head_num, seq_len, embed_dim)
v = torch.randn(head_num, seq_len, embed_dim)
attention_mask = torch.ones(1, seq_len, seq_len)

oo_cpu = F.scaled_dot_product_attention(
    q.to("cpu"),
    k.to("cpu"),
    v.to("cpu"),
    attention_mask.to("cpu"),
    dropout_p=0.0
)

if torch.backends.mps.is_available():
    oo_mps = F.scaled_dot_product_attention(
        q.to("mps"),
        k.to("mps"),
        v.to("mps"),
        attention_mask.to("mps"),
        dropout_p=0.0
    )
    assert torch.allclose(oo_cpu, oo_mps.to("cpu"), atol=1e-5)
```

error outputs:
```
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniconda/base/envs/torch-dev/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-5169b8d2c5dd>", line 21, in <module>
    oo_mps = F.scaled_dot_product_attention(
IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
```

hardware and envs:
```
torch               2.6.0
apple m3 max
```

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146623
Approved by: https://github.com/malfet

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-13 07:00:51 +00:00
17a808557c [MPS] cholesky ex version (#146799)
PR #145701 didn't have experimental version of cholesky. This PR adds that version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146799
Approved by: https://github.com/malfet
2025-02-13 07:00:21 +00:00
4879f8f919 [TP] Add warning when module is distributed twice (#147006)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147006
Approved by: https://github.com/XilunWu
2025-02-13 06:49:17 +00:00
3e4172d985 [BE][Ez]: Update fmtlib submodule to 11.1.3 (#146985)
This submodule update fixes a bunch of miscellaneous issues with ABI compatibility, compiler warnings, workarounds for older compilers, performance, and edge cases in formatting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146985
Approved by: https://github.com/drisspg
2025-02-13 06:47:11 +00:00
aa20b4b6cf Friendly handle mem_get_info's runtime error message (#146899)
# Motivation
Handle the runtime error message gracefully if the device doesn't support querying the available free memory. See https://github.com/intel/torch-xpu-ops/issues/1352

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146899
Approved by: https://github.com/EikanWang
2025-02-13 06:26:19 +00:00
66fb10fc53 [BE][OpInfo] Introduce generic dtypesIf (#146905)
Use `__setattr__` and `__getattribute__` to wrap existing `dtypesIfXYZ` into it, which will allow for subsequent incremental elimination of those

Also, type annotation for OpInfo is a sham: it claims that `dtypes` and `dtypesIfXYZ` must be of type `_dispatch_dtypes`, but in reality it's converted to set in post init.
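
A minimal sketch of the wrapping technique, assuming a dict-valued `dtypesIf` table keyed by lowercase device name. `OpInfoSketch` is a hypothetical stand-in for OpInfo, and the fallback to `dtypes` for devices without an override is an assumption made for this example.

```python
class OpInfoSketch:
    """Hypothetical stand-in for OpInfo; not the real class."""

    _PREFIX = "dtypesIf"

    def __init__(self, dtypes):
        self.dtypes = set(dtypes)
        # Generic per-device override table, keyed by lowercase device name.
        self.dtypesIf = {}

    def __setattr__(self, name, value):
        prefix = OpInfoSketch._PREFIX
        if name.startswith(prefix) and name != prefix:
            # A legacy write such as `op.dtypesIfCUDA = ...` lands in the table.
            self.dtypesIf[name[len(prefix):].lower()] = set(value)
        else:
            object.__setattr__(self, name, value)

    def __getattribute__(self, name):
        prefix = OpInfoSketch._PREFIX
        if name.startswith(prefix) and name != prefix:
            table = object.__getattribute__(self, "dtypesIf")
            default = object.__getattribute__(self, "dtypes")
            # Assumed behavior: devices without an override use the default dtypes.
            return table.get(name[len(prefix):].lower(), default)
        return object.__getattribute__(self, name)


op = OpInfoSketch({"float32"})
op.dtypesIfCUDA = {"float16", "float32"}
print(sorted(op.dtypesIfCUDA), sorted(op.dtypesIfXPU), op.dtypesIf)
```

Existing call sites keep reading and writing `dtypesIfCUDA`-style attributes while the data lives in one table, which is what makes incremental removal of the old names possible.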

Test Plan:
 - Check that `op_db[0].dtypesIfCUDA` and others shows the same values as before, by running the following script
 ```python
from torch.testing._internal.common_methods_invocations import op_db
print({name: getattr(op_db[0], f'dtypesIf{name}') for name in ['CUDA', 'ROCM', 'XPU', 'Hpu']})
```
 - CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146905
Approved by: https://github.com/janeyx99
2025-02-13 05:33:17 +00:00
43eb39d7c8 [executorch hash update] update the pinned executorch hash (#145128)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145128
Approved by: https://github.com/pytorchbot
2025-02-13 05:06:44 +00:00
88d0bb0fee [aoti_debug_printer][BE] explicitly dumping float32, bfloat16, float16 data type (#147020)
Summary:
per request, explicitly dumping the float dtypes for aten tensors in debug printing summary info.

can be useful in identifying issues such as "wrong AOTI Lowering precisions"

Test Plan:
```
 AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCH_LOGS="+inductor, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm
```

Differential Revision: D69547344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147020
Approved by: https://github.com/jingsh, https://github.com/ColinPeppler
2025-02-13 04:41:00 +00:00
2ff3fdfdae [audio hash update] update the pinned audio hash (#146738)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146738
Approved by: https://github.com/pytorchbot
2025-02-13 04:29:46 +00:00
936df4571b Update test_c10d_object_collectives.py with DistributedTestBase class (#145056)
# MOTIVATION
To generalize distributed test cases for non-CUDA devices, we are leveraging the DistributedTestBase class introduced in [PR #138216](https://github.com/pytorch/pytorch/pull/138216). This new class is derived from MultiProcessTestCase and abstracts the creation/deletion of process groups and other functionality for specific devices. In this PR, we extend the scope of these tests to support HPUs.

# CHANGES

Replaced MultiProcessTestCase with the DistributedTestBase class.
Extended test functionality to include support for HPUs.
Utilized instantiate_device_type_tests with targeted attributes to generate device-specific test instances.
Applied the skipIfHPU decorator to skip tests that are not yet compatible with HPU devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145056
Approved by: https://github.com/kwen2501, https://github.com/guangyey
2025-02-13 03:57:59 +00:00
a9598337b7 [Optimus] Include more corner cases in the select cat aten pass (#146662)
Summary: Thanks to Shuai for reporting the bug in the pattern. We found there's a typo in the pass, where we should make sure all the selects will go to the cat node.

Test Plan:
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_select_cat_post_grad

Buck UI: https://www.internalfb.com/buck2/2cd0888e-d803-43a8-8530-d97e6bc281b3
Test UI: https://www.internalfb.com/intern/testinfra/testrun/6192449699305108
Network: Up: 110KiB  Down: 35KiB  (reSessionID-687be0fa-031a-47a0-8780-5ab4cf4bbd94)
Executing actions. Remaining     0/4                                                                              6.6s exec time total
Command: test.     Finished 2 local
Time elapsed: 2:12.0s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D69278487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146662
Approved by: https://github.com/Microve
2025-02-13 03:40:26 +00:00
6ca497a8e5 Replace is_same with is_same_v for concise syntax (#145450)
Replace `std::is_same<T, U>::value` with `std::is_same_v<T, U>` for concise syntax that is consistent with other code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145450
Approved by: https://github.com/huydhn
2025-02-13 03:29:39 +00:00
c159723c39 Fix meta impl for topk (#147017)
Topk in this context is always size-like so we should use torch._check_is_size. Fixes some issue in https://github.com/pytorch/pytorch/issues/146990

Differential Revision: [D69545983](https://our.internmc.facebook.com/intern/diff/D69545983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147017
Approved by: https://github.com/ydwu4
2025-02-13 03:18:47 +00:00
821422018a [FlexAttention] Make zero_length sequence handling better (#147010)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147010
Approved by: https://github.com/Chillee
2025-02-13 03:18:24 +00:00
54e28b2a71 [BE] Turn nextafter into functor (#147018)
This functor is a bit more involved, as nextafter is missing on macOS 13
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147018
Approved by: https://github.com/dcci
ghstack dependencies: #146965, #146993, #147023
2025-02-13 02:10:29 +00:00
aaa46c0625 Add missing autoreleasepool around runUniqueGraph to prevent leaks (#145512)
References were held onto longer than needed. Added autoreleasepool around the runUniqueGraph to allow the memory to be freed.

Fixes #145151

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145512
Approved by: https://github.com/malfet

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-13 01:58:18 +00:00
e0ca041ae3 [BE] Toward Metal Iterator (step 2) (#147023)
Add a dense flavor of the binary ops: if the iterator is contiguous, do not build indices but instead run a different kernel flavor that uses the same functor. This results in an almost 100% perf gain (for float16 and bfloat16; float32 improves less) for `torch.fmax` on 1M elements, as one can see from the table below, collected on an M4 Pro Mini using the following benchmarking script:
```python
import torch

from timeit import default_timer
from itertools import product
from torch.utils.benchmark import Measurement, Timer

def bench_binary(
    n,
    binary_func,
    dtype=torch.float32,
) -> Measurement:
    t = Timer(
        stmt=f"f(x, y);f(x, y); f(x, y); torch.mps.synchronize()",
        setup=f"x, y=torch.rand((2, {n}), dtype={dtype}, device='mps').unbind(0)",
        globals = {'f': binary_func},
        language="python", timer=default_timer
    )
    return t.blocked_autorange()

if __name__ == "__main__":
    n = 1024**2
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        eager_t = bench_binary(n, torch.fmax, dtype)
        use_msec = eager_t.mean > 1e-4
        multiplier = 1e3 if use_msec else 1e6
        uname = "msec" if use_msec else "usec"
        print(f"torch.fmax()x3 {str(dtype):>14} {eager_t.mean*multiplier:>7.2f} {uname}")
```

| Dtype    | Time before | Time after |
| -------- | ----------- | ---------- |
| float32  | 0.84 msec   | 0.66 msec  |
| float16  | 0.49 msec   | 0.23 msec  |
| bfloat16 | 0.48 msec   | 0.22 msec  |
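
As a quick sanity check on the table, the relative gain can be computed directly from the reported timings. This is only arithmetic on the numbers above; the helper name `speedup` is introduced here for illustration and is not part of the PR.

```python
# Timings in msec copied from the table above: (before, after).
timings_msec = {
    "float32": (0.84, 0.66),
    "float16": (0.49, 0.23),
    "bfloat16": (0.48, 0.22),
}


def speedup(before: float, after: float) -> float:
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before / after


for dtype, (before, after) in timings_msec.items():
    ratio = speedup(before, after)
    gain_pct = (ratio - 1.0) * 100.0
    print(f"{dtype:>8}: {ratio:.2f}x ({gain_pct:.0f}% faster)")
```

The half-precision dtypes come out a little over 2x, while float32 gains noticeably less.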

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147023
Approved by: https://github.com/dcci
ghstack dependencies: #146965, #146993
2025-02-13 01:50:43 +00:00
80f146dedf Update addbmm, addmm, addmv and baddbmm description (#146689)
Fixes #146611, following #146482

## Test Result

![image](https://github.com/user-attachments/assets/5c1749be-1f10-4e80-a284-b1929ca340eb)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146689
Approved by: https://github.com/mikaylagawarecki
2025-02-13 01:30:50 +00:00
5dab0aeef0 [SkipFiles] Some more cleanup (#147013)
This isn't a no-op but I think it's fine. It changes the case where a
function f1 in a module in MOD_SKIPFILES calls a function f2 in one of
the deleted modules. Previously f2 would have been skipped, now f2 gets
inlined.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147013
Approved by: https://github.com/yanboliang
ghstack dependencies: #147016, #147012
2025-02-13 01:18:47 +00:00
fddaa2958b [SkipFiles] Some more cleanup (#147012)
I think these are all no-ops.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147012
Approved by: https://github.com/yanboliang
ghstack dependencies: #147016
2025-02-13 01:18:47 +00:00
87ebd77b34 Add some more docs to trace_rules.py (#147016)
After discussing with Yanbo, we wanted to write the behavior down so we
don't need to rederive it in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147016
Approved by: https://github.com/yanboliang
2025-02-13 01:18:39 +00:00
b77a6eb184 [dynamo] Fix tensordict regression (#146995)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146995
Approved by: https://github.com/StrongerXi
ghstack dependencies: #146819
2025-02-13 00:59:59 +00:00
2ce6de2415 [cond] make cond re-dispatch in proxy mode (#146954)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146954
Approved by: https://github.com/zou3519
2025-02-13 00:50:33 +00:00
67cbbb29e0 [export] Dedup expression_created logs (#146859)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146859
Approved by: https://github.com/pianpwk
ghstack dependencies: #146532, #146533, #146534, #146858
2025-02-13 00:21:34 +00:00
59bc5d0d71 [tlparse] Add stacktrace filter utility (#146858)
Added a utility function for capturing the user stack and framework stacktrace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146858
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #146532, #146533, #146534
2025-02-13 00:21:34 +00:00
43f5566c92 [export] Add additional tlparse logging (#146534)
Added some additional logging so we can also run tlparse on generic export errors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146534
Approved by: https://github.com/pianpwk
ghstack dependencies: #146532, #146533
2025-02-13 00:21:34 +00:00
b4bdbce1ac [export] Use custom stream logger in draft-export (#146533)
Using a custom logger so that we can store our own buffer to dedup logs that look the same. The schema for deduping is as follows:

```python
        if key == "missing_fake_kernel":
            return hash((key, data["op"]))  # Same ops get deduped
        elif key == "mismatched_fake_kernel":
            return hash((key, data["op"], data["reason"]))  # Same op and reason for errors get deduped
        elif key == "propagate_real_tensors":
            return hash((key, json.dumps(data["stack"])))  # Guards appearing on the same stacktrace get deduped
        elif key == "create_unbacked_symbol":
            return hash((key, json.dumps(data["stack"])))  # Unbacked symbols appearing on the same stacktrace get deduped
```

Notably, guards appearing on the same stacktrace get deduped. This is because there are some cases in PT2I models where a piece of code which creates a new unbacked symint + runs into a DDE gets called 800 times, causing 800 new symints to be created, and 800 propagate_real_tensor errors that are all the same expression. This is hard to look at, so we should just deduplicate this.

The downside is that if multiple DDEs exist on the same stacktrace, we will only show the first one.
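
A minimal sketch of how such a dedup buffer could behave, using only the standard library. The `dedup_key` function mirrors the schema quoted above; the `DedupBuffer` class and its `add` method are hypothetical names for illustration and are not the logger this PR adds.

```python
import json


def dedup_key(key: str, data: dict):
    """Mirror the hashing schema above; return None for keys that are never deduped."""
    if key == "missing_fake_kernel":
        return hash((key, data["op"]))
    if key == "mismatched_fake_kernel":
        return hash((key, data["op"], data["reason"]))
    if key in ("propagate_real_tensors", "create_unbacked_symbol"):
        return hash((key, json.dumps(data["stack"])))
    return None


class DedupBuffer:
    """Hypothetical buffer that keeps only the first record per dedup key."""

    def __init__(self):
        self._seen = set()
        self.records = []

    def add(self, key: str, data: dict) -> bool:
        k = dedup_key(key, data)
        if k is not None and k in self._seen:
            return False  # an equivalent record was already stored
        if k is not None:
            self._seen.add(k)
        self.records.append((key, data))
        return True


buf = DedupBuffer()
stack = [{"file": "model.py", "line": 10}]  # made-up stack frame
for i in range(800):
    # Same stacktrace every time, so only the first record survives.
    buf.add("propagate_real_tensors", {"stack": stack, "expr": f"u{i} > 0"})
print(len(buf.records))
```

Because the key for `propagate_real_tensors` ignores the expression, records with different expressions on the same stack also collapse into one, which is the trade-off described above.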
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146533
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #146532
2025-02-13 00:21:34 +00:00
be387f57b1 [symbolic shapes] Log SymNode id for provenance (#146532)
We can use the SymNode id to point us back to how previous expressions were created, and construct this nice tree in tlparse:
<img width="761" alt="image" src="https://github.com/user-attachments/assets/531b03e8-4398-4d0a-bd11-16078256041c" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146532
Approved by: https://github.com/bobrenjc93
2025-02-13 00:21:34 +00:00
21c2565f35 Document dynamo (#146736)
Many files in dynamo are currently lacking file/module-level documentation, which makes it hard to know what they do at a glance and without digging into the code. This fixes that.

Note: documentation was AI-generated and could be incorrect, please review carefully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146736
Approved by: https://github.com/jansel, https://github.com/StrongerXi, https://github.com/anijain2305, https://github.com/zou3519
2025-02-13 00:02:21 +00:00
0344bf8a5a [cuDNN] cuDNN to 9.7.1.26 for CUDA 12.8 (#146957)
rebasing for https://github.com/pytorch/pytorch/pull/146717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146957
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/nWEIdia, https://github.com/atalman

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-12 23:43:34 +00:00
d5a2e4c754 [oncall] Change error message to be more readable (#146934)
Summary:
During oncall, I had to debug an issue where the error message was somewhat ambiguous due to multiple colons and the full line being cut off:
```
AssertionError: Expected order: 1 for the component: remote_request_only to be >= 2, the max order for all its
```

Update the error message to something like
```
AssertionError: Component remote_request_only order must be >= max order of its upstream components, got component order=1 and max=2
```
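
A sketch of a check that would produce a message in the new format. The function name `check_component_order` and its parameters are assumptions made for illustration; the actual assertion site is not shown in this commit message.

```python
def check_component_order(name: str, order: int, upstream_orders: list) -> None:
    """Hypothetical check: a component must not be ordered before its upstreams."""
    max_upstream = max(upstream_orders, default=order)
    assert order >= max_upstream, (
        f"Component {name} order must be >= max order of its upstream components, "
        f"got component order={order} and max={max_upstream}"
    )


try:
    check_component_order("remote_request_only", 1, [0, 2])
except AssertionError as e:
    print(f"AssertionError: {e}")
```

Putting both numbers at the end of a single sentence avoids the multiple colons and keeps the key values visible even if the line is truncated earlier.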

Test Plan: CI

Differential Revision: D69482789

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146934
Approved by: https://github.com/ColinPeppler
2025-02-12 23:33:09 +00:00
ad4e5bf705 cpp_wrapper: handle mixed-device C-shim fallbacks (#146449)
Fixes an error from test_torch, where a CUDA cpp_wrapper run called a CUDA native C-shim kernel with two CPU tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146449
Approved by: https://github.com/desertfire
2025-02-12 23:21:04 +00:00
076215944a Turn on autograd local caches in fbcode (#146996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146996
Approved by: https://github.com/jamesjwu
2025-02-12 23:04:39 +00:00
c60f587c04 Fix shape_inference for V-schedules (#147000)
I was hitting a hang in shape_inference when testing v-shaped schedules with >2 ranks in titan.

`self.next_rank` and `self.prev_rank` are used in shape inference but are not accurate for v-shaped schedules:
bfcce6984b/torch/distributed/pipelining/stage.py (L1325-L1326)

Will clean up / delete the use of `next_rank` / `prev_rank` in follow-up PRs.
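
To illustrate why a ring-style neighbor computation breaks down for V-shaped placement, here is a small sketch. It assumes that `next_rank` is derived as `(rank + 1) % world_size` and that a V-shaped schedule places stages on ranks going forward and then backward; both are assumptions for illustration, and `stage_to_rank_v` is a hypothetical helper, not the pipelining API.

```python
def stage_to_rank_v(stage: int, world_size: int) -> int:
    """Hypothetical V-shaped placement: walk ranks forward, then backward."""
    pos = stage % (2 * world_size)
    return pos if pos < world_size else 2 * world_size - 1 - pos


world_size = 4
num_stages = 8
for stage in range(num_stages - 1):
    rank = stage_to_rank_v(stage, world_size)
    actual_next = stage_to_rank_v(stage + 1, world_size)
    assumed_next = (rank + 1) % world_size  # ring-style assumption
    status = "ok" if actual_next == assumed_next else "MISMATCH"
    print(f"stage {stage} on rank {rank}: next stage on rank {actual_next}, "
          f"ring assumption says {assumed_next} ({status})")
```

Under these assumptions the ring neighbor is only correct on the forward leg of the V; from the turning point onward a rank would wait on the wrong peer, which is consistent with a hang.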

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147000
Approved by: https://github.com/wconstab
2025-02-12 22:56:46 +00:00
f954aac6be Add make_dynamo_test (#146491)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146491
Approved by: https://github.com/zou3519, https://github.com/anijain2305, https://github.com/malfet
2025-02-12 22:54:29 +00:00
fd21126007 [ONNX] Deprecation message follow up (#147005)
Follow up on https://github.com/pytorch/pytorch/pull/146923 to address comments.

This pull request includes updates to the `torch/onnx` module, focusing on deprecations and documentation improvements. The most important changes involve moving version change notes within the `export` function, updating deprecation messages, and removing example code in the `dynamo_export` function.

Documentation and Deprecation Updates:

* [`torch/onnx/__init__.py`](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553L172-L184): Moved version change notes to the correct location within the `export` function's docstring. Updated the deprecation note for the `dynamo_export` function to version 2.7 and removed example code from its docstring. [[1]](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553L172-L184) [[2]](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553R349-R357) [[3]](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553L434-R430) [[4]](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553L445-L475)

* [`torch/onnx/utils.py`](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL111-R114): Enhanced deprecation messages for several functions (`select_model_mode_for_export`, `disable_apex_o2_state_dict_hook`, `setup_onnx_logging`, `unconvertible_ops`) to provide clearer guidance on their removal and suggest copying logic if needed. [[1]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL111-R114) [[2]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL148-R151) [[3]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL166-R173) [[4]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL1180-R1189) [[5]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL1190-R1199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147005
Approved by: https://github.com/titaiwangms
2025-02-12 22:48:56 +00:00
f655f840b8 [ONNX][dort] Remove reference to onnxscript rewriter (#147003)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147003
Approved by: https://github.com/titaiwangms, https://github.com/gramalingam, https://github.com/shubhambhokare1
2025-02-12 22:02:07 +00:00
995f607c74 fix doc string (#146968)
Fixes a wrong function name in doc string

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146968
Approved by: https://github.com/zackycao, https://github.com/H-Huang
2025-02-12 21:43:16 +00:00
06a07f6018 [BE] Towards MetalTensorIterator (#146993)
Further refactor binary kernels to replace individual implementation with a binary_indexing_kernel template that takes functors that implement the logic.

According to godbolt, such refactoring should have no impact on performance: through dead code elimination the compiler should simply replace the functor with a direct call to the underlying function, as one can see for the clang CPU compiler here: https://godbolt.org/z/8dxv5jvz7. But to be on the safe side, the following benchmark was run:
```python
import torch

from timeit import default_timer
from itertools import product
from torch.utils.benchmark import Measurement, Timer

def bench_binary(
    n,
    binary_func,
    dtype=torch.float32,
) -> Measurement:
    t = Timer(
        stmt=f"f(x, y);f(x, y); f(x, y); torch.mps.synchronize()",
        setup=f"x, y=torch.rand((2, {n}), dtype={dtype}, device='mps').unbind(0)",
        globals = {'f': binary_func},
        language="python", timer=default_timer
    )
    return t.blocked_autorange()

if __name__ == "__main__":
    n = 1024**2
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        eager_t = bench_binary(n, torch.fmax, dtype)
        use_msec = eager_t.mean > 1e-4
        multiplier = 1e3 if use_msec else 1e6
        uname = "msec" if use_msec else "usec"
        print(f"torch.fmax()x3 {str(dtype):>14} {eager_t.mean*multiplier:>7.2f} {uname}")
```

That reports roughly identical before and after times (1 msec for float32 and .5 msec for float16)

Another interesting quirk: functors cannot be in an anonymous namespace, otherwise they will not be visible from the library, as one can see by running the following Swift sample (filed FB16490467 to clarify whether this is supported):
```swift
let shader_source = """
struct add_functor {
  template <typename T>
  inline T operator()(const T a, const T b) {
    return static_cast<T>(a + b);
  }
};

namespace {
struct sub_functor {
  template <typename T>
  inline T operator()(const T a, const T b) {
    return static_cast<T>(a - b);
  }
};
} // anonymous namespace

template <typename T, typename F>
kernel void binary_executor(
    constant T* input [[buffer(0)]],
    constant T* other [[buffer(1)]],
    device T* out [[buffer(2)]],
    uint tid [[thread_position_in_grid]]) {
  F f;
  out[tid] = f(input[tid], other[tid]);
}

template
[[host_name("add_float")]] kernel void binary_executor<float, add_functor>(constant float*, constant float *, device float*, uint);

template
[[host_name("sub_float")]] kernel void binary_executor<float, sub_functor>(constant float*, constant float *, device float*, uint);
"""

import Metal
guard let device = MTLCopyAllDevices().first else { fatalError("No Metal device found") }
let library = try! device.makeLibrary(source:shader_source, options:MTLCompileOptions())

// Expect two kernels to be printed, but see only one, with functor in global namespace
for kernel_name in library.functionNames {
  print(kernel_name)
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146993
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #146965
2025-02-12 21:40:40 +00:00
de964b9f8b dont specialize symints when testing truthiness (#146731)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146731
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #146642, #146729
2025-02-12 20:57:10 +00:00
5cda021cac support meta_tensor.to(device='cpu') under fake_mode (#146729)
Fixing this is actually a bit annoying:

(1) FakeTensorMode sees a function where all of its inputs are real tensors, so it tries to run the real compute before converting the output to a FakeTensor

(2) we don't actually want this, because the "real compute" is supposed to error normally when you do `meta_tensor.to(device='cpu')`. Instead, we want FakeTensor to actually skip constant prop and run the normal FakeTensor implementation, which will not error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146729
Approved by: https://github.com/zou3519, https://github.com/SherlockNoMad, https://github.com/albanD
ghstack dependencies: #146642
2025-02-12 20:57:10 +00:00
ec0b318ddb [poc] force UntypedStorage.from_buffer(buf) to return meta storage under FakeTensorMode (#146642)
context here: https://fb.workplace.com/groups/326136610199609/permalink/495389539940981/

This PR is an attempt to make it such that if you create a tensor from an external buffer (using `UntypedStorage.from_buffer(buf)`), we can generate a proper fake tensor for you out of the box.

The annoying bit is that there are not any dispatcher ops to interpose on and change behavior. So instead, I took the manual C binding and tweaked the storage device to be "meta" if we see an active fake mode.

Put "poc" in the title since I... think this is hopefully reasonable, but I can be convinced that it's not :)

```
from torch._subclasses.fake_tensor import FakeTensorMode
import pickle
import io
import torch
from contextlib import nullcontext

use_fake_tensor = True
with FakeTensorMode() if use_fake_tensor else nullcontext():
    obj = [1, 2]
    f = io.BytesIO()
    pickle.Pickler(f).dump(obj)
    byte_storage = torch.ByteStorage._from_buffer(f.getvalue())  # type: ignore[attr-defined]

    t = torch.ByteTensor(byte_storage)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146642
Approved by: https://github.com/zou3519
2025-02-12 20:57:10 +00:00
8a975cb247 Revert "[cutlass backend] Do not change dtype of GEMM template (#146877)"
This reverts commit 5f2714d5e7cded0eb553d5915002e03c22e01e34.

Reverted https://github.com/pytorch/pytorch/pull/146877 on behalf of https://github.com/henrylhtsang due to mistake on logging ([comment](https://github.com/pytorch/pytorch/pull/146877#issuecomment-2654648949))
2025-02-12 19:26:45 +00:00
0de27ee7e0 Let _create_cpu_state_dict and _copy_state_dict support DTensor (#146852)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146852
Approved by: https://github.com/d4l3k
2025-02-12 18:43:52 +00:00
352484cc83 [BE] Unify kernel templates instantiation (#146965)
By defining a `REGISTER_BINARY_OP` template that can be used to register fmin, fmax, etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146965
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-02-12 18:40:45 +00:00
7f62616a58 [ONNX][reland2] Create deprecation warning on dynamo_export (#146923)
Reland two PRs
- https://github.com/pytorch/pytorch/pull/146425
- https://github.com/pytorch/pytorch/pull/146639

Fixed by removing the deprecation warning on a base class `ExportOptions`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146923
Approved by: https://github.com/titaiwangms
2025-02-12 18:28:37 +00:00
5f2714d5e7 [cutlass backend] Do not change dtype of GEMM template (#146877)
I think this is a change in the right direction.

Right now, when we try to find a cutlass gemm, we generate a bunch of gemm templates and filter out those that don't fit. For example, if we are doing bf16 x bf16 matmul, the gemm template for fp32 x fp32 is generated and filtered out.

However, for the dtype of bias, we would attempt to modify the dtype of the gemm template. I think this is a bad idea, since (1) the usable template is also being generated, and (2) this messes with the configuration name of the template.

I tested this offline. There isn't much difference in performance. However, with instantiation level 2222, I noticed far fewer "C++ compile error" occurrences. This is probably due to using the right template?

Follow-ups are needed:
1. benchmark and dashboard
2. check our logic for setting alignment

with my change
https://www.internalfb.com/intern/paste/P1729604119/

without my change
https://www.internalfb.com/intern/paste/P1729624806/

Differential Revision: D69085556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146877
Approved by: https://github.com/ColinPeppler
2025-02-12 18:16:49 +00:00
bfcce6984b [ROCm][TunableOp] Close offline tuning results file when offline tuning is disabled. (#146574)
This PR fixes a unit test (UT) breakage that was reported internally and is considered high priority. When `tunable.record_untuned_enable(False)` is invoked, we now flush the results of the untuned GEMM file.

Offline tuning I/O currently has no member function for setting the untuned results filename or for writing untuned results to a file. When unit tests run back to back, the same ofstream ends up being reused between them. Due to the way the UTs are executed, this can lead to unexpected failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146574
Approved by: https://github.com/jeffdaily
2025-02-12 18:03:06 +00:00
04011304e5 Update dynamo expected 20250210 (#146856)
Update all the ci accuracy expect values to make trunk green.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146856
Approved by: https://github.com/yanboliang
2025-02-12 18:01:20 +00:00
d6513f3246 [dynamo] Support list subclasses and fix dict subclasses mutation bugs (#146819)
This PR adds support for list subclasses. Among other things, it includes:

1) Tracking the mutations on internal vts like `_dict_vt` and `_list_vt` using sources. This helps identify if there was a mutation in the underlying data structures, and we need to reconstruct it.
2) `UserDefinedObjectVariable` now has a new method - `is_modified` which `side_effect` infra relies upon to check mutations in the underlying vts (like `_dict_vt`).
3) `reconstruction` logic ensures that we use `dict.__getitem__` and `list.__getitem__` methods. This is super important because we don't want to call the overridden `__getitem__` methods.
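
The point about bypassing overridden `__getitem__` methods can be seen in plain Python, independent of Dynamo. The `LoudDict` and `LoudList` classes below are made-up examples for illustration.

```python
class LoudDict(dict):
    def __getitem__(self, key):
        # User override that changes what item access returns.
        return f"overridden:{dict.__getitem__(self, key)}"


class LoudList(list):
    def __getitem__(self, index):
        # User override that negates the stored value.
        return -list.__getitem__(self, index)


d = LoudDict(a=1)
items = LoudList([10, 20])

print(d["a"])                       # goes through the override
print(dict.__getitem__(d, "a"))     # reads the underlying dict storage
print(items[0])                     # goes through the override
print(list.__getitem__(items, 0))   # reads the underlying list storage
```

Calling the base-class method directly returns the stored value without running user code, which is what reconstruction of the underlying data needs.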

If this PR is hard to review, please let me know. I can break it into several small PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146819
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-02-12 17:46:02 +00:00
6c81435f16 [ONNX] Update CI transformers cache (#146926)
The cached models are outdated because the related tests are all deleted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146926
Approved by: https://github.com/justinchuby
2025-02-12 17:02:43 +00:00
b894c2824b [ONNX] Support custom axis name through dynamic_shapes (#146321)
Fixes #143443

This PR aims to support custom dynamic axis naming through dynamic_shapes. Currently, _Dim and _DimHint do not support dynamic axis naming (#144273).

1. **the original dynamic shapes guarantee**
The axis renaming is only applied when dynamic shapes include string instead of all _Dim and _DimHint. Thus, there will not be any inconsistent behavior to dynamic_shapes with torch.export.export if the given dynamic shapes follow torch.export.export format.
2. _DimHint.AUTO is applied to the axes that are specified with custom names to avoid exporter crash. (_DimHint.DYNAMIC crashes when the export fails.)
3.  There's no need to handle cases where kwargs are out of order with the model signature,
    as torch.export.export supports dynamism only when kwargs and dynamic_shapes are provided in order.
    49082f9dba/torch/export/_trace.py (L2034)
4. If `torch.onnx.ExportedProgram` finds the axes share the same constraints, they will have the same name (e.g. s0, s1, ...). Therefore, even if the ONNX users specify them with different custom names, they won't be respected.

Example model:
```python
        class NestedModel(torch.nn.Module):
            def forward(
                self,
                x: torch.Tensor,
                ys: list[torch.Tensor],
                zs: dict[str, torch.Tensor],
                c: torch.Tensor,
            ):
                y = ys[0] + ys[1] + zs["a"] + zs["b"]
                w = 5
                if x.shape[0] < 3 and c.shape[0] != 4:
                    return x + w, x + y, c
                else:
                    return x - w, x - y, c

        input = (
            torch.ones(5),
            [torch.zeros(5), torch.ones(5)],
            {"a": torch.zeros(5), "b": torch.ones(5)},
            torch.ones(6),
        )

        dynamic_shapes = (
            {0: torch.export.Dim("dim_x", min=3)},  # _Dim
            [("custom_name_axis_ys_0",), (torch.export.Dim.AUTO,)],  # custom name
            {
                "a": {0: torch.export.Dim.AUTO},
                "b": ("custom_name_axis_zs_b_0",),
            },  # _DimHint
            {0: "custom_name_axis_c_0"},  # custom name
        )

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146321
Approved by: https://github.com/justinchuby
2025-02-12 17:00:03 +00:00
9abaaad6a8 [pytree][Easy] preserve dict keys in insertion order in CXX pytree (#130140)
`optree` and JAX pytree traverse the `dict` in sorted key order (see [Key Ordering for Dictionaries](https://github.com/metaopt/optree#key-ordering-for-dictionaries)), while PyTorch's Python pytree traverses the `dict` in insertion order. See also:

- #114392

This aligns the behavior of CXX pytree with Python pytree.
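
The difference between the two orderings can be illustrated with plain dictionaries. This sketch does not call `optree` or PyTorch's pytree; the two `flatten_*` helpers are hypothetical stand-ins that only show the leaf order each convention produces.

```python
def flatten_insertion_order(d: dict) -> list:
    """Leaves in the order keys were inserted (the Python pytree convention above)."""
    return [d[k] for k in d]


def flatten_sorted_order(d: dict) -> list:
    """Leaves in sorted key order (the optree / JAX convention above)."""
    return [d[k] for k in sorted(d)]


tree = {"b": 2, "a": 1, "c": 3}
print(flatten_insertion_order(tree))  # [2, 1, 3]
print(flatten_sorted_order(tree))     # [1, 2, 3]
```

Two implementations that disagree on this order would hand the same leaves to downstream code in different positions, which is why aligning them matters.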

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130140
Approved by: https://github.com/zou3519
2025-02-12 16:41:49 +00:00
1f8ff94d4f PEP585: Add noqa to necessary tests (#146391)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146391
Approved by: https://github.com/justinchuby, https://github.com/Skylion007
2025-02-12 15:29:50 +00:00
b61032fcf7 [BE][Ez]: Remove unnecessary type ignores from orderedset (#146902)
After #145783, we can remove some type ignores from the ordered set class
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146902
Approved by: https://github.com/eellison
2025-02-12 15:00:13 +00:00
ce80865f13 Revert "Replace is_same with is_same_v for concise syntax (#145450)"
This reverts commit 5205158c1b0bc5c390b2a9d83fe3b2ec5edbe3f2.

Reverted https://github.com/pytorch/pytorch/pull/145450 on behalf of https://github.com/jeanschmidt due to testing to see if reverting would fix timeout in inductor jobs ([comment](https://github.com/pytorch/pytorch/pull/145450#issuecomment-2653645466))
2025-02-12 13:01:32 +00:00
b0042286d4 [Dynamo] Allow dynamo to handle str.xxx() (#146587)
Fixes #146350

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146587
Approved by: https://github.com/zou3519
2025-02-12 08:54:10 +00:00
98e16012ec [Quant][CPU] add a wrapper op for _weight_int4pack_mm_for_cpu with tensor args (#145245)
**Summary**
It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU.

This PR adds a wrapper op, whose arguments are all tensors, in the `quantized` namespace for `torch.ops.aten._weight_int4pack_mm_for_cpu`. It will be used in Inductor lowering with max-autotune, where scalar arguments are difficult to handle.
The new op is not registered to
- `aten` because it will require changing `native_functions.yaml`, which is not recommended.
- `quantized_decomposed` because it will only have a Python implementation, which cannot be used for cpp wrapper in Inductor.

**Test plan**
```
python test/test_linalg.py -k test__int4_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145245
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2025-02-12 08:46:38 +00:00
ac0f206f3c [dtensor] fix side-effect on dtype for _like ops (#146869)
fixes https://github.com/pytorch/pytorch/issues/146749

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146869
Approved by: https://github.com/yifuwang, https://github.com/janeyx99, https://github.com/ngimel
2025-02-12 08:42:14 +00:00
d774a6333d [StaticRuntime] Support a new pattern for ClipRangesToGatherToOffsets (#146931)
Summary:
Support the following new pattern for ClipRangesToGatherToOffsets:

Before optimization:
```
%18267 : Tensor, %18268 : Tensor = fb::clip_ranges_gather(%int_77.1, %getitem_2484.1, %493)
%getattr_368.1 : int = prim::dtype(%18267)
%to_443.1 : Tensor = aten::to(%18268, %getattr_368.1, %self._maybe_compute_kjt_to_jt_dict.is_weighted, %self._maybe_compute_kjt_to_jt_dict.is_weighted)
%lengths_to_offsets_490.1 : Tensor = fb::lengths_to_offsets(%to_443.1, %8)
```

After optimization:
```
%18297 : int = prim::dtype(%int_77.1)
%18298 : Tensor, %18299 : Tensor = fb::clip_ranges_gather_to_offsets(%int_77.1, %getitem_2484.1, %493, %8, %18297)
```

Reviewed By: garroud

Differential Revision: D69373835

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146931
Approved by: https://github.com/hanyilou123
2025-02-12 08:19:41 +00:00
ae5cc19ba7 [pytorch][cuda] Improve softmax backward pass native CUDA implementation (#145866)
This PR is similar to https://github.com/pytorch/pytorch/pull/122970, but works on the softmax backward pass.

Specifically, it uses shared memory to cache the gradOutput when it can fit in shared memory. Before this PR we were reading gradOutput twice.
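
For context, a plain-Python sketch of the textbook softmax backward formula shows why `gradOutput` is naturally read twice: once for a reduction and once for the final elementwise result. This is not the CUDA kernel from this PR, and the function names are chosen here for illustration.

```python
import math


def softmax(x):
    """Numerically stable softmax over a list of floats."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_backward(grad_output, output):
    """Textbook formula: grad_input_i = y_i * (g_i - sum_j g_j * y_j)."""
    # Pass 1 over grad_output: reduce to a single dot product.
    dot = sum(g * y for g, y in zip(grad_output, output))
    # Pass 2 over grad_output: combine each element with the reduction.
    return [y * (g - dot) for g, y in zip(grad_output, output)]


y = softmax([0.0, 0.0])
grad_input = softmax_backward([1.0, 0.0], y)
print(grad_input)  # [0.25, -0.25]
```

Caching the values read in the first pass (for example in shared memory on a GPU) is one way to avoid fetching them from global memory again in the second pass.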

On my H100 this seems to improve the softmax backward pass performance by about 5% for problem sizes that fit within shared memory. (Note that this is not the only kernel that runs when you call softmax backward pass -- there is an elementwise kernel that runs before this; optimizing that can be a separate PR).

**Important Note**: Currently the softmax backward pass consists of an [element-wise multiply operator](7f65a20884/aten/src/ATen/native/cuda/SoftMax.cu (L1216)), followed by [this function](7f65a20884/aten/src/ATen/native/cuda/SoftMax.cu (L1062)) which calls the `cunn_SoftMaxBackward` kernel. With my change the kernel time reduces by about 12% (see screenshot below), while the total time (including the elementwise) reduces by about 5%.
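
For readers unfamiliar with the math, here is a minimal pure-Python sketch of the two-step backward pass described above, for a single row. It assumes the standard softmax gradient formula; the function names are hypothetical, and the split between the elementwise step and the reduction is an illustration of the description, not a transcription of `SoftMax.cu`.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def softmax_backward_row(grad, out):
    """Softmax backward for one row, written as two passes.

    Step 0 (the separate elementwise operator): tmp = grad * out.
    Pass 1 reads tmp to reduce it to a scalar sum.
    Pass 2 reads tmp again to produce the result. This second read is
    the one a shared-memory cache of the row would serve instead of DRAM.
    """
    tmp = [g * o for g, o in zip(grad, out)]         # elementwise multiply
    total = sum(tmp)                                 # pass 1: reduction
    return [t - o * total for t, o in zip(tmp, out)] # pass 2: re-read tmp


out = softmax([1.0, 2.0, 3.0])
grad_in = softmax_backward_row([0.1, -0.2, 0.3], out)
```

Because the softmax outputs sum to 1, the entries of `grad_in` sum to zero up to rounding, which is a quick sanity check on the formula.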

```
Baseline						This PR
N	size	FP32 bandwidth	FP16 bandwidth		N	size	FP32 bandwidth	FP16 bandwidth		fp32 diff	fp16 diff
0	256	134.340966	70.042039		0	256	133.70146	70.342753		-0.48%	0.43%
1	512	233.501185	129.945803		1	512	234.057145	132.933066		0.24%	2.30%
2	1024	340.667966	229.280464		2	1024	338.833265	226.441699		-0.54%	-1.24%
3	2048	379.643726	337.452058		3	2048	399.559017	338.432284		5.25%	0.29%
4	4096	416.597537	383.625364		4	4096	428.252403	396.137506		2.80%	3.26%
5	6000	431.198241	384.384384		5	6000	457.744577	406.06275		6.16%	5.64%
6	8192	462.811252	427.292573		6	8192	474.791032	428.281563		2.59%	0.23%
7	10000	464.258731	429.050294		7	10000	483.7643	446.849381		4.20%	4.15%
8	10013	465.199701	429.824179		8	10013	464.904407	428.72184		-0.06%	-0.26%
9	10240	477.07359	428.853737		9	10240	485.317024	444.902586		1.73%	3.74%
10	11000	473.038785	430.778663		10	11000	488.161438	453.462162		3.20%	5.27%
11	12000	474.342475	432.594814		11	12000	490.532418	458.427653		3.41%	5.97%
12	16384	487.468854	473.611576		12	16384	488.154406	476.264631		0.14%	0.56%
13	20000	482.029793	465.666186		13	20000	482.147092	483.886193		0.02%	3.91%
14	24000	478.368093	474.159464		14	24000	478.364948	491.447921		0.00%	3.65%
15	32000	476.523796	473.18868		15	32000	476.523796	474.398962		0.00%	0.26%
16	32768	476.104723	477.493634		16	32768	476.704463	477.330606		0.13%	-0.03%
17	36864	477.900663	475.472787		17	36864	477.973279	475.728454		0.02%	0.05%
18	40960	477.707561	475.559064		18	40960	478.445017	476.088067		0.15%	0.11%
19	45056	479.169812	475.865134		19	45056	479.143266	475.878202		-0.01%	0.00%
20	49152	477.804907	475.382982		20	49152	477.868404	475.976377		0.01%	0.12%
21	65536	481.274125	478.171806		21	65536	481.537733	478.703926		0.05%	0.11%
22	66000	481.64652	480.095457		22	66000	481.856013	480.466388		0.04%	0.08%
23	68608	481.745774	479.034704		23	68608	481.917596	478.856209		0.04%	-0.04%
24	80000	483.409361	480.356529		24	80000	483.330481	480.375277		-0.02%	0.00%
25	98304	480.736301	481.396882		25	98304	480.789858	481.320143		0.01%	-0.02%
```

NCU profiler shows lower DRAM fetches with the new kernel:

![image](https://github.com/user-attachments/assets/f3606725-d8fc-4ea5-ae6d-9c188bf32d72)

NCU reports about 12% elapsed time reduction in this kernel alone compared to baseline (and because of other kernels that are run, the overall backward pass time as seen by the user gets reduced by 5%).

I compared the binary size increase by running `python setup.py develop` before and after and diffing the .so files:

![image](https://github.com/user-attachments/assets/8e6cee2e-3c7a-4fa4-8836-954047ce8ffc)

libtorch_cuda.so goes from 274,752,224 bytes to 274,787,072 bytes. The increase in size is 34kB which is about 0.01%.

I measured the compilation time for incremental development:

```
touch ./aten/src/ATen/native/cuda/SoftMax.cu
time python setup.py develop
real    0m10.083s
user    0m8.197s
sys     0m3.149s
```

Note that this uses `ccache` and does a bunch of copies and is not just measuring the `nvcc` time. I measured the `nvcc` time separately by capturing the `nvcc` command shown in [1] below and running it on the baseline and modified kernels:

```
# baseline nvcc time for SoftMax.cu
real    0m35.341s
user    0m33.801s
sys     0m1.289s

# this PR's nvcc time for SoftMax.cu
real    0m36.513s
user    0m34.722s
sys     0m1.408s
```

So the `nvcc` time increases by about 1 second, or ~3% of the baseline.

[1] `nvcc` command is here:
```
# This is the nvcc command
/usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. 
-I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/torch/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/torch/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda 
-DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/SoftMax.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145866
Approved by: https://github.com/ngimel
2025-02-12 07:54:41 +00:00
8c80c13b34 [CD] Add python 3.13t build for xpu (#146614)
Fixes #146451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146614
Approved by: https://github.com/atalman
2025-02-12 07:01:36 +00:00
b30bad710d Update octokit/request-action to 2.4.0 (#146940)
The current version 2.1.0 has disappeared since yesterday:

* https://github.com/pytorch/pytorch/actions/workflows/upload-torch-dynamo-perf-stats.yml
* https://github.com/pytorch/pytorch/actions/workflows/upload-test-stats.yml

The latest version is 2.4.0 https://github.com/octokit/request-action
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146940
Approved by: https://github.com/izaitsevfb
2025-02-12 05:36:27 +00:00
6105b6f15f Revert "Update octokit/request-action to 2.4.0 (#146940)"
This reverts commit 7aa629f1268f6944eee6e49e43071b4342bf1669.

Reverted https://github.com/pytorch/pytorch/pull/146940 on behalf of https://github.com/huydhn due to This does not work ([comment](https://github.com/pytorch/pytorch/pull/146940#issuecomment-2652691614))
2025-02-12 05:21:43 +00:00
5a1c7c424d Fix standalone runner for CUTLASS auto-tuning backend (#146764)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146764
Approved by: https://github.com/henrylhtsang
ghstack dependencies: #146755
2025-02-12 04:42:08 +00:00
eb655a2d5f Fix CUTLASS 2.x kernels for auto-tuning (#146755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146755
Approved by: https://github.com/henrylhtsang
2025-02-12 04:42:07 +00:00
683bb1242c [export][ez] Update tag_ for union setters. (#146912)
Summary: ez fix to set tag for union type fields.

Test Plan: CI

Differential Revision: D69467715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146912
Approved by: https://github.com/yiming0416
2025-02-12 03:52:36 +00:00
06f8f9a017 Update instructions about faster linker (#146750)
This PR adds instructions for specifying the linker via the cmake env `CMAKE_LINKER_TYPE` and also adds `mold` as a linker alternative.

Since version 3.29, cmake provides [`CMAKE_LINKER_TYPE`](https://cmake.org/cmake/help/latest/variable/CMAKE_LINKER_TYPE.html), which selects the linker without overwriting the `ld` file or changing the build script.

`mold` is already stable and **the fastest** (afaict) linker out there, and is also easier to install than `lld`, so I added it here. After switching to `mold`, the time to link `libtorch_cuda.so` dropped from ~7s to ~0.6s locally.

Also note `gold` has been marked deprecated recently[1].

[1] https://lwn.net/Articles/1007541/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146750
Approved by: https://github.com/albanD
2025-02-12 03:14:08 +00:00
28a2ab6b84 Clear CompiledTritonKernel cache after each inductor compile (#146925)
Fix a bug introduced by D69123174: because triton kernels are now returned directly by the worker, each future created by the triton kernel should only be used once per compile. Otherwise, a long-running process that does something like:

```
import torch

# fn1, fn2 and inputs are placeholders for two similar functions and their arguments
compiled_1 = torch.compile(mode="max-autotune", fullgraph=True)(fn1)
# run compiled_1
out_compiled = compiled_1(*inputs)
compiled_2 = torch.compile(mode="max-autotune", fullgraph=True)(fn2)
```

where `fn1` and `fn2` are very similar (i.e. would generate the same triton kernel source code), would use the launcher for the first autotuning run, set the launcher to None after running, and then use the same future/kernel again without regenerating the launcher.

Found this bug testing internal inference models.

This does not remove the caching support for @eellison's caching for prologue benchmarking, because that happens under the same compile: https://github.com/pytorch/pytorch/pull/143408

Differential Revision: [D69476856](https://our.internmc.facebook.com/intern/diff/D69476856/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D69476856/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146925
Approved by: https://github.com/laithsakka, https://github.com/jansel
ghstack dependencies: #146417
2025-02-12 02:38:42 +00:00
0acbf8039a [BE] Unskip some tensor creation tests on Mac (#146952)
Followup after https://github.com/pytorch/pytorch/pull/145367

One should never use skip, but rather xfail; otherwise one never knows when the test is finally fixed.

`test_float_to_int_conversion_finite` was fixed on macOS a while back (I guess around the time Intel builds were disabled), while `test_float_to_int_conversion_nonfinite` is fixed by https://github.com/pytorch/pytorch/pull/145367, which selects architecture-appropriate reference values for the Arm ISA.

Note that the result of casting a floating-point value to an integral type is undefined if the value is outside the integral type's dynamic range.

"Fixes" https://github.com/pytorch/pytorch/issues/38752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146952
Approved by: https://github.com/atalman, https://github.com/seemethere
2025-02-12 01:59:15 +00:00
78ebd3c502 Revert commit that removed windows testing in VS2019-> update (#146920)
This reverts commit b57b38b52ede2af27d4eb1bf6ba63868a3ee7553.

This commit removed windows testing for the VS build and needs to be added back in with the updated VS2022 build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146920
Approved by: https://github.com/seemethere, https://github.com/huydhn, https://github.com/atalman, https://github.com/malfet
2025-02-12 01:12:05 +00:00
df5e232563 [BE] Delete NCCL slimming (#146943)
It was added by https://github.com/pytorch/pytorch/pull/35843 and served its purpose when everything was linked statically in libtorch_cuda.so, but for all our releases it's no longer relevant as nccl is now a dynamic dependency of libtorch_cuda.so

Besides, it does not work with the CXX11 ABI anyway, and creates problems with newer versions of NCCL, when two `collectvies.o` are packaged into the library archive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146943
Approved by: https://github.com/Skylion007, https://github.com/atalman
2025-02-12 00:35:55 +00:00
a58f421f4b [CUDA][CUDNN][SDPA] Pass dropout seed and offset to cuDNN in int64 (#146734)
Workaround for limitation in cuDNN that does not accept dropout seed/offset in `int32` for SM 10.0 kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146734
Approved by: https://github.com/Skylion007
2025-02-12 00:24:38 +00:00
281249ba54 [torch][amdsmi] Avoid ODR violation when loading amdsmi (#146324)
Summary:
amdsmi bundles its own copy of `libamd_smi.so`. When you're interacting with `amdsmi` from *only* python that's fine, but when you try to interact with `libamd_smi.so` from native code too this poses a problem, because from native code you'll be linking against the copy of `libamd_smi.so` from the SDK.

This means you'll end up with 2 copies of `libamd_smi.so` in your process, and potentially (Murphy's law says you will, as does our CI) violate ODR.

In order to avoid this issue from the PT side of the world we can hook the `dlopen("path/to/bundled/libamd_smi.so")` and try to use the already loaded/SDK version of `libamd_smi.so` first, before proceeding to use the `path/to/bundled/libamd_smi.so`.

Test Plan: CI, inspect process using libamd_smi.so from native + python and observe only a single copy loaded

Differential Revision: D69064038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146324
Approved by: https://github.com/malfet
2025-02-12 00:01:02 +00:00
7aa629f126 Update octokit/request-action to 2.4.0 (#146940)
The current version 2.1.0 has disappeared since yesterday:

* https://github.com/pytorch/pytorch/actions/workflows/upload-torch-dynamo-perf-stats.yml
* https://github.com/pytorch/pytorch/actions/workflows/upload-test-stats.yml

The latest version is 2.4.0 https://github.com/octokit/request-action
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146940
Approved by: https://github.com/izaitsevfb
2025-02-11 23:50:24 +00:00
f50d359ce2 [ c10d ] modify API to get device string from device with torch.device (#146290)
Modify the ```get_default_backend_for_device()``` API to extract the device string using ```torch.device()```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146290
Approved by: https://github.com/guangyey, https://github.com/H-Huang
2025-02-11 23:30:57 +00:00
3a29992ee6 [associative_scan] Lifted arguments (#140043)
This PR implements lifted arguments for associative_scan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140043
Approved by: https://github.com/ydwu4
2025-02-11 23:25:55 +00:00
f59a56e56f [ARM] Fix test_float_to_int_conversion_nonfinite (#145367)
We have broken tests on AArch64 that are not enabled upstream; this PR fixes and enables those tests.

```
AssertionError: Tensor-likes are not equal!

Mismatched elements: 2 / 3 (66.7%)
Greatest absolute difference: 1 at index (1,)
Greatest relative difference: 1.0842021724855044e-19 at index (1,)

To execute this test, run the following from the base repo dir:
    python test/test_tensor_creation_ops.py TestTensorCreationCPU.test_float_to_int_conversion_nonfinite_cpu_int64

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145367
Approved by: https://github.com/malfet
2025-02-11 22:22:10 +00:00
a20055288f [DTensor][Test] Create a simple unit test for tensordot (#146514)
The dims and shape of the tensors are from a specific Shampoo use case. We want to create a unit test for it to make sure there are no regressions for this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146514
Approved by: https://github.com/tianyu-l, https://github.com/XilunWu
2025-02-11 21:57:56 +00:00
443437648a Revert "Introduce new template heuristic for triton autotune configs (#144985)"
This reverts commit 69301fb10eb3f7fd49af5c681a2e386af115baba.

Reverted https://github.com/pytorch/pytorch/pull/144985 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I think it needs a small tweak to avoid breaking some internal code ([comment](https://github.com/pytorch/pytorch/pull/144985#issuecomment-2652021045))
2025-02-11 20:42:41 +00:00
b1ff90ae8a remove Windows XPU build workaround. (#144644)
From the RFC: https://github.com/pytorch/pytorch/issues/141946
Fixes https://github.com/pytorch/pytorch/issues/134989

After we land these fixing PRs:
1. https://github.com/pytorch/pytorch/pull/142245
2. https://github.com/pytorch/pytorch/pull/141943

We can remove the Windows XPU workaround.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144644
Approved by: https://github.com/EikanWang, https://github.com/chuanqi129, https://github.com/gujinghui, https://github.com/atalman
2025-02-11 20:39:51 +00:00
664550ecbf [export] Serialize special values of float into strings for json. (#146490)
Summary: Currently inf is serialized as Infinity in JSON, which is not standard-compliant. Instead, we convert all special floating-point values into strings and handle them at the JSON layer.
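
A minimal sketch of the idea, using only the standard-library `json` module. The helper names and the exact string spellings (`"Infinity"`, `"-Infinity"`, `"NaN"`) are assumptions for illustration; the summary does not say which spellings the export serializer uses.

```python
import json
import math


def encode_float(x):
    """Map special floats to strings so the output is strict JSON."""
    if math.isnan(x):
        return "NaN"
    if math.isinf(x):
        return "Infinity" if x > 0 else "-Infinity"
    return x


def decode_float(v):
    """Inverse of encode_float, for fields known to hold floats."""
    if isinstance(v, str):
        return float(v)  # float() accepts "NaN", "Infinity" and "-Infinity"
    return v


values = [1.5, float("inf"), float("-inf"), float("nan")]
# allow_nan=False makes json.dumps reject bare Infinity/NaN tokens,
# so this call only succeeds because the special values are now strings.
payload = json.dumps([encode_float(v) for v in values], allow_nan=False)
restored = [decode_float(v) for v in json.loads(payload)]
```

Encoding at the field level keeps the document parseable by strict JSON readers, at the cost of the reader needing to know which fields are floats.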

Test Plan:
see D69060784
CI

Differential Revision: D69186425

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146490
Approved by: https://github.com/yiming0416
2025-02-11 20:01:27 +00:00
110638f702 [inductor] skip _test_insignificant_strides on rocm (#146849)
See https://github.com/pytorch/pytorch/issues/146848: the ROCm kernel for _scaled_dot_product_attention does not match the meta kernel regarding output shape. The CUDA kernel is fine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146849
Approved by: https://github.com/eellison, https://github.com/atalman, https://github.com/jansel
ghstack dependencies: #145904
2025-02-11 19:55:43 +00:00
b18e3c01aa [Inductor] Unifiy Low Precision FP Legalization for to_dtype_bitcast & constant (#144646)
The upcast in `to_dtype_bitcast()` breaks subsequent operations that only work with the target type (I use `bitwise_and` in the updated UT).
![image](https://github.com/user-attachments/assets/77a6f3b6-b5e7-4ed8-ab65-09d76f077376)

This PR fixes this problem. Let's check the CI results to make sure it doesn't bring accuracy problems.

- Unified the type promotion of low-precision FP operations in the legalize func, grouping ops into sources (whose results may be promoted) and sinks (whose inputs may be cast back). (The terms _sink_ and _source_ are from [graph theory](https://en.wikipedia.org/wiki/Directed_graph#Indegree_and_outdegree).)

## Test
```bash
pytest -vs test/inductor/test_torchinductor.py::CpuTests::test_float16_to_int16_cpu
pytest -vs test/inductor/test_torchinductor.py::CpuTests::test_bfloat16_to_int16_cpu
pytest -vs test/inductor/test_torchinductor.py::CpuTests::test_float32_to_int32_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144646
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-02-11 19:45:04 +00:00
af349047c3 [FlexAttention] Bug fix broken flag (#146872)
# Summary

I somehow broke this... I think claude was trippin

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146872
Approved by: https://github.com/BoyuanFeng
2025-02-11 19:42:37 +00:00
ebd992724f Implement serializable getattr support for tensor subclasses (#145772)
builtins.getattr is not serializable, so we replace it with a custom op that has a more refined schema.

Differential Revision: [D68899421](https://our.internmc.facebook.com/intern/diff/D68899421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145772
Approved by: https://github.com/bdhirsh
2025-02-11 19:05:14 +00:00
d5d3bdb55a Fix var CUDA_PATH_V128 in cuda128.bat file (#146906)
Followup after: https://github.com/pytorch/pytorch/pull/146653
This should fix upcoming CUDA 12.8 windows builds.
Issue found during pytorch-canary Windows AMI test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146906
Approved by: https://github.com/malfet, https://github.com/tinglvv
2025-02-11 18:43:55 +00:00
c7515da7b0 Implement cuda graphs implementation of torch.cond and torch.while_loop (#140979)
This is a new PR for #130386 , which got stale and was closed. Since I force-pushed to that branch in order to rebase it on top of main, the PR can no longer be reopened, according to https://github.com/isaacs/github/issues/361

I fixed the possibly-not-warmed-up problem described here: https://github.com/pytorch/pytorch/pull/130386/files#r1690856534

Since starting this, torch.cond and torch.while_loop now apparently have support for backward passes. I will look into what it might take to support that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140979
Approved by: https://github.com/eqy, https://github.com/eellison
2025-02-11 18:16:15 +00:00
e3839bd603 [BE] Strip #pragma once when embedding the headers (#146871)
This eliminates compiler warnings, for example when compiling a Metal shader with embedded headers:
```
 with program_source:6:9: warning: #pragma once in main file [-Wpragma-once-outside-header]
#pragma once
        ^
program_source:81:9: warning: #pragma once in main file [-Wpragma-once-outside-header]
#pragma once
        ^
program_source:588:9: warning: #pragma once in main file [-Wpragma-once-outside-header]
#pragma once
        ^
program_source:719:9: warning: #pragma once in main file [-Wpragma-once-outside-header]
#pragma once
        ^
program_source:829:29: error: use of undeclared identifier 'r0_2'
        auto tmp8 = in_ptr2[r0_2 + 768*x0];
```
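
As a rough illustration of the approach (not the actual implementation, which is not shown in this message), stripping the directive before concatenating headers into one program source could look like the sketch below. The regex and the helper names are assumptions.

```python
import re

# Matches a whole "#pragma once" line, tolerating extra spaces or tabs
# around the tokens, and consumes the trailing newline if present.
_PRAGMA_ONCE = re.compile(
    r"^[ \t]*#[ \t]*pragma[ \t]+once[ \t]*\r?\n?", re.MULTILINE
)


def strip_pragma_once(header_text):
    """Return header_text with every '#pragma once' line removed."""
    return _PRAGMA_ONCE.sub("", header_text)


def embed_headers(headers, body):
    """Concatenate stripped headers in order, followed by the shader body."""
    return "".join(strip_pragma_once(h) for h in headers) + body


source = embed_headers(
    ["#pragma once\nint helper();\n", "  # pragma   once\nint other();\n"],
    "int main_kernel();\n",
)
```

Once the headers are pasted into a single translation unit, the include guard has nothing left to guard, so dropping it is safe as long as each header is embedded only once.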

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146871
Approved by: https://github.com/dcci
2025-02-11 16:49:00 +00:00
861bf892fb Set USE_CUFILE=1 by default and add pypi package to binary build matrix (#145748)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145748
Approved by: https://github.com/atalman
2025-02-11 15:49:01 +00:00
5235a18cd6 [SkipFiles] remove some more stuff from MOD_SKIPLIST (#146876)
Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146876
Approved by: https://github.com/anijain2305
ghstack dependencies: #146854
2025-02-11 15:00:56 +00:00
fc5913b6bf [StaticRuntime] Fix a bug that memory planner ignores subblocks (#146728) (#146855)
Summary:

When a Static Runtime graph node has sub-blocks, the memory planner does not treat the sub-blocks' inputs as inputs of that node. As a result, the lifetimes of those inputs are computed incorrectly, and the corresponding tensor memory is released earlier than required, which causes errors.
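
The effect on liveness can be shown with a toy graph. This is a simplified sketch, not Static Runtime's planner: the `Node` class and both helpers are hypothetical, and values are plain strings.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    inputs: list
    outputs: list
    blocks: list = field(default_factory=list)  # each block is a list of Node


def inputs_with_subblocks(node):
    """All values read by the node itself or by any node nested in its blocks.

    Values defined inside a block are included too; that is harmless for
    this toy, which only looks at values defined in the outer graph.
    """
    used = set(node.inputs)
    for block in node.blocks:
        for inner in block:
            used |= inputs_with_subblocks(inner)
    return used


def last_use(nodes, include_subblocks):
    """Map each value to the index of the last top-level node that reads it."""
    result = {}
    for i, node in enumerate(nodes):
        used = inputs_with_subblocks(node) if include_subblocks else set(node.inputs)
        for value in used:
            result[value] = i
    return result


graph = [
    Node("produce", inputs=["x"], outputs=["a"]),
    Node("if", inputs=["cond"], outputs=["out"],
         blocks=[[Node("inner", inputs=["a"], outputs=["t"])]]),
]
buggy = last_use(graph, include_subblocks=False)
fixed = last_use(graph, include_subblocks=True)
```

In `buggy`, the value `a` has no recorded use at all, so a planner would free it right after `produce`; in `fixed`, its last use is the `if` node that contains the sub-block.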

Differential Revision: D69195886

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146855
Approved by: https://github.com/swolchok
2025-02-11 13:59:54 +00:00
cyy
15635b14ce [4/N] Remove unnecessary once flag usage (#146783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146783
Approved by: https://github.com/albanD
2025-02-11 13:55:06 +00:00
69301fb10e Introduce new template heuristic for triton autotune configs (#144985)
Initial PR to refactor the bulkiness of mm_common to allow for better device-specific specialisation, e.g. in https://github.com/pytorch/pytorch/pull/143286 we require large conditionalisation to get ROCm-specific optimisations in.

This PR introduces a new file `torch/_inductor/template_heuristics.py` which implements device specific subclasses for autotune configs:
- CPUConfigHeuristic()
- CUDAConfigHeuristic()
- ROCmConfigHeuristic()
- XPUConfigHeuristic()

These subclasses are integrated as part of the `InductorChoices` class, which will be the interface for the kernel files to access the configs.

The mm_common, mm_plus_mm and conv configurations are implemented in this class. In the future we plan to bring in flex attention configurations as well, so that all of the tuning config logic for templated triton kernels is handled in this file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144985
Approved by: https://github.com/jansel
2025-02-11 10:48:09 +00:00
229fb0bc83 [Dynamo][autograd.Function] Relax backward speculation strict mode: support .requires_grad (#146742)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146742
Approved by: https://github.com/zou3519
ghstack dependencies: #146571, #146741
2025-02-11 05:39:07 +00:00
f2da810516 [Dynamo][autograd.Function] Relax backward speculation strict mode: support .data (#146741)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146741
Approved by: https://github.com/zou3519
ghstack dependencies: #146571
2025-02-11 05:39:07 +00:00
29523aa113 [Dynamo][autograd.Function] Relax backward speculation strict mode a bit (#146571)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146571
Approved by: https://github.com/zou3519
2025-02-11 05:39:00 +00:00
a7fe384d0e Remove torch._higher_order_ops from MOD_SKIPLIST (#146853)
Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146853
Approved by: https://github.com/williamwen42
2025-02-11 04:38:26 +00:00
001ebbf734 [MTIA] (4/n) Implement PyTorch APIs to query/reset device peak memory usage (#146751)
Summary: Public summary (shared with Github): This diff updates the unit test for the PyTorch API "reset_peak_memory_stats".

Test Plan:
```
buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_reset_peak_memory_stats
```

https://www.internalfb.com/intern/testinfra/testrun/9007199321947161

Reviewed By: yuhc

Differential Revision: D68989900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146751
Approved by: https://github.com/nautsimon
2025-02-11 03:51:48 +00:00
23524699d5 Only call triton in worker process, kick off worker processes earlier, during inductor codegen (#146417)
### Big idea
This PR extends https://github.com/pytorch/pytorch/pull/144288 by combining calling triton in worker processes with the future cache: we kick off triton compilation in the worker processes earlier, during inductor codegen. Basically instead of calling async_compile.triton for the first time only after the entire code has been generated, we start compiling as soon as we know we'll need to compile the kernel. Then, when loading the generated inductor code, we can simply read from our in memory future cache, considerably increasing the parallelism.
### Implementation Overview
In total, the diff does the following:
- Converts TritonFuture to LambdaFuture, only calling triton.compile on worker processes
- Now that triton.compile() isn't called on the main process, we call TritonBundler on all compiled kernels when we get them back from workers
- Extend @eellison's future cache to a class, mostly as a refactor
- Finally, call async_compile.triton ahead of time in Scheduler.codegen if workers are warmed up. This causes the subsequent async_compile.triton call that occurs after codegen to cache hit on cold start.
In the diffs after this, I will add more to CompiledTritonKernels so that TritonBundler, on a warm start, automatically populates the in-memory cache with the existing triton kernels, avoiding calling triton altogether on warm starts.
Because LambdaFutures are much faster to kick off than TritonFutures, due to not needing to load from TritonCodeCache at all, the time spent kicking off these worker jobs is pretty minimal for inductor codegen.
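
The scheduling idea above can be sketched with the standard library. Everything here is a hypothetical stand-in: `slow_compile` plays the role of `triton.compile`, `FutureCache` plays the role of the in-memory future cache, and a thread pool is used instead of worker processes to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

compile_calls = []  # records which sources were actually compiled


def slow_compile(source):
    """Stand-in for an expensive kernel compilation."""
    compile_calls.append(source)
    return f"binary<{source}>"


class FutureCache:
    """Maps kernel source to the future of its in-flight compilation."""

    def __init__(self, pool):
        self._pool = pool
        self._futures = {}

    def submit(self, source):
        # The first request starts the compile; later requests reuse the future.
        future = self._futures.get(source)
        if future is None:
            future = self._pool.submit(slow_compile, source)
            self._futures[source] = future
        return future


def codegen_and_load(kernel_sources):
    with ThreadPoolExecutor(max_workers=4) as pool:
        cache = FutureCache(pool)
        # During codegen: kick off each compile as soon as the kernel is known.
        for source in kernel_sources:
            cache.submit(source)
        # When loading the generated code: the same requests hit the cache.
        return [cache.submit(source).result() for source in kernel_sources]


binaries = codegen_and_load(["add_kernel", "mul_kernel", "add_kernel"])
```

By the time the load step asks for a kernel, its compilation has already been running in the background, so the load step mostly waits on futures instead of starting work.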

Differential Revision: [D69123174](https://our.internmc.facebook.com/intern/diff/D69123174/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146417
Approved by: https://github.com/jansel
2025-02-11 03:46:16 +00:00
fe94ece375 Revert "Exclude upsample_bilinear2d.vec from default core ATen decomposition table (#141791)"
This reverts commit 3d604b17d91b928c850ded83b2ec25ea066bb3f6.

Reverted https://github.com/pytorch/pytorch/pull/141791 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/141791#issuecomment-2649717140))
2025-02-11 03:17:59 +00:00
30cbf13544 [PGNCCL] Associate tensor allocation support with NCCL version (#146842)
This is a forward fix to #146589.
For NCCL versions lower than 2.19, the previous PR would raise `RuntimeError: NCCL mem allocator is not supported in this NCCL version`.
This PR gates the support by checking link-time NCCL version via `ncclGetVersion`.
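
A rough sketch of such a gate, written in Python for illustration only. The integer encoding below is my recollection of the `NCCL_VERSION(X, Y, Z)` macro in `nccl.h` and should be treated as an assumption; the helper names are hypothetical, and the real check lives in C++.

```python
def nccl_version_code(major, minor, patch):
    """Encode a version triple as one integer.

    Assumed to mirror NCCL_VERSION(X, Y, Z): versions up to 2.8 use
    major * 1000, later versions use major * 10000.
    """
    if major <= 2 and minor <= 8:
        return major * 1000 + minor * 100 + patch
    return major * 10000 + minor * 100 + patch


# The message above puts the cutoff at 2.19.
MIN_MEM_ALLOCATOR_VERSION = nccl_version_code(2, 19, 0)


def mem_allocator_supported(linked_version_code):
    """True if the linked NCCL (as reported by a version query) is new enough."""
    return linked_version_code >= MIN_MEM_ALLOCATOR_VERSION


supported_old = mem_allocator_supported(nccl_version_code(2, 18, 5))
supported_new = mem_allocator_supported(nccl_version_code(2, 21, 5))
```

Checking the version of the library that is actually linked, rather than the headers used at build time, avoids advertising support that the linked library cannot provide.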

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146842
Approved by: https://github.com/XilunWu, https://github.com/wconstab, https://github.com/fduwjj
ghstack dependencies: #146589
2025-02-11 02:52:52 +00:00
1d81ecfc54 Rename PrimHOPBase to BaseHOP + minor changes (#146727)
This PR:
- renames PrimHOPBase to BaseHOP
- changes the backward pass to always return a tuple (to match the
  forward pass).

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146727
Approved by: https://github.com/ydwu4
2025-02-11 02:43:37 +00:00
275c034b16 [SkipFiles] remove some stuff from MOD_SKIPLIST (#146854)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146854
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2025-02-11 01:34:46 +00:00
5205158c1b Replace is_same with is_same_v for concise syntax (#145450)
Replace `std::is_same<T, U>::value` with `std::is_same_v<T, U>` for concise syntax consistent with other code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145450
Approved by: https://github.com/Skylion007
2025-02-11 01:34:15 +00:00
f38f1dcd82 Revert "move and fix logic to update unbacked bindings (#146115)"
This reverts commit 103c8b44bcb6fbf30b5411c5af19d312427525e7.

Reverted https://github.com/pytorch/pytorch/pull/146115 on behalf of https://github.com/huydhn due to This change has been reverted internally D69129334 but the OSS revert failed https://github.com/pytorch/pytorch/pull/146437 ([comment](https://github.com/pytorch/pytorch/pull/146115#issuecomment-2649610877))
2025-02-11 01:26:36 +00:00
0c9fdd6cfb [Docs] Fix description of input in torch.addbmm() (#146664)
Fixes #146613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146664
Approved by: https://github.com/mikaylagawarecki
2025-02-11 01:22:09 +00:00
2fafcd37c3 Revert "cpp_wrapper: Precompile device-specific header files (#144002)"
This reverts commit de6efa1feb0e8c9073640a77afdec1a53a477aed.

Reverted https://github.com/pytorch/pytorch/pull/144002 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this breaks some inductor tests running internally ([comment](https://github.com/pytorch/pytorch/pull/144002#issuecomment-2649569562))
2025-02-11 00:42:22 +00:00
d763093b49 [MPS] fix lu factor for large tensors with bs>1 (#146753)
Try this:
```python
import torch

batch_size = 2
A = torch.eye(256, device="mps")[None, :, :].expand(batch_size, -1, -1) + 0.1 * torch.randn((batch_size, 256, 256), device="mps")
A_cpu = A.cpu()
LU_cpu, pivots_cpu = torch.linalg.lu_factor(A_cpu)
LU, pivots = torch.linalg.lu_factor(A)
torch.testing.assert_close(LU.cpu(), LU_cpu)
```
You'll get a huge difference between the LU tensors:
<img width="706" alt="Screenshot 2025-02-08 at 12 14 39" src="https://github.com/user-attachments/assets/b45f2b3c-e0a5-49c8-aa07-42792150b781" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146753
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-11 00:37:07 +00:00
937b41e3b5 Refactoring pipeline parallelism test cases to be device agnostic [1/n] (#146472)
In this series of PRs we intend to refactor the pipeline parallelism test cases so that they are completely device agnostic.

These changes use the following approaches:

- Allowing for multiple device types using instantiate_device_type_test
- Replacing calls to cuda stream with torch.get_device_module(device) wherever it applies

This should result in improvement in usability for all devices

For this PR we have shown support for the following devices:

- CPU (wherever applicable)
- CUDA
- HPU
- XPU

To add another device, new users can simply append it to the device list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146472
Approved by: https://github.com/H-Huang
2025-02-11 00:13:23 +00:00
b6273d7f4b [ROCm] Update periodic.yml to use 2GPU runners (#146839)
Temporary fix for rocm workflow.
The 4-GPU runners have all been taken offline due to a network timeout issue, so we aren't able to run any periodic jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146839
Approved by: https://github.com/jeffdaily
2025-02-10 23:41:11 +00:00
aa1622c0b6 Support ignoring parameters in FSDP2 (#146631)
Differential Revision: D69153051

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146631
Approved by: https://github.com/awgu
2025-02-10 23:20:28 +00:00
c2bf3be011 [inductor] Remove _get_grid_fn_str (#146800)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146800
Approved by: https://github.com/yanboliang
2025-02-10 23:14:30 +00:00
0d5fb0941f [cutlass backend] check against arch >= 100 (#145812)
Summary:
Want to add a guard against silent fallback to SM90.

GenerateSM100 was just added 3 days ago. https://github.com/NVIDIA/cutlass/blame/main/python/cutlass_library/generator.py#L8896

It should show up in CUTLASS 3.8 (not pinned yet).

Test Plan: ci

Differential Revision: D68748705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145812
Approved by: https://github.com/chenyang78, https://github.com/ColinPeppler, https://github.com/Aidyn-A
2025-02-10 22:41:08 +00:00
bab35eb26a fix intermediate debug information with cpp_wrapper (#145527)
Summary: before fix, code like:
```cpp
    aoti_torch_print_tensor_handle(buf0, "after_launch - triton_poi_fused_randn_0 - buf0");
    aoti_torch_print_tensor_handle(buf1, "after_launch - triton_poi_fused_randn_0 - buf1");
    printf("[  after_launch - triton_poi_fused_randn_0 - 0: %ld  ]", 0); printf("
");
    printf("[  after_launch - triton_poi_fused_randn_0 - 1228800L: %ld  ]", 1228800L); printf("
");
```
was generated, which is a syntax error.

Test Plan:
New unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145527
Approved by: https://github.com/desertfire
2025-02-10 22:24:26 +00:00
681894546b Fix bazel job after #144489 (#146840)
This is currently failing in trunk with the following error https://github.com/pytorch/pytorch/actions/runs/13246034191/job/36972742610

### Testing

Bazel job passing https://github.com/pytorch/pytorch/actions/runs/13247495161/job/36977571965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146840
Approved by: https://github.com/atalman
2025-02-10 22:17:36 +00:00
652880e840 Fix logging and test files which misspell "precision" (#146113)
Noticed this while working on something, decided to submit a quick fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146113
Approved by: https://github.com/drisspg
2025-02-10 21:54:16 +00:00
e65b89e4cd [Feat]: Improve KleidiAI 4 bit kernel performance (#146476)
Description:
1. New thread blocking accelerates GEMVs
2. We increase throughput of the lhs quant pack + matmul pipeline by decoupling two operations.
3. The new blocking strategy blocks ```out_feature``` to accelerate GEMVs

Perf improvements:
12% speedup in the LLM prefill phase and up to 16% speedup in the autoregressive phase

Perf Benchmarking : https://github.com/pytorch/pytorch/issues/143289#issuecomment-2545773370

Change-Id: Ie574ff8459fdb75701ae366158b4e118c70694e4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146476
Approved by: https://github.com/malfet
2025-02-10 21:30:57 +00:00
4d626c261b Fix workarea compute in lapackSyevd (#146456)
Work-query APIs return floating point values, which can lose precision when converted back to int. Solve this by using `nextafter` and `ceil`.
Add regression test
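
A rough illustration of the rounding problem, using only the Python standard library. The helper names are hypothetical and this is not the ATen code; it only assumes the work query reports the size as a float32.

```python
import math
import struct


def to_float32(value: float) -> float:
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def next_float32_up(value: float) -> float:
    """Return the smallest float32 above a positive, finite float32 value."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]


def safe_work_size(reported: float) -> int:
    """Step one float32 up, then round up, so the size never undershoots."""
    return math.ceil(next_float32_up(reported))


required = 2**24 + 1                    # 16777217 has no exact float32 form
reported = to_float32(float(required))  # what a float32 work query could report
print(int(reported) < required)         # the naive conversion comes up short
print(safe_work_size(reported) >= required)
```

Stepping up by one representable value before taking the ceiling over-allocates slightly, which is the safe direction for a workspace size.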

Fixes #145801

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146456
Approved by: https://github.com/malfet
2025-02-10 21:29:48 +00:00
8f073065d5 [while_loop][inductor] support sym expression as cond_fn output (#146222)
As titled. Previously, we only supported a tensor output from cond_fn; this PR also allows cond_fn to return a shape expression.

aoti generated output code looks like:
```
V0203 11:28:05.750000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]     bool buf7_cond_result;
....
(while_loop_cond_graph_0_arg2_1_handle);
V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]         buf7_cond_result = u0 + u1 < 10L;
V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]         if (!buf7_cond_result) break;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146222
Approved by: https://github.com/desertfire
2025-02-10 21:25:40 +00:00
97d4753bd3 [hop][inductor] don't promote arg type for cond and while_loop (#146660)
HOP subgraph codegen assumes that argument types are not promoted. Otherwise, we might generate a wrong kernel.

Differential Revision: [D69279031](https://our.internmc.facebook.com/intern/diff/D69279031)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146660
Approved by: https://github.com/zou3519, https://github.com/eellison
2025-02-10 21:24:52 +00:00
da216baaa2 Optimize inductor Self typing (#146669)
Replace method return type with `Self` typing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146669
Approved by: https://github.com/jansel
2025-02-10 20:39:56 +00:00
86b52f4209 Fix lint (#146846)
[Fixes #ISSUE_NUMBER](https://github.com/pytorch/pytorch/actions/runs/13248382636/job/36980294598)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146846
Approved by: https://github.com/huydhn, https://github.com/clee2000
2025-02-10 20:00:29 +00:00
3d604b17d9 Exclude upsample_bilinear2d.vec from default core ATen decomposition table (#141791)
As upsample_bilinear2d.vec is a core ATen op, it should not be decomposed by default in the export path. Because the operator has CompositeImplicitAutograd dispatch, its decomposition is registered by default. This change adds an override list for CIA decompositions being registered in the default decomp table.
In the long-term, we likely will want to exclude decompositions for all core-tagged CIA ops, but this will require all consumers to be ready to handle the remaining three ops: upsample_nearest2d.vec, avg_pool1d, and adaptive_avg_pool1d. Until they are ready, I believe an explicit override list is the safest option.

Additionally, I've removed the ExecuTorch XNNPACK delegate ConvertToUpsampleBilinear2d pass, as the pass breaks (and is not needed) now that the op is not decomposed. The purpose of this pass was originally to pattern match the decomposition and undo it, but this is no longer necessary.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141791
Approved by: https://github.com/tugsbayasgalan, https://github.com/digantdesai
2025-02-10 19:30:19 +00:00
97f6480cf5 Fix an issue where functional collectives don't force fx stride on inputs when compiled (#146467)
Fixes https://github.com/pytorch/pytorch/issues/146416

Also added contiguity checks in the C++ functional collective ops to prevent striding issues introduced during compilation from manifesting as silent correctness issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146467
Approved by: https://github.com/Chillee, https://github.com/lw, https://github.com/shunting314
2025-02-10 19:15:49 +00:00
3822a88d21 [symbolic shapes] Log symnode id (#146583)
We want to log the symnode id which will help us with provenance tracking between expressions created.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146583
Approved by: https://github.com/bobrenjc93
2025-02-10 19:13:06 +00:00
b45e6fa707 Cleanup VS 2019 refs in pytorch (#145863)
Related to: https://github.com/pytorch/pytorch/issues/128835
Follow up on PR: https://github.com/pytorch/pytorch/pull/145319
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145863
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/huydhn, https://github.com/atalman
2025-02-10 19:05:35 +00:00
c02a1ecc1d [export][ez] Allow math.trunc for serialization. (#146715)
Summary: as title.

Test Plan: CI

Differential Revision: D69317084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146715
Approved by: https://github.com/angelayi
2025-02-10 19:05:07 +00:00
9b7d050600 Move capture_provenance to make_node_impl (#146625)
Previously we were only logging `make_user_impl` implementations, which only gets triggered for operations done on python SymInts, not cpp SymInts. Instead `make_node_impl` will get triggered for both python and cpp SymInt operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146625
Approved by: https://github.com/bobrenjc93
2025-02-10 19:00:51 +00:00
0486a996d2 [sigmoid] Implement a OSS only model runner. (#146440)
Summary: Implement an OSS version of modelrunner with clean dependencies. The new OSS model runner removes thrift and only uses a JSON header to load the model.

Test Plan: Test will be added in the next diff separately. (D69060784)

Differential Revision: D68846877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146440
Approved by: https://github.com/SherlockNoMad
2025-02-10 18:54:05 +00:00
519f547d05 windows Magma build for cu128 (#146653)
https://github.com/pytorch/pytorch/issues/145570

Removing `.ci/pytorch/windows/internal/cuda_install.bat` as it is a duplicate of `.github/scripts/windows/cuda_install.bat`. The latter is the one in use - https://github.com/pytorch/pytorch/pull/146653/files#diff-613791f266f2f7b81148ca8f447b0cd6c6544f824f5f46a78a2794006c78957bR8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146653
Approved by: https://github.com/atalman

Co-authored-by: atalman <atalman@fb.com>
2025-02-10 18:34:59 +00:00
ad847da0cf [cutlass backend] fix bug for accuminator dtype (#146356)
Will add unit tests for accuracy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146356
Approved by: https://github.com/Chillee
2025-02-10 18:20:58 +00:00
ddcc97bb8c Make sure cutlass kernel .cu file has configuration name and nvcc compile command (#146668)
I think it's good to have everything in the .cu file, especially the nvcc compile command.

Technically, the configuration name can already be found in the template, so let me know if you think it's not needed.

Differential Revision: D69281295

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146668
Approved by: https://github.com/chenyang78
2025-02-10 18:16:44 +00:00
6b3f51f870 use None to slice when list has one element only (#146638)
When autotune_num_choices_displayed is None and the list of choices has length 1, slicing with `[:-1]` means getting all elements except the last one, which resulted in an empty list.

Slicing with `[:None]` works.
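
A small standalone illustration of the slicing behavior described above. The function and the choice name are hypothetical, not the inductor code:

```python
def displayed_choices(choices, num_displayed):
    """Return the first num_displayed choices, or all of them when it is None."""
    return choices[:num_displayed]


single = ["triton_mm_0"]
print(single[:-1])                      # -1 drops the last element, nothing is left
print(displayed_choices(single, None))  # None means "no upper bound"
```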

Differential Revision: D69265168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146638
Approved by: https://github.com/drisspg
2025-02-10 18:15:45 +00:00
374b762bbf [ez][BE] get rid of the extra printf('\n') (#146726)
Summary: as title

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3  TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100a @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_cuda
```

Differential Revision: D69328701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146726
Approved by: https://github.com/ColinPeppler
2025-02-10 17:45:55 +00:00
5fd15a04b7 [ROCm] Enable inductor-periodic testing for MI300 (#144594)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144594
Approved by: https://github.com/malfet, https://github.com/huydhn

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-02-10 17:42:09 +00:00
b8261358ca Revert "windows Magma build for cu128 (#146653)"
This reverts commit d0e70c4fd33d9accca2c66203c19372733a83ea1.

Reverted https://github.com/pytorch/pytorch/pull/146653 on behalf of https://github.com/jeanschmidt due to Seems to have broken some windows tests, reverting to see if it gets green ([comment](https://github.com/pytorch/pytorch/pull/146653#issuecomment-2648769150))
2025-02-10 17:36:32 +00:00
cbbb11d967 [dynamo][user-defined] Unify standard and non-standard __new__ codebase (#146737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146737
Approved by: https://github.com/jansel
ghstack dependencies: #146677
2025-02-10 17:31:13 +00:00
ee8a06f1f6 [dynamo][user-defined] User class.__new__ instead of special casing (#146677)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146677
Approved by: https://github.com/jansel
2025-02-10 17:31:13 +00:00
de6efa1feb cpp_wrapper: Precompile device-specific header files (#144002)
This saves us about a second per compilation, which is _massive_ for the OpInfo tests. Total OpInfo test runtime is down about 2x from this change alone.

Differential Revision: [D69185685](https://our.internmc.facebook.com/intern/diff/D69185685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144002
Approved by: https://github.com/desertfire
2025-02-10 17:13:09 +00:00
3cadce7af2 [NJT] Fix inference mode for composite implicit ops without nested-specific kernel (#146633)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146633
Approved by: https://github.com/jbschlosser
2025-02-10 16:59:48 +00:00
dfe3b64282 [mps] Implement eager support for spherical_bessel_j0 (#146818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146818
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-10 16:58:05 +00:00
5f621c5879 [MTIA] (3/n) Implement PyTorch APIs to query/reset device peak memory usage (#146710)
Summary: Public summary (shared with Github): This diff implements a C++-Python binding to enable `reset_peak_memory_stats`.

Test Plan: The test is implemented in the following diff.

Reviewed By: yuhc

Differential Revision: D68988673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146710
Approved by: https://github.com/nautsimon
2025-02-10 16:57:09 +00:00
68c9e22ef7 FSDP: avoid resetting version counter of all_gather_output in inference_mode (#146709)
Summary:
FSDP needs to hide VC bumps on its allgather buffer, but it does not need to do this if the allgather buffer was generated under inference mode.

more details here: https://www.internalfb.com/diff/D69115649?dst_version_fbid=1316814572779281&transaction_fbid=849120230625711

Test Plan: CI

Differential Revision: D69311496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146709
Approved by: https://github.com/awgu
2025-02-10 16:56:40 +00:00
6aa924af68 Revert "[ONNX] Create deprecation warning on dynamo_export (#146425)"
This reverts commit 41e6d189a39a40b237ab9b9ab195cec1194b331b.

Reverted https://github.com/pytorch/pytorch/pull/146425 on behalf of https://github.com/atalman due to Broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/146425#issuecomment-2648472579))
2025-02-10 15:54:34 +00:00
1557b7bf9a Revert "[ONNX] Adjust and add deprecation messages (#146639)"
This reverts commit 63c2909ae3e293dee96bca5af88bc51d8ca0ce10.

Reverted https://github.com/pytorch/pytorch/pull/146639 on behalf of https://github.com/atalman due to Sorry Need to revert https://github.com/pytorch/pytorch/pull/146425 ([comment](https://github.com/pytorch/pytorch/pull/146639#issuecomment-2648465047))
2025-02-10 15:51:52 +00:00
a36c22f2ed futher scheduler changes for invoke_quant: prologue low prec, (slightly) more aggressive fusion (#145104)
Respect invoke_quant low precision options. Also, be more aggressive in attempting fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145104
Approved by: https://github.com/shunting314, https://github.com/jansel
ghstack dependencies: #139102
2025-02-10 15:50:19 +00:00
899066eedf Fix round(...) with constants (#146495)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146495
Approved by: https://github.com/anijain2305
2025-02-10 15:08:09 +00:00
611ca163fd [MPS] Add bilineard2d_aa implementation (#145526)
An interesting quirk of the algorithm, which is not very well documented, is that the value of align_corners is ignored in antialias mode, see the arguments of
e8304f08fe/aten/src/ATen/native/cpu/UpSampleKernel.cpp (L747-L751)

Error out on the uint8 implementation (as it relies on very fragile integer arithmetic), as it's not implemented on any other accelerator devices at the moment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145526
Approved by: https://github.com/dcci
2025-02-10 15:03:14 +00:00
d0e70c4fd3 windows Magma build for cu128 (#146653)
https://github.com/pytorch/pytorch/issues/145570

Removing `.ci/pytorch/windows/internal/cuda_install.bat` as it is a duplicate of `.github/scripts/windows/cuda_install.bat`. The latter is the one in use - https://github.com/pytorch/pytorch/pull/146653/files#diff-613791f266f2f7b81148ca8f447b0cd6c6544f824f5f46a78a2794006c78957bR8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146653
Approved by: https://github.com/atalman

Co-authored-by: atalman <atalman@fb.com>
2025-02-10 13:48:55 +00:00
6f15a609d3 Test typing of arithmetic operators on Tensor (see #145838) (#146426)
See #145838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146426
Approved by: https://github.com/Skylion007
2025-02-10 12:19:56 +00:00
c24038025d [ROCm] Unskip std:bad_alloc failures (#146407)
The flaky MI300 issue related to memory usage should now be resolved after https://github.com/pytorch/pytorch/actions/runs/13007160888?pr=145829.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146407
Approved by: https://github.com/jeffdaily
2025-02-10 11:01:56 +00:00
c88ae00692 fix: replace stderr with stdout for download messages in hub.py (#146475)
This PR addresses an issue where download logs in `hub.py` are sent to `stderr` instead of `stdout`. Hence, when running models with workers, these messages are incorrectly categorized as errors, leading to confusion.
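
A sketch of the idea, not the actual `hub.py` code. The function name, the message format, and the `stream` parameter are assumptions made for illustration:

```python
import io
import sys


def log_download(url, dst, stream=None):
    """Write an informational download message, defaulting to stdout."""
    # Resolve sys.stdout at call time so later redirection is respected.
    stream = sys.stdout if stream is None else stream
    stream.write(f'Downloading: "{url}" to {dst}\n')


# Capture the message in memory to show where it ends up.
buffer = io.StringIO()
log_download("https://example.com/model.pth", "/tmp/model.pth", stream=buffer)
print(buffer.getvalue(), end="")
```
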
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146475
Approved by: https://github.com/mikaylagawarecki
2025-02-10 10:46:10 +00:00
6667e5d786 [dim order] solve broken doc (#146641)
Differential Revision: [D69265340](https://our.internmc.facebook.com/intern/diff/D69265340/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146641
Approved by: https://github.com/svekars, https://github.com/Jack-Khuu
2025-02-10 07:51:26 +00:00
c4d835fbab [DTensor][conv] add DTensor convolution_backward op support for case where the input Tensor has requires_grad=False (#142278)
Fixes #142058

## Summary
DTensor `convolution_backward` op throws exception when the input Tensor has `requires_grad=False` which happens if the conv layer is the first layer in the model.

The ATen `convolution_backward` op usually returns 3 Tensors (grad_input, grad_weight, grad_bias), and `grad_input` is actually an Optional[Tensor] which can be `None` in the case mentioned above.

However, the DTensor sharding propagation rule and the corresponding TP conv backward implementation both assume that `grad_input` exists.

## Fix
allow the `grad_input` to be `None` for `convolution_backward` op.

## Test
`pytest test/distributed/tensor/test_convolution_ops.py`

## Follow-up
The current implementation of DTensor conv op also ignores `output_mask` and this may need further care.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142278
Approved by: https://github.com/bdhirsh
2025-02-10 07:06:40 +00:00
effc545274 [DDP] Use NCCL allocated memory for gradient bucket (#146589)
So that NVLink SHARP comes with zero-copy on H100+ platforms, for DDP applications.
Less SM usage, less memory contention between NCCL kernel and compute kernels.

Added env `DDP_DISABLE_COMM_MEM` as a back-out option:
```
An environment variable to disable comm-optimized memory pool.
Default is 0, which means comm-optimized memory pool is enabled.
Users can set it to 1 in case of seeing regression or OOM (because this
comm MemPool may not share space with regular compute MemPool).
```
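
For example, the back-out switch could be set from Python before the process builds its DDP model. The helper below is a hypothetical mirror of the check; how DDP actually parses the variable is an assumption here:

```python
import os

# Opt out of the comm-optimized memory pool (1 disables it, default is 0).
os.environ["DDP_DISABLE_COMM_MEM"] = "1"


def comm_mem_disabled() -> bool:
    """Hypothetical mirror of the check: True when the pool is disabled."""
    return os.environ.get("DDP_DISABLE_COMM_MEM", "0") == "1"


print(comm_mem_disabled())
```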

Differential Revision: [D69297766](https://our.internmc.facebook.com/intern/diff/D69297766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146589
Approved by: https://github.com/syed-ahmed, https://github.com/c-p-i-o, https://github.com/fduwjj
2025-02-10 05:23:11 +00:00
387c993c3b [ca] remove private API: _compiled_autograd_should_lift (#146720)
Since the functional autograd + compiled autograd migration, we don't trace into nodes anymore, and everything is lifted. We can't support this flag which tries to inline make_fx style in CA initial pass. There's no more usage internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146720
Approved by: https://github.com/zou3519
2025-02-10 04:29:57 +00:00
e8304f08fe Fix torch.take_along_dim param type and default description (#146474)
## Changes

- Change type description to `LongTensor`, consistent with [`torch.take`](https://pytorch.org/docs/stable/generated/torch.take.html)
- Add `dim` param default value description

## Test Result

**Before**
![image](https://github.com/user-attachments/assets/720ce158-2bc1-48b5-a188-56fcc7188d96)

**After**
![image](https://github.com/user-attachments/assets/05fe20bd-9476-4b97-ac2b-9b161d6532a1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146474
Approved by: https://github.com/mikaylagawarecki
2025-02-10 01:19:30 +00:00
298226f358 [dynamo] check for incompatible configs (#146513)
internal: https://fb.workplace.com/groups/1075192433118967/permalink/1599802033991335/

Assuming flags don't change during compilation, we shouldn't allow incompatible configs to be set at torch.compile wrap time.

Not in this PR: For flags that need to change during compilation, we'd have to be strict about where they can be used in the compile lifecycle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146513
Approved by: https://github.com/williamwen42

Co-authored-by: Gabriel Ferns <gabeferns@meta.com>
2025-02-10 00:44:23 +00:00
2a55311773 [cuda] Simplify the sinc function a bit. (#146774)
`else` after `return` can be removed & the indentation can be reduced, for readability.
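
The same early-return shape, sketched in Python rather than CUDA. This illustrates the style change only and is not the kernel code; it assumes the normalized definition sin(pi * x) / (pi * x):

```python
import math


def sinc(x: float) -> float:
    """Normalized sinc: sin(pi * x) / (pi * x), with sinc(0) defined as 1."""
    if x == 0.0:
        return 1.0
    # No `else` needed: the early return already handled the special case.
    pix = math.pi * x
    return math.sin(pix) / pix


print(sinc(0.0), sinc(0.5))
```
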
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146774
Approved by: https://github.com/malfet
2025-02-09 20:09:34 +00:00
b133907d0a Update strided test to float32 (#146748)
Fixes #146377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146748
Approved by: https://github.com/BoyuanFeng, https://github.com/leijurv
2025-02-09 17:41:35 +00:00
91c4bf39d3 [mps] Add a shader for spherical_bessel_j0. (#146771)
In preparation for adding the operation to inductor/eager.
Adapted from the CUDA version of the shader.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146771
Approved by: https://github.com/malfet
2025-02-09 05:11:17 +00:00
0e83e7d56e [EZ] Add logic to build Metal shader with debug info (#146768)
By appending `-frecord-sources -gline-tables-only` to the compilation command

Helpful when debugging shaders compiled into libtorch

Test plan: Run
`python ../tools/build_with_debinfo.py ../aten/src/ATen/native/mps/kernels/UpSample.metal ../aten/src/ATen/native/mps/operations/UpSample.mm`
And then run the following to capture the shader and check that it contains debug info:
```python
import torch
import os
os.environ["MTL_CAPTURE_ENABLED"]="1"
inp = torch.rand(size=(6, 3, 10, 20), device="mps", dtype=torch.float32)
with torch.mps.profiler.metal_capture("bilinear2d"):
    out = torch.nn.functional.interpolate(inp, scale_factor=(1.7,0.9), mode="bilinear")
```
<img width="769" alt="image" src="https://github.com/user-attachments/assets/e0316c1c-07a4-4da5-97b9-886c56857c1d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146768
Approved by: https://github.com/dcci
2025-02-08 23:40:23 +00:00
6a9a02acbe Set enable_faithful_generator_behavior flag to True (#142513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142513
Approved by: https://github.com/zou3519
ghstack dependencies: #141055, #144421, #144422, #144423, #144424, #144420, #145223
2025-02-08 22:42:12 +00:00
580a305681 Raise MutationError if there are side effects when returning generator (#145223)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145223
Approved by: https://github.com/zou3519
ghstack dependencies: #141055, #144421, #144422, #144423, #144424, #144420
2025-02-08 22:42:12 +00:00
68cfd36c11 Add CLEANUP_THROW bytecode (#144420)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144420
Approved by: https://github.com/zou3519
ghstack dependencies: #141055, #144421, #144422, #144423, #144424
2025-02-08 22:42:12 +00:00
53ab82d8f5 Implement generator.throw(exception) (#144424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144424
Approved by: https://github.com/zou3519
ghstack dependencies: #141055, #144421, #144422, #144423
2025-02-08 22:42:12 +00:00
8ee095f7c1 Implement generator.close() (#144423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144423
Approved by: https://github.com/zou3519
ghstack dependencies: #141055, #144421, #144422
2025-02-08 22:42:12 +00:00
ca9b16e070 Implement generator.send(..) (#144422)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144422
Approved by: https://github.com/zou3519
ghstack dependencies: #141055, #144421
2025-02-08 22:42:12 +00:00
d798831167 Implement generator.__iter__() (#144421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144421
Approved by: https://github.com/zou3519
ghstack dependencies: #141055
2025-02-08 22:42:12 +00:00
8603a1c870 Suport generators (#141055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141055
Approved by: https://github.com/zou3519
2025-02-08 22:42:12 +00:00
ade8fee512 Use c10 version of half/bfloat16 in executorch (#144111)
Summary:
X-link: https://github.com/pytorch/executorch/pull/7040

Accomplished by importing relevant files from c10 into
executorch/runtime/core/portable_type/c10, and then using `using` in
the top-level ExecuTorch headers. This approach should keep the
ExecuTorch build hermetic for embedded use cases. In the future, we
should add a CI job to ensure the c10 files stay identical to the
PyTorch ones.
ghstack-source-id: 260047850
exported-using-ghexport

Test Plan: builds

Differential Revision: D66106969

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144111
Approved by: https://github.com/malfet
2025-02-08 22:40:14 +00:00
92b7e610ab [Inductor changes] Invoke Quant (#139102)
Adds an `invoke_quant` higher order operator as proposed [here](https://docs.google.com/document/d/1s2PfJlq6Q1F8l11CkTIC69BW1rEnGEgs6YmBC7hu8rA/edit?tab=t.0).

The primary motivations are

- Unifying scattered reasoning for quant operators throughout the code base

- Ease of pattern matching - see this very large pattern match expression [here](949fdd2997/torch/_inductor/fx_passes/post_grad.py (L390-L426). Compare that to the pattern I have in the tests:

```
        @register_graph_pattern(
            CallFunction(
                torch.ops.aten.mm,
                CallFunction(
                    torch.ops.higher_order.invoke_quant,
                    Ignored(),
                    Ignored(),
                    Ignored(),
                    scheme="nf4",
                ),
                Arg(),
            ),
            pass_dict=test_pass,
        )
```

- Ability to specify inductor specific logic, like codegen'ing the operators in lower precision, or forcing fusion to a matmul.

Example graph:

``` Python
 ===== AFTER POST GRAD =====
 /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: "f32[8][1]cpu", arg1_1: "f32[8][1]cpu"):
         # File: /data/users/eellison/pytorch/torch/_higher_order_ops/invoke_quant.py:87 in __call__, code: return invoke_quant_tracer(*args, **kwargs, quant_options=self)  # type: ignore[call-arg]
        repeated_subgraph0 = self.repeated_subgraph0
        invoke_quant: "f32[8][1]cpu" = torch.ops.higher_order.invoke_quant(repeated_subgraph0, arg0_1, arg1_1, scheme = 'nf4');  repeated_subgraph0 = arg0_1 = arg1_1 = None
        return (invoke_quant,)

    class repeated_subgraph0(torch.nn.Module):
        def forward(self, arg0_1: "f32[8][1]cpu", arg1_1: "f32[8][1]cpu"):
             # File: /data/users/eellison/pytorch/torch/_higher_order_ops/invoke_quant.py:87 in __call__, code: return invoke_quant_tracer(*args, **kwargs, quant_options=self)  # type: ignore[call-arg]
            mul: "f32[8][1]cpu" = torch.ops.aten.mul.Tensor(arg0_1, arg1_1);  arg0_1 = None
            add: "f32[8][1]cpu" = torch.ops.aten.add.Tensor(mul, arg1_1);  mul = arg1_1 = None
            return add
```

The schema for `invoke_quant` is `torch.ops.higher_order.invoke_quant(subgraph, *args, scheme=None)` where the scheme will not always be present.

I wasn't sure exactly how the inductor-specific configurations like `codegen_low_precision` should be passed through. I didn't want to stuff them all in as kwargs, and I didn't want them to affect pattern matching, so they are stored as meta on the node itself. Following that, I wanted the invocation of the hop to match how it shows up in the graph, so I decided to make it an object that is then invoked for the tracing.

```
invoke_quant = InvokeQuant(codegen_low_precision=True)
invoke_quant(gn, (x, y), scheme="nf4")
```
Todo - not require the packing of args in a tuple, will do following https://github.com/pytorch/pytorch/pull/139162.
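
The design described above (options carried on a callable object instead of being passed as kwargs) can be sketched in plain Python. This is only an illustration under assumptions: `InvokeQuantSketch`, `fake_tracer`, and the returned dict are hypothetical stand-ins and not the actual `InvokeQuant` implementation.

```python
from dataclasses import dataclass


def fake_tracer(subgraph, *args, scheme=None, quant_options=None):
    # Hypothetical stand-in for the real tracer: the options travel as
    # node metadata and stay out of the kwargs a pattern would match on.
    return {
        "kwargs": {} if scheme is None else {"scheme": scheme},
        "meta": {"quant_options": quant_options},
        "value": subgraph(*args),
    }


@dataclass(frozen=True)
class InvokeQuantSketch:
    # Inductor-specific knob; it is never passed as a kwarg of the call.
    codegen_low_precision: bool = True

    def __call__(self, subgraph, *args, scheme=None):
        # Mirrors invoke_quant(subgraph, *args, scheme=None) from the text.
        return fake_tracer(subgraph, *args, scheme=scheme, quant_options=self)


def gn(x, y):
    return x * y + y


node = InvokeQuantSketch(codegen_low_precision=True)(gn, 2.0, 3.0, scheme="nf4")
```

Because the options live in `node["meta"]`, a pattern that only inspects `scheme` would match the same way whether or not `codegen_low_precision` is set.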

Feedback welcome.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139102
Approved by: https://github.com/Chillee
2025-02-08 19:30:19 +00:00
a1bfb39a31 [Inductor] Expand Identity ops prior to block pattern matching (#146000)
# Feature

Inductor sometimes uses `Identity` functions to group various terms of an expression. While this is convenient in some scenarios, it can frustrate pattern matching. For example, when we're matching an indexing expression to tell if it can be represented as a block pointer, that analysis should be invariant to `Identity`'s.

This PR adds a few features to achieve this invariance.
 - Create a new expansion mode `expr.expand(identity=True)`, which removes all `Identity` functions from the expression.
 -  Preprocess the expression with this expansion prior to pattern matching.
 - Bonus: create a new test utility function called `dummy_graph()`, which creates a simple `GraphLowering`. This is useful for testing the pattern matcher, as we need to initialize `V.graph` before we can access `V.graph.sizevars`.
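
To illustrate why stripping `Identity` wrappers helps a structural matcher, here is a toy sketch on a hand-rolled expression tree. It does not use Inductor or SymPy; `Sym`, `Add`, `Mul`, `strip_identity`, and `matches_stride_pattern` are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sym:
    name: str


@dataclass(frozen=True)
class Identity:
    arg: object


@dataclass(frozen=True)
class Add:
    left: object
    right: object


@dataclass(frozen=True)
class Mul:
    left: object
    right: object


def strip_identity(expr):
    """Return expr with every Identity wrapper removed, recursively."""
    if isinstance(expr, Identity):
        return strip_identity(expr.arg)
    if isinstance(expr, (Add, Mul)):
        return type(expr)(strip_identity(expr.left), strip_identity(expr.right))
    return expr  # symbols and integer constants are leaves


def matches_stride_pattern(expr):
    """Toy matcher: accepts only expressions shaped like symbol * constant + rest."""
    return (
        isinstance(expr, Add)
        and isinstance(expr.left, Mul)
        and isinstance(expr.left.left, Sym)
        and isinstance(expr.left.right, int)
    )


# The same index expression, with grouping wrappers around each term.
wrapped = Add(Identity(Mul(Sym("x"), 64)), Identity(Sym("y")))
```

The matcher rejects `wrapped` as written but accepts `strip_identity(wrapped)`, which is the invariance the preprocessing step is meant to provide.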

# Test plan
This PR adds a few new unit tests:
 - Added a unit test specifically for `expr.expand(identity=True)`.
 - Added a new unit test module for the block pattern matcher. Tested that we can correctly match some example patterns containing Identity ops.

I originally intended to add an end to end test compiling pointwise cat, and mapping the corresponding memory accesses to block pointers. However, it looks like that will take more work, since the [relevant code path](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/triton.py#L1306) disables block pointer analysis. It might be better to defer that to a future PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146000
Approved by: https://github.com/eellison, https://github.com/jansel
2025-02-08 18:11:53 +00:00
eee5622b98 [inductor] Pre-populate cache for simplify_with_ranges return value (#146373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146373
Approved by: https://github.com/yanboliang, https://github.com/shunting314
ghstack dependencies: #146252, #146254, #146255, #146257, #146282, #146297
2025-02-08 18:00:49 +00:00
c098385cb3 [inductor] Refactor CaptureIndexing into global scope (#146297)
And inline SimplifyIndexing into CaptureIndexing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146297
Approved by: https://github.com/shunting314
ghstack dependencies: #146252, #146254, #146255, #146257, #146282
2025-02-08 18:00:49 +00:00
d35f6b2339 [inductor] Minor compile time optimizations in DefaultHandler (#146282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146282
Approved by: https://github.com/shunting314
ghstack dependencies: #146252, #146254, #146255, #146257
2025-02-08 18:00:40 +00:00
06604c4ec1 [inductor] Refactor op handlers part 5 (#146257)
This makes OpHandler just a normal class using inheritance, and removes typing workarounds needed because it wasn't

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146257
Approved by: https://github.com/shunting314
ghstack dependencies: #146252, #146254, #146255
2025-02-08 18:00:30 +00:00
403db2faee [inductor] Refactor op handlers part 4 (#146255)
This replaces the `__getattr__()` pattern used in remaining OpHandlers with a `DefaultHandler` class defined in part 2.

Some compile time wins from this as well:
```
2025-02-02T19:46:32.2033010Z
2025-02-02T19:46:32.2036607Z WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 29633182927 is -1.71% lower than expected 30150000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2037575Z
2025-02-02T19:46:32.2037907Z please update all results that changed significantly, and not only the failed ones
2025-02-02T19:46:32.2039291Z PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 43986879172 -1.02% is within expected 44440000000 ±2.50%
2025-02-02T19:46:32.2040131Z
2025-02-02T19:46:32.2041180Z WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26246225695 is -1.85% lower than expected 26740000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2042188Z
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146255
Approved by: https://github.com/shunting314
ghstack dependencies: #146252, #146254
2025-02-08 18:00:17 +00:00
0e31e5932b [inductor] Refactor op handlers part 3 (#146254)
Fixes type errors that arise from typing `V.ops`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146254
Approved by: https://github.com/shunting314
ghstack dependencies: #146252
2025-02-08 18:00:08 +00:00
71498aeae3 [inductor] Refactor op handlers part 2 (#146252)
This replaces the `__getattr__()` pattern used in (some) OpHandlers with a `DefaultHandler` class that has an implementation of every op that calls `self._default()`.
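
A minimal sketch of the difference between the two styles, assuming a much smaller op set than the real handlers have. `GetattrHandler`, `DefaultHandlerSketch`, and `CountingHandler` are hypothetical names, and the `_default(name, args, kwargs)` signature is an assumption made for illustration.

```python
class GetattrHandler:
    """Old style: every unknown attribute becomes an op via __getattr__."""

    def __getattr__(self, name):
        def inner(*args, **kwargs):
            return f"{name}{args}"

        return inner


class DefaultHandlerSketch:
    """New style: each op is a real method that forwards to _default()."""

    def _default(self, name, args, kwargs):
        raise NotImplementedError

    def add(self, a, b):
        return self._default("add", (a, b), {})

    def mul(self, a, b):
        return self._default("mul", (a, b), {})


class CountingHandler(DefaultHandlerSketch):
    """Subclass only has to override _default to handle every op."""

    def __init__(self):
        self.counts = {}

    def _default(self, name, args, kwargs):
        self.counts[name] = self.counts.get(name, 0) + 1
        return f"{name}{args}"


h = CountingHandler()
result = h.mul(h.add("a", "b"), "c")
```

With explicit methods, a misspelled op name fails as a normal `AttributeError` and type checkers can see the full op surface, whereas the `__getattr__` version accepts any name.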

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146252
Approved by: https://github.com/yanboliang
2025-02-08 18:00:00 +00:00
46e83bb637 Fix linter F821 error (#146665)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146665
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-08 07:19:37 +00:00
a3ca5c7f4e remove incorrect warnings from min/max documentation (#146725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146725
Approved by: https://github.com/wdvr, https://github.com/malfet
2025-02-08 05:10:08 +00:00
63c2909ae3 [ONNX] Adjust and add deprecation messages (#146639)
Adjust and add deprecation messages to torch.onnx utilities and verification methods, because they relate only to TorchScript and are obsolete.

Removed unused `_exporter_states.py` and removed the internal deprecation module in favor of the typing_extensions deprecated decorator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146639
Approved by: https://github.com/titaiwangms
2025-02-08 05:09:16 +00:00
2328dcccb9 [MPSInductor] Implement Welford reduction (#146703)
Still a work in progress: the fallback works as expected, but the custom shader does not yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146703
Approved by: https://github.com/jansel, https://github.com/dcci
2025-02-08 05:00:00 +00:00
69feef5a94 Fix broken meta function for flex-attention backwards (#146563)
# Summary

Fixes https://github.com/pytorch/pytorch/issues/146377

So what was the original problem: we were codegening a really weird epilogue:

```Python
        # first compute broadcasted dk of shape [Bq, Hkv, KV_LEN, V_HEAD_DIM]
        # then reduce to dk of shape [Bkv, Hkv, KV_LEN, V_HEAD_DIM]
        xindex = index_k + 64*index_n + 64*off_hkv*ks2 + 128*off_zq*ks2
        tl.store(out_ptr0 + (tl.broadcast_to(index_k + 64*index_n + off_hkv*ks1, dk.shape)), dk, mask)
        x5 = (xindex % ks3)
        tmp2 = tl.load(out_ptr0 + (x5 + ks1*off_hkv), mask, eviction_policy='evict_last')
        tl.store(out_ptr1 + (tl.broadcast_to(xindex, dk.shape)), tmp2, mask)
 ```

 This epilogue was writing and then reading from overlapping regions of memory causing a race condition.

### Why were we generating this epilogue

During the lowering we created a buffer with a different size/stride from the expected return strides. I think this added an implicit node for permuting this wrongly strided output to the expected one from the meta func. For some reason the scheduler thought it was okay to fuse this into the epilogue; I don't know why.

This fixes the broken meta func and the original repro. I will add a test, but it is hard to pop; better than nothing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146563
Approved by: https://github.com/Chillee
2025-02-08 04:13:52 +00:00
9c78fb920d Fix assertion failure in gemm template lowering (#146353)
Summary:
This commit fixes a crash in the gemm template lowering caused by hitting an [assert](fd515e4f59/torch/_inductor/codegen/common.py (L1181)) that a buffer was previously removed.

The assert triggers because in the first gemm lowering we use a local accumulation buffer, which causes the original buffer name to be added to the `removed_buffers` set. Then in the next gemm lowering we use the global buffer for accumulation, but that buffer name is already in the `removed_buffers` set.

The fix is to add a unique suffix to the buffer name to avoid triggering the assert from different gemm lowerings.

Differential Revision: D68814625

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146353
Approved by: https://github.com/leslie-fang-intel, https://github.com/frost-intel, https://github.com/hl475
2025-02-08 01:52:20 +00:00
cyy
6cb2f737ee Enable Windows tests (#146666)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146666
Approved by: https://github.com/albanD
2025-02-08 00:55:20 +00:00
0ab67299c3 [MPS] lu unpack (#146681)
Implements lu unpack function on MPS. Haven't added new tests because they are covered by removing the lu_unpack from UNIMPLEMENTED_XFAILLIST in test_mps with `test_output_match` function
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146681
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-08 00:16:17 +00:00
803661526e Update ET pin to 41e7ffa (#145831)
ExecuTorch pin is failing to update due to a change in the executorch install scripts. The previous install_requirements.sh now only installs dependencies and does not build ET. There is a new script - install_executorch.sh, which both installs dependencies and builds the framework.

This PR updates the relevant CI logic to use install_executorch.sh and bumps the pin forward. This should fix the stuck ET pin.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145831
Approved by: https://github.com/metascroy
2025-02-07 23:52:20 +00:00
dcac3c3e06 [MTIA] (2/n) Implement PyTorch APIs to query/reset device peak memory usage (#146659)
Summary:
Public summary (shared with Github): This diff implements the correct version of the PyTorch API "max_memory_allocated".

Nit: The file previously contained two unit tests with the same name (due to wrong revert); I deleted a deprecated one to revamp the correct version.

Test Plan:
```
buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_max_memory_allocated
```

https://www.internalfb.com/intern/testinfra/testrun/12103424065182810

Reviewed By: yuhc

Differential Revision: D68988435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146659
Approved by: https://github.com/nautsimon
2025-02-07 23:06:35 +00:00
fa34128435 revert PTD's change that leads to signature mismatch of printNcclCommProxyTrace (#146453)
Summary: D68801098 introduced this function signature mismatch issue for printNcclCommProxyTrace. Revert it so that trunk build can pass.

Test Plan:
With the change, build of APS model using rcclexp can now pass:
`sh scripts/ltian/run_jobs/fb_fm_v2/run_fb_fm_v2_job.sh -h T20_GTT_MI300X -n 16 -b 1024 -t [2024-12-06] -d ai_infra_ngs -e ai_infra_training_rnd_tc -x 0`

Reviewed By: c-p-i-o

Differential Revision: D69149588

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146453
Approved by: https://github.com/c-p-i-o
2025-02-07 22:43:52 +00:00
103c8b44bc move and fix logic to update unbacked bindings (#146115)
Summary:
Previously we were touching up unbacked bindings between Dynamo and AOTAutograd in strict export, but the logic had a bug: if an unbacked symint gets substituted by a backed symint, we would put the backed symint in the unbacked bindings (the check `is_symbol` was not enough here).

This PR fixes this logic, and moreover, moves it into the serializer instead, because we don't need this adjustment outside serde.

Test Plan: added test

 D68880766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146115
Approved by: https://github.com/pianpwk
2025-02-07 22:41:19 +00:00
45d35f5f5a Clean up op BC check list (#146577)
Summary: Remove the expired ones

Test Plan: ci

Differential Revision: D69226556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146577
Approved by: https://github.com/hl475
2025-02-07 22:40:49 +00:00
908133f682 [TreeSpec] Add custom comparision function (#146442)
Summary:
https://github.com/pytorch/pytorch/pull/145815 used caching for the treespec_loads calculation to speed up AOTI module calls.

However, this made tests flaky when comparing TreeSpec for objects defined in a local scope, e.g. 'test_export.TestExport.test_pytree_register_nested_data_class.<locals>.Inner'

Type comparison will yield False when local scopes are different due to lru_cache.

Since this comparison is only used for testing purpose, we will only test if str(type) are equal.
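
The underlying Python behavior can be reproduced without any TreeSpec machinery. The sketch below uses hypothetical names (`make_inner`, `types_match_for_tests`) and only demonstrates why identity comparison of locally defined classes fails while their printed form agrees.

```python
def make_inner():
    # Each call creates a brand-new class object with the same qualified name.
    class Inner:
        pass

    return Inner


first, second = make_inner(), make_inner()

same_object = first is second          # distinct class objects
same_name = str(first) == str(second)  # identical module and qualname


def types_match_for_tests(a, b):
    """Hypothetical test-only comparison that falls back to the printed type."""
    return a is b or str(a) == str(b)
```

Comparing by `str(type)` is looser than identity, which is presumably why the text restricts it to testing.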

Test Plan:
```
PYTORCH_TEST_WITH_ROCM=1 python test/export/test_retraceability.py
```

Differential Revision: D69137706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146442
Approved by: https://github.com/angelayi
2025-02-07 22:39:21 +00:00
91dfa82981 [FlexAttention] Fix dynamic shapes in max-autotune (#146657)
# Fixes
https://github.com/pytorch/pytorch/issues/146624

### Updated

From offline discussion going w/ sizehint

However this does incur guards. I couldn't really think of a fancy way to do this. I was going to do `V.graph.sizevars.size_hint` w/ some default for num blocks, but we ultimately need some information about the input.

I am also not sure if size_hint is ALWAYS guaranteed to return the runtime value. I think it would be okay to not support unbacked symints (maybe).

 For instance, in the repro, we quickly hit the recompile limit.
 ```Shell
 torch._dynamo hit config.recompile_limit (8)
   function: 'flex_attention' (/home/drisspg/meta/pytorch/torch/nn/attention/flex_attention.py:1161)
   last reason: 0/0: tensor 'L['key']' size mismatch at index 2. expected 1, actual 546
To log all recompilation reasons, use TORCH_LOGS="recompiles".
To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146657
Approved by: https://github.com/Chillee, https://github.com/yanboliang
2025-02-07 22:34:28 +00:00
579b9f2ed9 [inductor] Better exception error messages for cache_on_self (#146652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146652
Approved by: https://github.com/yanboliang
2025-02-07 21:22:21 +00:00
04ce02182b [inductor] Use index_dtype (int32/int64 depending on size) for argmax accumulators (#146651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146651
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-02-07 21:21:21 +00:00
80a1696679 Revert "[cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)"
This reverts commit 5f0901e57341eb9865102c1caa3d986a0c4ae3bd.

Reverted https://github.com/pytorch/pytorch/pull/145130 on behalf of https://github.com/atalman due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145130#issuecomment-2644122846))
2025-02-07 21:04:23 +00:00
206ad9f4ad [cutlass backend] Set no fallback to aten, disabled a few broken tests, default to test on H100 (#146554)
This PR does a few things:
* set fall back to aten to False for most tests. Without this, a lot of tests would fail silently since they just use aten
* Disable two subprocess related broken tests. They would crash in subprocess. More investigation needed.
* remove/disable the tests on A100. Let me elaborate a bit more.

There are two types of A100 tests.
* normal tests that also test A100. e.g., mm, addmm, bmm. However, since the shift to cutlass 3x, they don't work anymore. GenerateSM80 would generate ops that use cutlass 2x, but they get filtered out since they are of GemmKind.Universal but only GemmKind.Universal3x are supported in the 3x template.
* tests for A100 only. The mixed mm and sparse semi structure tests have been failing for a while with "TypeError: can't multiply sequence by non-int of type 'str'". Disabled them for now. Do let us know if you care about them @alexsamardzic

Differential Revision: D69209929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146554
Approved by: https://github.com/chenyang78
2025-02-07 19:59:28 +00:00
f17109bd96 Revert "windows Magma build for cu128 (#146653)"
This reverts commit 9e27d36e2b2a4f037a7e448c2f87a9ebb0d6e628.

Reverted https://github.com/pytorch/pytorch/pull/146653 on behalf of https://github.com/atalman due to Broke nightly builds ([comment](https://github.com/pytorch/pytorch/pull/146653#issuecomment-2643882976))
2025-02-07 19:37:16 +00:00
bc0191802f [inductor] add size-asserts for fallback ops (#145904)
Fix https://github.com/pytorch/pytorch/issues/144717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145904
Approved by: https://github.com/jansel
2025-02-07 18:44:32 +00:00
b60f630de8 fuzzer: disable "fail_on_recompile_limit_hit" and "suppress_errors" (#146650)
Summary:
needed for https://github.com/pytorch/pytorch/pull/146513
Test Plan:
the existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146650
Approved by: https://github.com/xmfan
2025-02-07 18:25:00 +00:00
9e27d36e2b windows Magma build for cu128 (#146653)
https://github.com/pytorch/pytorch/issues/145570

Removing `.ci/pytorch/windows/internal/cuda_install.bat` as it is a duplicate of `.github/scripts/windows/cuda_install.bat`. The latter is the one in use: https://github.com/pytorch/pytorch/pull/146653/files#diff-613791f266f2f7b81148ca8f447b0cd6c6544f824f5f46a78a2794006c78957bR8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146653
Approved by: https://github.com/atalman
2025-02-07 18:09:30 +00:00
23af9dde4d distributed/serialization: add experimental streaming torch.save/load methods (#146555)
Summary:

This is intended for use with torchft when we need to do a streaming state dict transfer. This is strictly superior to the prior streaming method in torchft as this supports all tensor subclasses such as DTensor.

This supports 100% of the inputs to torch.save/load but is not wire compatible nor intended to have any backwards compatibility.

Security wise this fully supports weights_only and defaults to True. It does use pickle for some metadata but uses weights_only for the metadata.

Adapted from:

https://github.com/pytorch/torchft/pull/101

https://github.com/pytorch/torchft/pull/54

Test Plan:

pytest test/distributed/test_serialization.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146555
Approved by: https://github.com/fegin, https://github.com/mikaylagawarecki

Co-authored-by: Krishn Parasar <76171905+Krishn1412@users.noreply.github.com>
2025-02-07 18:08:11 +00:00
68631f6e87 PyWork: preserve Python reference counting when used in functional collectives (#146376)
@fegin  found an issue where torchft is not compatible with functional collectives.

Found in https://github.com/pytorch/torchtitan/pull/806

The root cause is that PyProcessGroup/PyWork are not compatible with functional collectives due to a nasty ownership bug.

PyWork relies on a pybind trampoline to propagate requests to Python. Unfortunately, the way pybind works is that the Python object owns the C++ object rather than there being some form of shared ownership. As a result, the PyWork Python object is collected when it is returned to C++ from the PyProcessGroup, while the C++ PyWork object still exists. When the PyWork object is then used, this causes a deadlock because the corresponding Python object no longer exists.

To solve this, we introduce a new `PyWorkHolder` class which holds a reference to the `py::object` as well as the trampoline class. This resolves any dependency issues since we can now hold ownership in C++ to both the Python and C++ objects.

To make this cleaner we introduce a `WORK_OVERRIDE` macro which is a patched version of `PYBIND11_OVERRIDE` that returns a `PyWorkHolder` rather than just `PyWork` and use for all collectives in PyProcessGroup.
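
The actual fix lives in C++/pybind11, but the shape of the ownership problem can be shown with a loose Python analogy built on `weakref`. Everything here (`PyWorkSketch`, `NonOwningHandle`, `HolderSketch`) is hypothetical, and the sketch raises an error where the real bug deadlocks.

```python
import weakref


class PyWorkSketch:
    """Stands in for the Python-side work object."""

    def wait(self):
        return True


class NonOwningHandle:
    """Analogue of a C++ object that does not keep its Python peer alive."""

    def __init__(self, work):
        self._work = weakref.ref(work)

    def wait(self):
        work = self._work()
        if work is None:
            raise RuntimeError("Python work object was already collected")
        return work.wait()


class HolderSketch(NonOwningHandle):
    """Analogue of a holder that also keeps a strong reference."""

    def __init__(self, work):
        super().__init__(work)
        self._keepalive = work


def returned_from_process_group(handle_cls):
    # The only other reference to the work object dies when this returns.
    return handle_cls(PyWorkSketch())
```

Calling `wait()` on a `NonOwningHandle` obtained this way fails because nothing kept the work object alive, while the `HolderSketch` variant succeeds since it owns both sides.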

Test plan:

```
cd pytorch
pytest test/distributed/test_c10d_functional_native.py
```

```
cd torchft
pytest torchft/process_group_test.py -k functional -v -x -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146376
Approved by: https://github.com/yifuwang
2025-02-07 18:07:53 +00:00
76c8a2dc48 Fix get_top() to return the base level event of the stack, not the most recently started event (#146649)
`get_top()` is really confusing when talking about a stack, because it can mean either the most recently started event on the stack or the top-level event in Perfetto (which displays the stack upside down). Rename it to `get_outermost` and fix the associated bug so that it returns the correct value from the stack.
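
The two possible meanings of "top" are easy to see on a plain list-backed stack. This is a sketch with a hypothetical `EventStackSketch` class; apart from `dynamo`, the event names are made up for the example.

```python
class EventStackSketch:
    def __init__(self):
        self._stack = []

    def push(self, name):
        self._stack.append(name)

    def pop(self):
        return self._stack.pop()

    def get_innermost(self):
        """Most recently started event (the usual 'top' of a stack)."""
        return self._stack[-1] if self._stack else None

    def get_outermost(self):
        """Base-level event, i.e. the first one pushed."""
        return self._stack[0] if self._stack else None


events = EventStackSketch()
for name in ("dynamo", "backend_compile", "inductor_compile"):
    events.push(name)
```

Metadata meant for the whole compile belongs on `get_outermost()`; attaching it to `get_innermost()` would place it on whichever nested event happens to be running.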

Running nanogpt now puts `guard_latency_us` correctly in the `dynamo` event:
```
tlp python benchmarks/dynamo/torchbench.py --backend inductor --device cuda --only nanogpt --amp --cold-start-latency --print-compilation-time --training --performance 2>&1 --dynamic-shapes | tee out.log
```
<img width="1281" alt="image" src="https://github.com/user-attachments/assets/4eeb371a-4d81-415a-acc4-7d303a4b2a93" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146649
Approved by: https://github.com/masnesral, https://github.com/anijain2305
2025-02-07 18:04:50 +00:00
f138b18d18 [inductor/profiler] add kernel kwargs instrumentation (#145573)
## About

As above, record the kernel launch kwargs. These tend to be constexpr arguments to Triton kernels, such as block size.

## Test program

Note, install triton before proceeding (pip install triton)

triton_test.py>>>
```
import torch
from torch.profiler import profile, ProfilerActivity

def foo(x, y):
    a = torch.sin(x)
    b = torch.cos(y)
    return a + b

def main():
    x = torch.randn(10, 10).cuda()
    y = torch.randn(10, 10).cuda()
    opt_foo = torch.compile(foo)
    z = opt_foo(x, y)

    # Profile the kernel function on the GPU
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True
    ) as prof:
        z = opt_foo(x, y)

    # Export the trace to a file
    prof.export_chrome_trace("my_kernel_trace.json")

if __name__ == "__main__":
    main()
```

Run it and we should get a trace file my_kernel_trace.json

Output has triton event with the kernel_kwargs attribute.
```
  {
    "ph": "X", "cat": "cpu_op", "name": "triton_poi_fused_add_cos_sin_0", "pid": 2480815, "tid": 2480815,
    "ts": 2045246693014.959, "dur": 75.662,
    "args": {
      ...
      "kernel_backend": "triton",
      "num_warps": 4,
      "kernel_kwargs": "XBLOCK=128", "num_stages": 1, "grid": "grid(100,)",
      "kernel_file": "/tmp/torchinductor_bcoutinho/ow/cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor.py",
      "kernel_hash": "cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor"
    }
  },
```
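
A small sketch of how the new attribute could be read back from an exported trace. It assumes the file follows the Chrome trace layout with a top-level `traceEvents` list, and that multiple kwargs (if present) are comma-separated `key=value` pairs; both are assumptions, and `kernel_kwargs_by_name` is a hypothetical helper. The sample event is trimmed from the output shown above.

```python
import json

# Inline sample so the sketch runs without a real trace file.
SAMPLE_TRACE = json.dumps(
    {
        "traceEvents": [
            {
                "ph": "X",
                "cat": "cpu_op",
                "name": "triton_poi_fused_add_cos_sin_0",
                "args": {
                    "kernel_backend": "triton",
                    "num_warps": 4,
                    "kernel_kwargs": "XBLOCK=128",
                },
            },
            {"ph": "X", "cat": "cpu_op", "name": "aten::sin", "args": {}},
        ]
    }
)


def kernel_kwargs_by_name(trace_text):
    """Map event name -> parsed kernel_kwargs for events that carry them."""
    events = json.loads(trace_text).get("traceEvents", [])
    out = {}
    for event in events:
        raw = event.get("args", {}).get("kernel_kwargs")
        if raw is None:
            continue
        pairs = (item.split("=", 1) for item in raw.split(",") if "=" in item)
        out[event["name"]] = {key.strip(): value.strip() for key, value in pairs}
    return out


parsed = kernel_kwargs_by_name(SAMPLE_TRACE)
```

For a real run, the same function could be applied to the contents of `my_kernel_trace.json`.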

## Unit Test
Updated unit test:
```
pytest test/inductor/test_profiler.py -k test_pt2_triton_attributes
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145573
Approved by: https://github.com/davidberard98, https://github.com/jansel
2025-02-07 17:44:30 +00:00
ee45ea599d [dynamo] Actionable message on recompilations for fullgraph=True (#146550)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146550
Approved by: https://github.com/zou3519, https://github.com/StrongerXi
ghstack dependencies: #146553
2025-02-07 17:28:43 +00:00
fa0956951c [dynamo] Remove the suggestion to use suppress_errors on compiler error (#146553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146553
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-02-07 17:28:43 +00:00
cyy
25aa7ca62d Cleanup CallOnce.h (#146700)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146700
Approved by: https://github.com/albanD
2025-02-07 16:44:45 +00:00
076717785c Revert "[while_loop][inductor] support sym expression as cond_fn output (#146222)"
This reverts commit 5ecdc428b230ab5ba44a90678f1c905e314f6ccb.

Reverted https://github.com/pytorch/pytorch/pull/146222 on behalf of https://github.com/atalman due to Internal failure, please see associated diff ([comment](https://github.com/pytorch/pytorch/pull/146222#issuecomment-2643379933))
2025-02-07 16:19:41 +00:00
eqy
5d7532140f [CUDA][CUDA Graphs] Fix debug mode warning message (#145996)
The real method is `enable_debug_mode()`, `_cuda_enable_graphs_debug_mode` does not exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145996
Approved by: https://github.com/ptrblck, https://github.com/eellison
2025-02-07 08:04:49 +00:00
002accfb8d Check meta strides for expanded dims in effn_attn_bias (#146054)
With `_scaled_dot_product_efficient_attention.default`, we have lowering logic to realize the bias to specific alignment constraints. Some of the dims can be expanded, and we need to keep the stride of those dims at 0 to avoid materializing a larger tensor than we need. Previously we checked the stride of the tensor, but that does not work if the tensor is not realized, so we should check the strides of the meta as well.

Note: getting the exact of realizing/slicing/requiring_exact_strides was a little tricky. I commented to @exclamaforte on an example unable-to-fuse message you get if you do it incorrectly.

Fix for https://github.com/pytorch/pytorch/issues/145760

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146054
Approved by: https://github.com/shunting314
2025-02-07 06:35:57 +00:00
71e8a2bda4 Expand inductor codegen dtype asserts, fix scan (#146067)
We were codegen'ing intermediary dtype asserts in some places but not all. This expands the assertions and fixes a newly failing assertion in

`TORCHINDUCTOR_COMPILE_THREADS=1 TORCH_LOGS="output_code" PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCUDA.test_comprehensive_logcumsumexp_cuda_float16` for scan.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146067
Approved by: https://github.com/shunting314, https://github.com/jansel
2025-02-07 06:35:47 +00:00
cyy
f6bd20e8a2 Enable TemporaryFileName tests on Windows (#146311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146311
Approved by: https://github.com/albanD
2025-02-07 06:06:18 +00:00
1c872803cb [export][dynamic shapes] log provenance for locals & symbols for non-strict (#143378)
Adds `dtrace_structured` logging so when a guard or real-tensor propagation assert is added, the relevant user code with local symbolic values & free symbols are logged, e.g. from the draft export CLI report (soon to be added to tlparse):
1. Guard added:
```
1. Constraint violation error.
    The specified input dynamic_shapes spec was found to be incorrect during tracing.
    Specifically, this guard was added: Eq(s0, 3), where {'s0': "L['args'][0][0].size()[0]"}.
    This occured at the following stacktrace:
        File /data/users/pianpwk/pytorch/test/export/test_draft_export.py, lineno 267, in forward:
            assert a.shape[0] == 3

        Locals:
            a: Tensor(shape: torch.Size([s0, 3]), stride: (3, 1), storage_offset: 0)

        Symbols:
           s0: L['args'][0][0].size()[0]
...
```

2. Real tensor propagation:
```
1. Data dependent error.
    When exporting, we were unable to evaluate the value of `u2 < 0`.
    This was encountered 8 times.
    This occurred at the following stacktrace:
        File /data/users/pianpwk/pytorch/test/export/test_draft_export.py, lineno 217, in forward:
            return res[:c_item]

        Locals:
            res: Tensor(shape: torch.Size([u0, u1]), stride: (Max(1, u1), 1), storage_offset: 0)
            c_item: u2
...
```

Currently the values are extracted from the traceback, and are only valid for non-strict; strict seems to require storing & fakifying locals in the frames reported by `TracingContext`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143378
Approved by: https://github.com/avikchaudhuri, https://github.com/bobrenjc93
2025-02-07 05:46:05 +00:00
bc40ccf6aa [BE]: Inline special functions for MPS (#146627)
These header functions should be inlined for consistency and to avoid translation unit / symbol issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146627
Approved by: https://github.com/dcci
2025-02-07 05:15:15 +00:00
ecf44d1002 Fixed a typo in dataset.py (#146600)
Changed word 'Mult' to 'Multi'.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146600
Approved by: https://github.com/Skylion007
2025-02-07 05:09:51 +00:00
41e6d189a3 [ONNX] Create deprecation warning on dynamo_export (#146425)
Reland #146003

Deprecation of `torch.onnx.dynamo_export`:

* [`torch/onnx/_internal/_exporter_legacy.py`]: Added deprecation warnings to the `OnnxRegistry`, `ExportOptions`, `ONNXRuntimeOptions`, and `dynamo_export` functions, indicating that `torch.onnx.dynamo_export` is deprecated since version 2.6.0 and should be replaced with `torch.onnx.export(..., dynamo=True)`.
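
As a generic illustration of the deprecation pattern (not the code from this PR), a warning-emitting decorator can be written with the standard library alone. `deprecated_sketch`, `old_export`, and `new_export` are hypothetical names; the real change targets `torch.onnx.dynamo_export` and `torch.onnx.export(..., dynamo=True)`.

```python
import functools
import warnings


def deprecated_sketch(message):
    """Wrap a function so each call emits a DeprecationWarning."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # stacklevel=2 points the warning at the caller, not the wrapper.
            warnings.warn(message, DeprecationWarning, stacklevel=2)
            return fn(*args, **kwargs)

        return wrapper

    return decorator


def new_export(model, dynamo=False):
    return {"model": model, "dynamo": dynamo}


@deprecated_sketch("old_export is deprecated; use new_export(..., dynamo=True) instead.")
def old_export(model):
    return new_export(model, dynamo=True)
```

Callers keep working unchanged and see the warning once they enable `DeprecationWarning` display, which gives them a migration window.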

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146425
Approved by: https://github.com/titaiwangms, https://github.com/atalman
2025-02-07 04:20:46 +00:00
cyy
fa0592b568 Remove some NOLINT (#146610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146610
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-02-07 01:50:06 +00:00
624d94bdb8 [MPS] Extend torch.special.sinc to complex (#146648)
And to integral data types as well

Was too lazy to deduce the formula myself(or write a sympy script), but ChatGPT did a decent job of doing it, though it forgot that input must be multiplied by $$\pi$$:
```math
\text{Re}\left(\text{sinc}(x + i y)\right) = \frac{\sin(x)\cosh(y) x + \cos(x)\sinh(y) y}{x^2 + y^2}
```
```math
\text{Im}\left(\text{sinc}(x + i y)\right) = \frac{\cos(x)\sinh(y) x - \sin(x)\cosh(y) y}{x^2 + y^2}
```
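
A real/imaginary split of complex sinc can be sanity-checked against `cmath`. The sketch below is not the Metal kernel; it assumes the normalized definition sinc(z) = sin(πz)/(πz) and takes its signs from expanding sin(x + iy) = sin(x)cosh(y) + i·cos(x)sinh(y) and multiplying by the conjugate x − iy. `sinc_split` and `sinc_reference` are hypothetical names.

```python
import cmath
import math


def sinc_split(z):
    """sinc(z) = sin(pi*z) / (pi*z), computed from real and imaginary parts."""
    x, y = math.pi * z.real, math.pi * z.imag  # input scaled by pi first
    denom = x * x + y * y
    if denom == 0.0:
        return complex(1.0, 0.0)  # limit of sin(a)/a as a -> 0
    s, c = math.sin(x), math.cos(x)
    sh, ch = math.sinh(y), math.cosh(y)
    re = (s * ch * x + c * sh * y) / denom
    im = (c * sh * x - s * ch * y) / denom
    return complex(re, im)


def sinc_reference(z):
    """Direct evaluation with complex arithmetic, used as the oracle."""
    a = math.pi * z
    return cmath.sin(a) / a if a != 0 else complex(1.0, 0.0)
```

Comparing the two on a handful of points (including a purely imaginary input, where the result must be real and positive) is a quick way to catch sign mistakes in the split form.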
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146648
Approved by: https://github.com/dcci
2025-02-07 01:12:37 +00:00
9ea1823f96 [ROCm][Windows] Remove external linkage from an anonymous namespace (#146607)
Fixes a clang-cl compiler error caused by an attempt to export a symbol that has no external linkage, since it's declared within a local anonymous namespace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146607
Approved by: https://github.com/jeffdaily
2025-02-06 23:48:20 +00:00
3379c65de6 [ROCm][Windows] Fix unrecognized _BitScanReverse intrinsic (#146606)
Since PyTorch with ROCm on Windows is built with clang-cl and not MSVC, the intrinsics used are different and hence an attempt to compile with `_BitScanReverse` fails. However, a call to `__builtin_clz` which follows in the subsequent preprocessor branch is correctly recognized by the clang-cl compiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146606
Approved by: https://github.com/jeffdaily
2025-02-06 23:47:18 +00:00
0d8fc00e0a [ROCm][Windows] Fix isnan integer overload errors on MS STL (#146605)
Microsoft's STL has a problem with integer overloads of std::fpclassify used by std::isnan and std::isinf. These functions need a cast to double to function correctly. Otherwise, the call fails with "ambiguous call to overloaded function" error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146605
Approved by: https://github.com/jeffdaily
2025-02-06 23:44:11 +00:00
3f5ed05688 [Windows][ROCm] Fix c10 hip tests (#146599)
- Solves a problem related to .hip source files being ignored by the build system when HIP language is not enabled in CMake.
- Also ensures that the test executables link to an appropriate CRT Runtime Library and hence have access to all the necessary symbols. Previously, there were many problems related to linkage errors.
- Moves part of Linux-related hipBLASLt changes in `LoadHIP.cmake` under the UNIX conditional branch, as these aren't supported on Windows yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146599
Approved by: https://github.com/jeffdaily
2025-02-06 23:41:25 +00:00
e13a544b54 fix tf32 issue in test_inductor_freezing.py unit tests (#146444)
Test is hitting numerical mismatches in NVIDIA internal CI. Add tf32_on_and_off decorator, update check to assertEqual

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146444
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/eqy
2025-02-06 23:34:28 +00:00
eqy
7bd7f735d4 [CUDA][SDPA] Compute reference in test_triton_scaled_dot_product_attention_block_size_16_cuda_float32 in float64 (#146461)
Seems to currently fail with mismatches in the 1e-4 range presumably due to sdpa calling into the `MATH` backend here which is less fused than a triton kernel. Doing the ref computation in `float64` appears to fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146461
Approved by: https://github.com/drisspg
2025-02-06 23:28:56 +00:00
2834fe5e93 [inductor] Fix test error test_force_cutlass_backend_aoti_cexpr_codegen (#146564)
Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cutlass_backend -- --exact 'caffe2/test/inductor:cutlass_backend - test_force_cutlass_backend_aoti_cexpr_codegen (caffe2.test.inductor.test_cutlass_backend.TestCutlassBackend)'
```

Differential Revision: D69219873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146564
Approved by: https://github.com/yanboliang
2025-02-06 23:02:41 +00:00
0c81b398ab [BE][Ez]: Enable some additional pylint ruff warnings (#146609)
Some additional code hardening with some pylint warnings in ruff that usually indicate bugs.  All code currently conforms nicely to them, but this will ensure these errors can be detected statically before running / creating tests.

The following rules are enabled:
* Ban walrus operators where they would have no effect over regular assignment, making the intention clearer.
* Statically check for the common error of forgetting to put parens after the `super` call, which will cause an attribute error
* Ban bad string literal args to builtins `open`
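
Two of these failure modes are easy to reproduce at runtime. The snippet below is only an illustration of the bug classes, not code from this PR; the class and function names are made up for the example.

```python
class Base:
    def greet(self):
        return "hello"


class Child(Base):
    def greet_broken(self):
        # Bug: without parentheses, `super` is the builtin type itself,
        # which has no `greet` attribute.
        return super.greet()

    def greet_fixed(self):
        return super().greet()


def call_broken():
    """Return the name of the exception raised by the missing-parens call."""
    try:
        Child().greet_broken()
    except AttributeError as exc:
        return type(exc).__name__
    return "no error"


def open_with_bad_mode(path):
    """Return the name of the exception raised by an invalid mode string."""
    try:
        open(path, "rw")  # invalid: combines read and write without '+'
    except ValueError as exc:
        return type(exc).__name__
    return "no error"
```

Both mistakes only surface when the offending line actually runs, which is why catching them statically is useful.
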
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146609
Approved by: https://github.com/aorenste
2025-02-06 21:58:08 +00:00
99dd846672 [torch] fix builds for older pybind (#146630)
Summary:
some versions of pybind we build with don't have `py::set_error`.

So just use the underlying python C API.

Test Plan: unit tests

Differential Revision: D69254629

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146630
Approved by: https://github.com/colin2328, https://github.com/ngimel
2025-02-06 21:22:00 +00:00
3008368b12 Honor Dr.CI classification results on auto commit hash update (#146337)
Disabling `ignore_flaky_failures` was a safer choice, but it seems that this option doesn't work with the current state of the CI.  For example, https://github.com/pytorch/pytorch/pull/125806 hasn't been merged since May because there would always be a failure of one type or another.  This effectively disables the automated mechanism.

My proposal here is to relax this rule and allow the bot to merge auto commit hash updates with `@pytorchbot merge` like a regular PR.  Then we will at least have something working.  If this causes issues, we can revert it and try the longer route of improving CI reliability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146337
Approved by: https://github.com/clee2000
2025-02-06 20:33:38 +00:00
44b69b80c2 [ROCm][TunableOp] Future proof TunableOp unit test. (#146548)
TunableOp UT will fail because the regular expression in the test will not work for future versions of ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146548
Approved by: https://github.com/jeffdaily
2025-02-06 20:26:02 +00:00
5cc1b54a91 [2/N][cp][example] flex attention in context parallel (backward pass) (#146397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146397
Approved by: https://github.com/fegin
ghstack dependencies: #145896
2025-02-06 19:50:02 +00:00
6220c64aea [1/N][cp][example] flex attention in context parallel (forward pass) (#145896)
**Description**
This is an example of how FlexAttention can be used in a context parallel fashion. Right now it's only a flex_attention call with collectives added and has no load balancer, but we're about to add the missing parts step by step:
1. backward pass
2. static load balancing for causal masking
3. dynamic load balancing for other general maskings
4. automatic collective insertion solution
5. non-intrusive context parallel APIs

**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/tensor/examples/flex_attention_cp.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145896
Approved by: https://github.com/fegin, https://github.com/Skylion007
2025-02-06 19:50:02 +00:00
5ecdc428b2 [while_loop][inductor] support sym expression as cond_fn output (#146222)
As titled. Previously, we only supported a tensor output from cond_fn; this PR also allows a shape expr to be returned from cond_fn.

aoti generated output code looks like:
```
V0203 11:28:05.750000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]     bool buf7_cond_result;
....
(while_loop_cond_graph_0_arg2_1_handle);
V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]         buf7_cond_result = u0 + u1 < 10L;
V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]         if (!buf7_cond_result) break;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146222
Approved by: https://github.com/desertfire
ghstack dependencies: #146194, #146195
2025-02-06 19:39:55 +00:00
1b879fd0ea [Inductor] Add a JIT Inductor unit test following #146293 (#146529)
Summary: To follow up https://github.com/pytorch/pytorch/pull/146293, add a JIT Inductor unit test. Other Triton templates may need similar fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146529
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-02-06 19:21:15 +00:00
992388c100 [inductor] use ftz variant of exp (#146216)
Inductor generated exp op is compiled as the following ptx snippet by Triton.

```
        mul.f32         %f74, %f83, 0f3FB8AA3B;
        ex2.approx.f32 %f73, %f74;
```

But if we enable --use_fast_math in nvcc, exp in CUDA is compiled as
```
	mul.ftz.f32 	%f2, %f1, 0f3FB8AA3B;
	ex2.approx.ftz.f32 	%f3, %f2;
```
which uses the FTZ variant.

Let Inductor generate the FTZ variant if the use_fast_math config is true.

I see a 4% speedup for the two-pass prepare_softmax kernel; online softmax should be affected more since it does more computation per second (>10% in my testing).
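
For readers unfamiliar with the PTX above: the `mul` + `ex2.approx` pair computes exp(x) as 2^(x * log2(e)), and the immediate `0f3FB8AA3B` is log2(e) stored as a float32 bit pattern. The `.ftz` suffix additionally flushes subnormal inputs and outputs to zero. A small sketch that decodes the constant (it does not model the flush-to-zero behavior, and `exp_via_ex2` is a name made up for this example):

```python
import math
import struct

# Decode the PTX immediate 0f3FB8AA3B as a big-endian IEEE-754 float32.
LOG2_E_F32 = struct.unpack(">f", bytes.fromhex("3FB8AA3B"))[0]


def exp_via_ex2(x):
    """exp(x) computed as 2**(x * log2(e)), mirroring the mul + ex2 pair."""
    return 2.0 ** (x * LOG2_E_F32)


# The decoded constant agrees with log2(e) to float32 precision.
assert abs(LOG2_E_F32 - math.log2(math.e)) < 1e-7
assert math.isclose(exp_via_ex2(1.0), math.e, rel_tol=1e-6)
```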

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146216
Approved by: https://github.com/jansel, https://github.com/eellison
2025-02-06 19:12:35 +00:00
9ee506bd93 [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee, https://github.com/malfet
2025-02-06 19:04:50 +00:00
eqy
07b214402a [CUDA][B200] Update the number of threads in avg_pool2d backward for SM 10.0 (#145669)
Fixes register count issue when launching on SM 10.0, originally authored by @bilal2vec

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145669
Approved by: https://github.com/nWEIdia, https://github.com/ngimel
2025-02-06 18:57:33 +00:00
99ddbb4802 [dynamo][fullgraph] Do not skip frame with fullgraph=True (#146527)
Earlier, if there were no ops in the graph, fullgraph=True would also fall back to eager. This hides issues in testing, where we silently fall back to eager and do not test optimized bytecode. As can be seen in the PR, I had to fix several tests when I forced the use of optimized bytecode in the absence of a graph. A few failing tests will be fixed in follow-up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146527
Approved by: https://github.com/zou3519, https://github.com/StrongerXi
2025-02-06 18:56:07 +00:00
15b1ac3e86 Add torch.func.debug_unwrap (#146528)
Use it to unwrap any functorch-wrapped tensor. I don't recommend using
the output in a program since it breaks the semantics of the transforms,
but it seems useful for debugging.

I will note that some people have wanted to get intermediate values out
of an e.g. grad transform, so this might be a way to do that...

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146528
Approved by: https://github.com/Chillee
2025-02-06 18:48:09 +00:00
49082f9dba parallelize sort (#142391)
- use __gnu_parallel::sort for gcc compilations
- add a parallelized version of std::sort and std::stable_sort for non gcc compilations

Using __gnu_parallel::sort:
provides ~3.7x speed up for length 50000 sorts with NUM_THREADS=16 and NUM_THREADS=4 on aarch64

The performance is measured using the following script:
```python
import torch
import torch.autograd.profiler as profiler

torch.manual_seed(0)

N = 50000
x = torch.randn(N, dtype=torch.float)

with profiler.profile(with_stack=True, profile_memory=False, record_shapes=True) as prof:
    for i in range(1000):
        _, _ = torch.sort(x)

print(prof.key_averages(group_by_input_shape=True).table(sort_by='self_cpu_time_total', row_limit=10))

```
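
For intuition, the general shape of a chunked parallel sort (sort independent chunks concurrently, then merge the sorted runs) can be sketched as below. This is a structural illustration only, not the C++ implementation from this PR; `chunked_parallel_sort` is a hypothetical name, and under CPython's GIL the threads will not give the speedup that native code gets.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def chunked_parallel_sort(values, num_threads=4):
    """Sort `values` by sorting contiguous chunks concurrently, then merging."""
    if num_threads < 1:
        raise ValueError("num_threads must be >= 1")
    values = list(values)
    if len(values) < 2:
        return values
    chunk = -(-len(values) // num_threads)  # ceiling division
    pieces = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        sorted_pieces = list(pool.map(sorted, pieces))
    # heapq.merge performs a k-way merge of already-sorted inputs.
    return list(heapq.merge(*sorted_pieces))
```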

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142391
Approved by: https://github.com/malfet
2025-02-06 18:06:40 +00:00
7725d0ba12 [METAL] inline bfloat min/max (#146588)
After a recent commit 36c6e09528a7e071edecde083254da70cba26c95 , building from source with `python setup.py develop` leads to an error due to multiple symbols for min/max:
```
FAILED: caffe2/aten/src/ATen/kernels_bfloat.metallib /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen/kernels_bfloat.metallib
cd /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen && xcrun metallib -o kernels_bfloat.metallib BinaryKernel_31.air Bucketization_31.air CrossKernel_31.air FusedOptimizerOps_31.air Gamma_31.air HistogramKernel_31.air Im2Col_31.air Indexing_31.air LinearAlgebra_31.air Quantized_31.air RMSNorm_31.air RenormKernel_31.air Repeat_31.air SpecialOps_31.air TriangularOps_31.air UnaryKernel_31.air UnfoldBackward_31.air UpSample_31.air
LLVM ERROR: multiple symbols ('_ZN3c105metal3minIDF16bEEN5metal9enable_ifIXgssr5metalE19is_floating_point_vIT_EES4_E4typeES4_S4_')!
```

This PR fixes that.

@malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146588
Approved by: https://github.com/FFFrog, https://github.com/Skylion007, https://github.com/malfet
2025-02-06 17:57:31 +00:00
e2e265e27b [dynamo] Use polyfill to implement comparison operators (#144485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144485
Approved by: https://github.com/jansel
2025-02-06 17:27:07 +00:00
1090e58687 [mps] Remove a stale comment. (#146619)
The implementation of the function was moved to a shader, but the comment was left there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146619
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-02-06 17:25:29 +00:00
46390e9a37 [mps] Implement support for sinc() operator (inductor and eager). (#146539)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146539
Approved by: https://github.com/malfet, https://github.com/jansel

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-06 16:37:27 +00:00
a14c780c4c [dynamo] fix dynamo_compile logging on RecompileLimitExceeded (#146544)
Logging branches based on whether RecompileLimitExceeded was raised. If we exceed the limit, we fall back to eager before even trying to analyze the frame. We handle RecompileLimitExceeded outside of the try/catch/finally that edits the metrics context:
72405b0c0f/torch/_dynamo/convert_frame.py (L908-L935).

dynamo_config and recompile_reason are both known before we raise RecompileLimitExceeded, so we can add them with the rest of the "common" metrics, which are logged on metric_context decorator exit, which always runs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146544
Approved by: https://github.com/masnesral
2025-02-06 16:20:42 +00:00
6ff3383157 Enable CUPTI on Windows (#141454)
Fixes:
- https://github.com/pytorch/pytorch/issues/93855

The PR enables CUPTI on Windows and enables unit tests to check CUDA profiling events.
Additionally, the changes can be verified using the following script:

```
import torch
from torch.profiler import profile, ProfilerActivity

def check_cupti_enabled():
    # Check if CUDA is available
    if not torch.cuda.is_available():
        print("CUDA is not available on this system.")
        return False

    # Create a simple CUDA tensor
    x = torch.randn(1000, 1000, device="cuda")
    y = torch.randn(1000, 1000, device="cuda")

    try:
        # Use PyTorch profiler to perform a basic check
        with profile(activities=[ProfilerActivity.CUDA]) as prof:
            z = x @ y  # Simple CUDA operation

        # Print profiling results
        print("CUPTI is enabled and profiling works.")
        print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
        return True
    except RuntimeError as e:
        # If profiling fails, CUPTI is likely not set up correctly
        print("Error: CUPTI might not be enabled or accessible.")
        print(f"Details: {e}")
        return False

if __name__ == "__main__":
    if check_cupti_enabled():
        print("CUPTI is properly configured in PyTorch.")
    else:
        print("CUPTI is not configured correctly. Check your CUDA installation.")
```

Sample output:
```
CUPTI is enabled and profiling works.
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
     sgemm_128x128x8_NN_vec         0.00%       0.000us         0.00%       0.000us       0.000us       2.086ms       100.00%       2.086ms       2.086ms             1
                   cudaFree         9.67%       9.816ms         9.67%       9.816ms       9.816ms       0.000us         0.00%       0.000us       0.000us             1
     cudaDeviceGetAttribute         0.01%      10.000us         0.01%      10.000us       0.476us       0.000us         0.00%       0.000us       0.000us            21
    cudaGetDriverEntryPoint         0.00%       1.700us         0.00%       1.700us       0.850us       0.000us         0.00%       0.000us       0.000us             2
       cudaGetSymbolAddress        85.15%      86.438ms        85.15%      86.438ms      86.438ms       0.000us         0.00%       0.000us       0.000us             1
                 cudaMalloc         0.43%     433.300us         0.43%     433.300us     144.433us       0.000us         0.00%       0.000us       0.000us             3
           cudaLaunchKernel         2.61%       2.648ms         2.61%       2.648ms       2.648ms       0.000us         0.00%       0.000us       0.000us             1
      cudaDeviceSynchronize         2.13%       2.163ms         2.13%       2.163ms       2.163ms       0.000us         0.00%       0.000us       0.000us             1
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 101.511ms
Self CUDA time total: 2.086ms

CUPTI is properly configured in PyTorch.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141454
Approved by: https://github.com/malfet
2025-02-06 15:58:20 +00:00
FEI
8a4dd763b8 [CCA] remove TODO for hardware_destructive_interference_size (#145591)
@zyan0 @albanD  @houseroad

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145591
Approved by: https://github.com/albanD
2025-02-06 14:41:25 +00:00
ed309b9156 Re-add stft option to align window for center = false (#146379)
Skips advancing the fc window on https://github.com/pytorch/pytorch/pull/145437, since I just found that there were non-trivial efforts to do so a while ago that were eventually reverted: https://github.com/pytorch/pytorch/pull/73434

Works around the issue by keeping the stft sans center overload

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146379
Approved by: https://github.com/justinchuby, https://github.com/iseeyuan
2025-02-06 14:07:13 +00:00
1b79d47635 Revert "[dynamo] check for incompatible configs (#146513)"
This reverts commit aab7925418be561a8af6adfcb8cf009a8786c31b.

Reverted https://github.com/pytorch/pytorch/pull/146513 on behalf of https://github.com/atalman due to inductor/test_fuzzer.py::TestConfigFuzzer::test_config_fuzzer_dynamo_bisect [GH job link](https://github.com/pytorch/pytorch/actions/runs/13174131431/job/36772837627) [HUD commit link](4a545eb85d) ([comment](https://github.com/pytorch/pytorch/pull/146513#issuecomment-2639860568))
2025-02-06 13:42:25 +00:00
340cfe4f28 [dynamo][fbcode] Turn on inline_inbuilt_nn_modules (#145407)
As title.

Some internal testing at https://fb.workplace.com/groups/241460628989036/permalink/411650015303429/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145407
Approved by: https://github.com/ezyang, https://github.com/jansel
2025-02-06 13:18:35 +00:00
bd7d4fb2b5 Revert "[DTensor][Test] Create a simple unit test for tensordot (#146514)"
This reverts commit 1f8baf09ea598c97f30731ddb8328b6aa8d31fe9.

Reverted https://github.com/pytorch/pytorch/pull/146514 on behalf of https://github.com/albanD due to The lint failures that you ignored are real right? ([comment](https://github.com/pytorch/pytorch/pull/146514#issuecomment-2639554636))
2025-02-06 11:26:43 +00:00
4a545eb85d Fix torch.nn.functional.one_hot param num_classes optional description (#146470)
The `torch.nn.functional.one_hot` [documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.one_hot.html) describes param `num_classes` as not optional, but users can call the method without passing it.

![image](https://github.com/user-attachments/assets/4e6d4feb-691f-451f-95b5-4ac11bac7bc2)

```python
>>> import torch
>>> a = torch.arange(0, 5) % 3  # [0,1,2,0,1]
>>> torch.nn.functional.one_hot(a)
tensor([[1, 0, 0],
        [0, 1, 0],
        [0, 0, 1],
        [1, 0, 0],
        [0, 1, 0]])

```

`num_classes` has default value -1

93d98aca31/aten/src/ATen/native/native_functions.yaml (L6154-L6157)
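
The effect of the `-1` default can be sketched in plain Python: when `num_classes` is `-1`, the number of classes is inferred as one greater than the largest class value in the input. The function below is a list-based stand-in written for illustration, not the ATen implementation.

```python
def one_hot(indices, num_classes=-1):
    """List-based stand-in for one_hot; -1 means 'infer from the data'."""
    if num_classes == -1:
        num_classes = max(indices) + 1
    return [[1 if col == idx else 0 for col in range(num_classes)]
            for idx in indices]


# Mirrors the example above: values [0, 1, 2, 0, 1] give three classes.
rows = one_hot([i % 3 for i in range(5)])
assert rows == [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```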

## Test Result

![image](https://github.com/user-attachments/assets/2c7203b7-6226-4ebc-84c8-cbf912fc48e2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146470
Approved by: https://github.com/albanD
2025-02-06 07:48:05 +00:00
aab7925418 [dynamo] check for incompatible configs (#146513)
internal: https://fb.workplace.com/groups/1075192433118967/permalink/1599802033991335/

Assuming flags don't change during compilation, we shouldn't allow incompatible configs to be set at torch.compile wrap time.

Not in this PR: For flags that need to change during compilation, we'd have to be strict about where they can be used in the compile lifecycle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146513
Approved by: https://github.com/williamwen42
2025-02-06 07:39:52 +00:00
eqy
5f0901e573 [cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)
As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels.

This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:

+ caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`)
+ "free" workspace size bump for `cuBLASLt`: `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default, which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs; see also #120925
+ fixes broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work, here the cached/reused `cuBLAS` workspace seems to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider
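
As a reminder of what the variable encodes: `CUBLAS_WORKSPACE_CONFIG` takes `:SIZE:COUNT` pairs, with `SIZE` in KiB (for example `:4096:8` or `:16:8`). The helper below is a hypothetical sketch of that format for estimating the total workspace, not the parser PyTorch uses.

```python
def parse_workspace_config(value):
    """Parse ':SIZE:COUNT[:SIZE:COUNT...]' (SIZE in KiB) into total bytes."""
    fields = value.split(":")
    # A valid string starts with ':' and carries an even number of integers.
    if fields[0] != "" or len(fields) < 3 or len(fields) % 2 == 0:
        raise ValueError(f"unexpected format: {value!r}")
    numbers = [int(field) for field in fields[1:]]
    pairs = zip(numbers[0::2], numbers[1::2])
    return sum(size_kib * 1024 * count for size_kib, count in pairs)


# Eight 4096 KiB buffers: 32 MiB in total.
assert parse_workspace_config(":4096:8") == 32 * 1024 * 1024
```

The variable has to be set in the environment before the first cuBLAS handle is created for it to take effect.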

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130
Approved by: https://github.com/ngimel
2025-02-06 05:57:33 +00:00
36c6e09528 [MPSInductor] Fix min/max for bfloat16 (#146552)
By introducing a full specialization that upcasts everything to float, as bfloat does not have a native min/max

Test by running `test_min_max_reduction`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146552
Approved by: https://github.com/dcci
2025-02-06 05:15:00 +00:00
1f8baf09ea [DTensor][Test] Create a simple unit test for tensordot (#146514)
The dims and shape of the tensors are from a specific Shampoo use case. We want to create a unit test for it to make sure there are no regressions for this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146514
Approved by: https://github.com/tianyu-l
2025-02-06 05:09:34 +00:00
e01a5e9e1e Small improvements to NJT matrix multiplies (#146405)
Fixes #146404

Adds changes to the matmul and matmul_backward operation for nested jagged tensors, to support back propagation when the output is a regular strided tensor.
This required adding support for the nested matmul operation to work when the nested tensor wasn't 'self', i.e.
`A@B` where `A` isn't nested but `B` is.

The operation schemas had to be updated to reflect that either input can be a strided tensor instead (and the gradient), so an extra assertion is added in an edge case where neither input is nested.

Unit tests are also added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146405
Approved by: https://github.com/soulitzer, https://github.com/jbschlosser
2025-02-06 04:51:12 +00:00
389c5c0842 print out partial fx graph for all data-dependent errors (#146363)
The previous implementation didn't catch the following type of errors

```
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not extract specialized integer from data-dependent expression u2 (unhinted: u2).  (Size-like symbols: none)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146363
Approved by: https://github.com/angelayi, https://github.com/bdhirsh
ghstack dependencies: #146298, #146296
2025-02-06 04:21:34 +00:00
425804db2b [torch] fix exception types in custom class magic setattr/getattr (#146516)
Summary:
`c10::AttributeError` is not automatically converted to Python AttributeError, it needs some special macros (e.g. `HANDLE_TH_ERRORS`).

Some Python functions like `hasattr` rely on the type of the thrown exception to be correct.

We don't need the full generality of those macros, so just do a targeted error type conversion here.
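
The `hasattr` dependence is easy to see in pure Python: it only swallows `AttributeError`, so any other exception type raised during attribute lookup propagates to the caller. The classes below are stand-ins made up for this illustration, not the custom class bindings touched by this PR.

```python
class RaisesRuntimeError:
    def __getattr__(self, name):
        # Wrong exception type for a missing attribute.
        raise RuntimeError(f"no attribute {name}")


class RaisesAttributeError:
    def __getattr__(self, name):
        # Correct exception type: hasattr() turns this into False.
        raise AttributeError(name)


def probe(obj):
    """Return hasattr's answer, or a marker string if another error escapes."""
    try:
        return hasattr(obj, "missing")
    except RuntimeError:
        return "RuntimeError escaped hasattr"


assert probe(RaisesAttributeError()) is False
assert probe(RaisesRuntimeError()) == "RuntimeError escaped hasattr"
```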

Test Plan: added unit test

Differential Revision: D69197217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146516
Approved by: https://github.com/zdevito
2025-02-06 02:14:11 +00:00
3a6a203b98 [dynamic shapes][real tensor tracing] propagate unbacked hint when creating mod replacement (#146381)
Fixes data-dependent errors for 2 PT2I models in draft export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146381
Approved by: https://github.com/angelayi
2025-02-06 01:48:40 +00:00
c5062cca98 [export] make stack_trace optional in insert_custom_op_guards (#146438)
Summary: Fixes 1 PT2I exportability error

Test Plan: -

Differential Revision: D69132186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146438
Approved by: https://github.com/yiming0416, https://github.com/angelayi
2025-02-06 01:48:26 +00:00
6a985d8b2e Make inductor_utils.requires_gpu accept MPS (#145156)
Not yet ready to set HAS_GPU to true, but can unskip tests that require GPU
(Noticed while running test_mps_basics.py that `test_scalar_cpu_tensor_arg` is getting skipped)

- Replace `GPU_TYPE` with `self.device` in `test_custom_op_fixed_layout_sequential`, `test_inductor_layout_optimization_input_mutations`, `test_mutable_custom_op_fixed_layout2`, as otherwise the GPU tests are just running for _cpu suffixes.
- Tweak `test_tmp_not_defined_issue3` to work correctly on CPU, by defining `test_device` and `test_device_0`
- UnXFail `test_mutable_custom_op_fixed_layout2_dynamic_shapes` as it should just work on CPU
- Add `skip_if_no_triton` decorator and decorate `test_reduction_config_limit` with it, as it does not need CPU nor GPU, but rather a triton backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145156
Approved by: https://github.com/dcci, https://github.com/Skylion007, https://github.com/jansel
2025-02-06 01:14:36 +00:00
0dc03134d9 [MPS] linalg solve implementation (#146531)
Fixes #98222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146531
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-06 00:57:49 +00:00
495049860b [BE][Metal] Fix signed unsigned comparison warning (#146549)
I wish I knew how to extract Metal warnings during JIT compilation, but https://developer.apple.com/documentation/metal/mtldevice/makelibrary(source:options:)?changes=_7&language=objc is a lie, as `error:` stays `nil` unless shader compilation fails. But when it does, the following warnings are emitted:
```
program_source:666:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {
                     ~~~ ^ ~~~~
program_source:677:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {
                     ~~~ ^ ~~~~
program_source:688:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {
                     ~~~ ^ ~~~~
program_source:699:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {
                     ~~~ ^ ~~~~
program_source:710:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {
                     ~~~ ^ ~~~~
program_source:723:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146549
Approved by: https://github.com/dcci
2025-02-06 00:40:17 +00:00
e0cf519ade Revert "[inductor] Refactor op handlers part 2 (#146252)"
This reverts commit 13f0436abdff0386f33c7a8c25caa66e9af16dbd.

Reverted https://github.com/pytorch/pytorch/pull/146252 on behalf of https://github.com/atalman due to Sorry need to revert, failing internally ([comment](https://github.com/pytorch/pytorch/pull/146252#issuecomment-2638305417))
2025-02-06 00:04:04 +00:00
c7087d6b14 [BE][EZ][Metal] Do not pass tensor length as arg (#146522)
As all devices capable of running Metal-2 support nonuniform threadgroup sizes, see https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf for more detail
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146522
Approved by: https://github.com/dcci
ghstack dependencies: #146521
2025-02-06 00:03:41 +00:00
54ef029532 [BE][EZ][Metal] Mark constant inputs as constant (#146521)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146521
Approved by: https://github.com/dcci
2025-02-06 00:03:41 +00:00
2001066c61 Revert "[inductor] Refactor op handlers part 3 (#146254)"
This reverts commit 8e9bda8d895e80da0fe480d02e100bae8332ed57.

Reverted https://github.com/pytorch/pytorch/pull/146254 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146254#issuecomment-2638300857))
2025-02-05 23:59:50 +00:00
72405b0c0f [ca] refactor compile reasons and log to tlparse (#146386)
This PR accumulates compile reasons inside each CacheNode, and logs them to tlparse on each CA compile. This defines a compile as an autograd structure change, and a recompile as a dynamic shape change.

sample tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpdbo7gt/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

for compiles:
```python
[
  "!0: Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[]"
]
```

for recompiles:
```python
[
  "!0: Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[]",
  "!1: Cache miss due to 7 changed tensor shapes (total of 7): sizes[0], sizes[1], sizes[2], sizes[3], sizes[4], sizes[5], sizes[6]"
]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146386
Approved by: https://github.com/jansel
ghstack dependencies: #146229
2025-02-05 23:33:21 +00:00
68304dba7a Revert "[inductor] Refactor op handlers part 4 (#146255)"
This reverts commit 7aced455c542f629ffcd4f79c6af259bb966add8.

Reverted https://github.com/pytorch/pytorch/pull/146255 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146255#issuecomment-2638258089))
2025-02-05 23:24:20 +00:00
49effa0deb Revert "[inductor] Refactor op handlers part 5 (#146257)"
This reverts commit d3dd3eeb7f599a2816ba1a067a8fa5a1bb1c84c3.

Reverted https://github.com/pytorch/pytorch/pull/146257 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146257#issuecomment-2638251994))
2025-02-05 23:20:38 +00:00
93e1e6e07c Revert "[inductor] Minor compile time optimizations in DefaultHandler (#146282)"
This reverts commit b8a529cca18ae4d21b1681c5ea3a40635aba5a83.

Reverted https://github.com/pytorch/pytorch/pull/146282 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146282#issuecomment-2638239575))
2025-02-05 23:13:08 +00:00
7dc5cfe2ad Revert "[inductor] Refactor CaptureIndexing into global scope (#146297)"
This reverts commit 7288950bcd4c5851e003dded6ce87da643b93e49.

Reverted https://github.com/pytorch/pytorch/pull/146297 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146297#issuecomment-2638234829))
2025-02-05 23:10:08 +00:00
9555bfce88 Revert "[inductor] Pre-populate cache for simplify_with_ranges return value (#146373)"
This reverts commit 84ba9c6e7844a0b457bc64ca70a9c8cf3655d03d.

Reverted https://github.com/pytorch/pytorch/pull/146373 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146373#issuecomment-2638232033))
2025-02-05 23:07:08 +00:00
8af31e30d7 [Codemod][AddExplicitStrictExportArg] caffe2/torch (#146439)
Differential Revision: D69068432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146439
Approved by: https://github.com/avikchaudhuri
2025-02-05 22:56:54 +00:00
97b64f2e5c Fix workflow for closing nonexistent disable issues (#146447)
The workflow could not update issues because it didn't have permissions, and it looked green because it didn't check return codes.

Tested by running the workflow and seeing that issues did get closed
Fixes https://github.com/pytorch/pytorch/issues/145382
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146447
Approved by: https://github.com/huydhn
2025-02-05 22:29:05 +00:00
9b6d680131 Remove stage_index_to_group_rank from schedule (#146217)
This PR allows schedules loaded via CSV to automatically set their `stage_index_to_group_rank` and removes the `stage_index_to_group_rank` argument from the `PipelineScheduleMulti` constructor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146217
Approved by: https://github.com/wconstab
ghstack dependencies: #146193
2025-02-05 21:26:45 +00:00
4ee7d0de86 Add generate_stage_to_rank_mapping utility (#146193)
We use `stage_index_to_group_rank` in the stage to determine which send/recv ops to issue, and in the schedule for IR generation. However, we don't need to expose this as an argument in our schedule class, so this stack of PRs removes it.

This PR creates a `stage_index_to_group_rank` utility function and removes the arg for the ZBVschedule. In a following PR I will add code to infer the `stage_index_to_group_rank` for the CSV schedule path and we will be able to remove this argument from our classes entirely.
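
A stage-to-rank mapping of this kind can be sketched in plain Python. The function name `sketch_stage_to_rank`, its parameters, and the two style names are hypothetical and are not the utility added by this PR; the "v" layout is only an assumption about how a V-shaped (ZBV-like) schedule might place stages.

```python
def sketch_stage_to_rank(pp_size: int, num_stages: int, style: str = "loop") -> dict:
    """Map each pipeline stage index to the rank that owns it (illustrative only)."""
    if num_stages % pp_size != 0:
        raise ValueError("num_stages must be a multiple of pp_size")
    mapping = {}
    for stage in range(num_stages):
        pos = stage % pp_size        # position inside the current pass over the ranks
        chunk = stage // pp_size     # which pass over the ranks this stage belongs to
        if style == "loop":
            # Looped placement: stages wrap around the ranks in the same direction.
            mapping[stage] = pos
        elif style == "v":
            # V-shaped placement (assumed): every other pass walks the ranks backwards.
            mapping[stage] = pos if chunk % 2 == 0 else pp_size - 1 - pos
        else:
            raise ValueError(f"unknown style: {style!r}")
    return mapping
```

With 4 ranks and 8 stages, the looped layout puts stage 4 back on rank 0, while the V-shaped layout puts it on rank 3.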

Related comment from @wconstab https://github.com/pytorch/torchtitan/issues/774#issuecomment-2619793741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146193
Approved by: https://github.com/wconstab
2025-02-05 21:26:45 +00:00
98b5d455fd [opcheck] Improve error reporting; allow atol/rtol overrides (#146488)
This PR improves opcheck to:
1. directly use torch.testing.assert_close (without a msg override).
   This allows it to print the absolute and relative differences and the
   number of mismatched elements.
2. take in an atol/rtol tolerance (for if someone just wants to use
   opcheck in their testing).
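
To illustrate what the atol/rtol overrides control, here is a minimal pure-Python sketch of the closeness rule that torch.testing.assert_close applies, `|actual - expected| <= atol + rtol * |expected|`. The helper `mismatch_report` is hypothetical and only mimics the kind of summary described above; it is not opcheck's implementation. The defaults shown are the float32 defaults of assert_close.

```python
def mismatch_report(actual, expected, atol=1e-5, rtol=1.3e-6):
    """Summarize elementwise mismatches between two equal-length float sequences."""
    abs_diffs = [abs(a - e) for a, e in zip(actual, expected)]
    # An element mismatches when its absolute difference exceeds the combined tolerance.
    mismatched = [
        i for i, (d, e) in enumerate(zip(abs_diffs, expected))
        if d > atol + rtol * abs(e)
    ]
    max_abs = max(abs_diffs, default=0.0)
    # Relative difference is only defined where the expected value is non-zero.
    max_rel = max((d / abs(e) for d, e in zip(abs_diffs, expected) if e != 0), default=0.0)
    return {
        "mismatched_indices": mismatched,
        "num_mismatched": len(mismatched),
        "max_abs_diff": max_abs,
        "max_rel_diff": max_rel,
    }
```

Loosening `atol` turns a failing comparison into a passing one without touching the operator under test.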

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146488
Approved by: https://github.com/williamwen42
2025-02-05 21:25:06 +00:00
1f6b566d74 [ONNX] Bump onnx and onnxscript versions in CI (#146097)
Bump onnx and onnxscript==0.1 in CI; skipped onnxruntime 1.19 because it has a regression on avgpool.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146097
Approved by: https://github.com/malfet
2025-02-05 21:00:25 +00:00
9da376daa6 Add retain-output argument (#145921)
This PR adds a retain-output argument, which enables appending to the output file if it already exists instead of deleting it and creating a new one.
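
The append-versus-recreate behavior can be sketched with the standard file modes. The helper `write_results` and its `retain_output` parameter are hypothetical stand-ins, not the code touched by this PR, and truncating with mode "w" is used here in place of delete-and-create.

```python
import os
import tempfile


def write_results(path, lines, retain_output=False):
    """Write lines to path, appending when retain_output is set (illustrative only)."""
    mode = "a" if retain_output else "w"  # "a" keeps existing content, "w" starts over
    with open(path, mode) as f:
        for line in lines:
            f.write(line + "\n")


with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "results.csv")
    write_results(out, ["run1"])
    write_results(out, ["run2"], retain_output=True)  # appended after run1
    with open(out) as f:
        retained = f.read()
    write_results(out, ["run3"])                      # previous content is discarded
    with open(out) as f:
        overwritten = f.read()
```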

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145921
Approved by: https://github.com/jansel
2025-02-05 19:45:09 +00:00
dd349207c5 Add check that envvar configs are boolean (#145454)
So we don't get unexpected behavior when higher typed values are passed in
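
A strict boolean check on an environment variable might look like the sketch below. The helper `read_bool_env` and the choice to accept only "0" and "1" are assumptions made for illustration; the PR's actual validation may accept other spellings.

```python
import os


def read_bool_env(name, default=False, environ=os.environ):
    """Return the boolean value of an env var, rejecting anything but '0' or '1'."""
    raw = environ.get(name)
    if raw is None:
        return default
    if raw not in ("0", "1"):
        # Fail loudly instead of silently treating e.g. "2" or "yes" as truthy.
        raise ValueError(f"{name} must be '0' or '1', got {raw!r}")
    return raw == "1"
```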
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145454
Approved by: https://github.com/c00w, https://github.com/jamesjwu
2025-02-05 19:40:10 +00:00
9091096d6c Refactoring Distributed test cases to be device agnostic [1/n] (#145222)
In this series of PRs we intend to refactor distributed test cases so that they are completely device agnostic.

These changes will use the following approaches:

- Allowing for multiple device types using instantiate_device_type_test
- Replacing calls to cuda stream with torch.get_device_module(device) wherever it applies
- Skipping set up steps required while using MultiProcessTestCase with DistributedTestBase (#138216) wherever applicable
- Replacing explicit calls to distributed backend (NCCL,HCCL,etc) with get_default_backend_for_device (#140536).

This should result in a significant improvement in usability for all devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145222
Approved by: https://github.com/kwen2501
2025-02-05 18:47:09 +00:00
eqy
6f7fda3f49 Bump nn.functional.conv3d tolerances for test_comprehensive (#135719)
`float16` tolerance was previously set to `1e-5` which seemed very low
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135719
Approved by: https://github.com/Chillee, https://github.com/albanD
2025-02-05 18:34:12 +00:00
d2a2b9f8a7 Fix constants with non-functional operators (#145593)
Previously, in the non-strict path, we always raised an error when trying to update a constant tensor in place, because those constant tensors are not actually wrapped by functional tensors. This is the correct behaviour in torch.compile, because dynamo turns all constant tensors into buffers and AOTDispatcher just lifts them and wraps them in functional tensors. However, in non-strict, no step registers constants as buffers, so AOTDispatcher panics when it sees these dangling constant tensors while functionalizing.

Due to recent change in the IR, this is no longer an issue in non-strict path because we don't call AOTDispatcher at training IR level, but now it is a problem for both strict and non-strict when we lower to inference. (lowering to inference is very similar to non-strict tracing) As a result, we have at least one external (https://github.com/pytorch/pytorch/issues/141336) and internal issues reported due to this difference.

To fix this, there are two ways:
1. Make functionalization be aware of constant tensors and map them to functional tensors on the fly. This makes functionalization invariant uglier and could potentially open up a gate for more nasty bugs.
2. Special-case this in export. This seems more aligned with what dynamo does today, so I think we should do it this way. I think the current state could benefit from more refactors to make run_decompositions more similar to strict export (because both of them now handle this constant-registering logic), but it is a bit complicated to do now because the strict export version of this logic is also incomplete (it doesn't take into account the export graph renaming pass, etc.). I will follow up with more refactors after this PR (T213466691) to unblock users faster.

For future reference:

Why are we not "turning constants into non-persistent buffers and never de-registering"? Because some internal models rely on module.to working reliably to move params/buffers to the correct device. As a result, buffers are moved while constants are not. In the composability meeting, we agreed that export won't do device-agnostic tracing going forward (it will provide a way to specify a FakeTensor on CPU that can be configured to run on GPU), so once that is done, we can always turn constants into non-persistent buffers, which will simplify export's constant handling.

Differential Revision: [D68610739](https://our.internmc.facebook.com/intern/diff/D68610739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145593
Approved by: https://github.com/avikchaudhuri
2025-02-05 17:44:19 +00:00
44248c44eb [ROCm] miopen benchmark behavior now better aligns with cudnn (#145294)
The default benchmark setting is now false. The new miopen behavior means when benchmarking is disabled, for any shape that doesn't have a find hit, then it will do a quick search (same behavior as the prior default), and use that result. Now when benchmark is enabled, it will perform an exhaustive search and update any DBs. miopen immediate mode is still available and is used when deterministic is true and benchmark is false.
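
The behavior described above can be summarized as a small decision function. This is only a sketch in plain Python: `sketch_miopen_mode` and its return strings are hypothetical, and giving `benchmark` precedence when both flags are true is an assumption that the description does not state.

```python
def sketch_miopen_mode(benchmark: bool, deterministic: bool) -> str:
    """Pick the search behavior implied by the two flags (illustrative only)."""
    if benchmark:
        # Benchmarking enabled: exhaustive search, and the find DBs get updated.
        return "exhaustive search"
    if deterministic:
        # Deterministic without benchmarking: immediate mode is still used.
        return "immediate mode"
    # Default: reuse a find hit if one exists, otherwise do a quick search.
    return "find hit or quick search"
```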

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145294
Approved by: https://github.com/BrianHarrisonAMD, https://github.com/malfet
2025-02-05 17:19:53 +00:00
f27220e32a Revert "Move get accelerator to use build time flags when possible (#146098)"
This reverts commit 157d81c201715f84ead21d0ee420669ab7f58c04.

Reverted https://github.com/pytorch/pytorch/pull/146098 on behalf of https://github.com/atalman due to Failing internally, sorry need to revert ([comment](https://github.com/pytorch/pytorch/pull/146098#issuecomment-2637443675))
2025-02-05 16:39:37 +00:00
f55c0af37f [inductor] Support non-power-of-2 cooperative RSPLIT (#145689)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145689
Approved by: https://github.com/eellison
2025-02-05 16:36:53 +00:00
db22e9d5a2 Implement blend operation for float, double, int in VEC ATen backend for SVE (#146479)
- Added support for the SVE vectorized blend operation for float, double, int8_t, int16_t, int32_t and int64_t data types.
- Utilizes SVE ACLE intrinsics (svcntb, svcntw, svcmpne, svsel) to handle different vector lengths (VL) dynamically.
- Ensured compatibility with SVE128, SVE256, and SVE512 hardware configurations.
- Re-enabled the blend SVE vec tests.

**Testing:**
**a) Float DType:**
./vec_test_all_types_SVE256 --gtest_filter=BitwiseFloatsAdditional2/0.Blend    [Test Passed] on Graviton 3 machine (SVE256)
./vec_test_all_types_SVE128 --gtest_filter=BitwiseFloatsAdditional2/0.Blend    [Test Passed] on Graviton 4 machine (SVE128)

**b) Double DType:**
./vec_test_all_types_SVE256 --gtest_filter=BitwiseFloatsAdditional2/1.Blend    [Test Passed] on Graviton 3 machine (SVE256)
./vec_test_all_types_SVE128 --gtest_filter=BitwiseFloatsAdditional2/1.Blend    [Test Passed] on Graviton 4 machine (SVE128)

**c) Int DType:**
python3 test/inductor/test_cpu_repro.py CPUReproTests.test_vec_remainder
[Test Passed] on Graviton 3 machine (SVE256) and on Graviton 4 machine (SVE128)
<img width="661" alt="grv4_test_case_passed" src="https://github.com/user-attachments/assets/5572fcc0-a861-4bd6-bf9e-356219ffe656" />

Fixes https://github.com/pytorch/pytorch/issues/146309

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146479
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-05 16:29:13 +00:00
cd6c0707a8 [aoti] Assign proxy call args by name, and support default values. (#146263)
Fixing the following issue when compiling the following program:
```
                window = torch.hann_window(N_FFT).to(x.device)
                stft = torch.stft(
                    x, N_FFT, HOP_LENGTH, window=window, return_complex=True
                )
                magnitudes = stft[..., :-1].abs() ** 2
                return magnitudes
```
```
Traceback (most recent call last):
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
    yield
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 623, in run
    self._callTestMethod(testMethod)
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/zhxchen17/pytorch/torch/testing/_internal/common_utils.py", line 3120, in wrapper
    method(*args, **kwargs)
  File "/home/zhxchen17/pytorch/test/inductor/test_torchinductor.py", line 12356, in new_test
    return value(self)
           ^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor.py", line 4334, in test_stft
    self.check_model(model, example_inputs)
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 185, in check_model
    actual = AOTIRunnerUtil.run(
             ^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 137, in run
    optimized = AOTIRunnerUtil.load(device, so_path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 119, in load
    return torch._export.aot_load(so_path, device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/torch/_export/__init__.py", line 165, in aot_load
    runner = torch._C._aoti.AOTIModelContainerRunnerCuda(so_path, 1, device)  # type: ignore[assignment, call-arg]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected extern kernel aten::hann_window to have serialized argument type as_scalar_type for argument 1 but got as_device
```
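
The error above comes from matching serialized arguments to schema positions: `device` landed in the slot where `dtype` was expected. Binding by name with defaults avoids that. The sketch below is plain Python and purely illustrative; `SCHEMA`, `REQUIRED`, and `bind_by_name` are hypothetical names, and the schema is a simplified rendering of `aten::hann_window`'s arguments.

```python
REQUIRED = object()  # sentinel marking arguments that have no default

# Simplified (name, default) schema, in declaration order.
SCHEMA = [
    ("window_length", REQUIRED),
    ("dtype", None),
    ("layout", None),
    ("device", None),
    ("pin_memory", None),
]


def bind_by_name(schema, serialized):
    """Resolve call arguments by name, filling defaults for the ones not serialized."""
    known = {name for name, _ in schema}
    unknown = set(serialized) - known
    if unknown:
        raise TypeError(f"unexpected arguments: {sorted(unknown)}")
    bound = {}
    for name, default in schema:
        if name in serialized:
            bound[name] = serialized[name]
        elif default is REQUIRED:
            raise TypeError(f"missing required argument: {name}")
        else:
            bound[name] = default
    return bound
```

With this approach a call that serializes only `window_length` and `device` still resolves `dtype` and `layout` to their defaults instead of shifting `device` into the wrong slot.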

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146263
Approved by: https://github.com/angelayi
2025-02-05 15:43:05 +00:00
1bb977a2a4 [auto_functionalized] Support Tensor(a!)[]? (#145400)
Summary:
This is just updating some of the checks to allow the Tensor(a!)[]? type
through.

Fixes #144072

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145400
Approved by: https://github.com/laithsakka
2025-02-05 14:52:39 +00:00
282d185ec1 Revert "[inductor] use ftz variant of exp (#146216)"
This reverts commit b0b3fe8bcf00f30513e9bb3e197ea4cbcc2beef0.

Reverted https://github.com/pytorch/pytorch/pull/146216 on behalf of https://github.com/atalman due to inductor/test_op_completeness.py::TestOpCompleteness::test_triton_overrides [GH job link](https://github.com/pytorch/pytorch/actions/runs/13152430750/job/36702812599) [HUD commit link](b0b3fe8bcf) ([comment](https://github.com/pytorch/pytorch/pull/146216#issuecomment-2636961317))
2025-02-05 14:13:45 +00:00
8a2000fd42 [MPS] Implement support for zeta (both eager and inductor). (#146465)
A test was failing in inductor (`test_pointwise_zeta`) -- and I realized the operation was missing also from eager.
Implemented for both, leveraging the kernel. Happy to split in two (one PR for eager, one for inductor) if folks prefer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146465
Approved by: https://github.com/malfet
2025-02-05 13:55:50 +00:00
fd0cd6a08f [ROCm][TunableOp] Improve identification of fastest solution (#144942)
This PR addresses some stability issues with identifying the fastest solution on AMD GPUs, particularly the MI300.

Changes include:
- An improved timer, StreamTimerNoSync
- More aggressive skipping of slow solutions
- Additional statistics that can be used for diagnostics (PYTORCH_TUNABLEOP_VERBOSE=3)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144942
Approved by: https://github.com/jeffdaily
2025-02-05 11:16:49 +00:00
e20b0c82d1 [ca] no longer require is_traceable annotations for c++ autograd functions (#146229)
This PR removes the CA compile-time error for C++ autograd functions, and supports them by having dynamo graph break on them (instead of allow_in_graph). The CppNode's collects are kept as is for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146229
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-02-05 08:49:17 +00:00
cyy
6293d1446b [2/N] Remove NOLINT suppressions (#146402)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146402
Approved by: https://github.com/soulitzer
2025-02-05 08:38:52 +00:00
e5ea7e9cdc add support for capturing provenance of unary operations (#146413)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146413
Approved by: https://github.com/angelayi
ghstack dependencies: #145848
2025-02-05 08:31:38 +00:00
b0b3fe8bcf [inductor] use ftz variant of exp (#146216)
Inductor generated exp op is compiled as the following ptx snippet by Triton.

```
        mul.f32         %f74, %f83, 0f3FB8AA3B;
        ex2.approx.f32 %f73, %f74;
```

But if we enable --use_fast_math in nvcc, exp in CUDA is compiled as
```
	mul.ftz.f32 	%f2, %f1, 0f3FB8AA3B;
	ex2.approx.ftz.f32 	%f3, %f2;
```
which uses the FTZ variant.
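
For readers unfamiliar with the PTX above, the sketch below decodes the constant and emulates flush-to-zero in plain Python. `0f3FB8AA3B` is the float32 bit pattern of log2(e), so `exp(x)` is computed as `2 ** (x * log2(e))`; the `.ftz` modifier flushes subnormal inputs and results to sign-preserving zero. The helper names are hypothetical, and the arithmetic here runs in double precision, so it only approximates what the hardware does.

```python
import math
import struct

# Decode the PTX hex literal 0f3FB8AA3B as a big-endian IEEE-754 float32.
LOG2E_F32 = struct.unpack(">f", bytes.fromhex("3FB8AA3B"))[0]

# Smallest positive normal float32; anything smaller in magnitude is subnormal.
FLT_MIN_NORMAL = 2.0 ** -126


def exp_via_ex2(x):
    """exp(x) written the way the PTX does it: a multiply followed by a base-2 exponential."""
    return 2.0 ** (x * LOG2E_F32)


def flush_to_zero(x):
    """Emulate FTZ: subnormal float32 magnitudes become a zero with the same sign."""
    if 0.0 < abs(x) < FLT_MIN_NORMAL:
        return math.copysign(0.0, x)
    return x
```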

Let Inductor generate the FTZ variant when the use_fast_math config is true.

I see a 4% speedup for the two-pass prepare_softmax kernel; online softmax should be affected more since it does more computation per second (>10% in my testing).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146216
Approved by: https://github.com/jansel
2025-02-05 07:35:43 +00:00
clr
93d98aca31 inductor: Don't throw an internal error when a nn.module is missing a attribute (#145122)
If a nn.module getattr call throws, we should make sure that we don't crash with an internal error

Note that I couldn't figure out how to test this, so advice would be awesome. My best attempt is at https://github.com/pytorch/pytorch/pull/145799, but it doesn't seem to reproduce the crash.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145122
Approved by: https://github.com/jansel
2025-02-05 05:49:32 +00:00
eb832b7bcc [export] Fix draft-export logging (#146106)
Summary: Fix issue where the lazyTraceHandler does not exist

Test Plan: CI

Differential Revision: D68928070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146106
Approved by: https://github.com/yiming0416
2025-02-05 05:49:22 +00:00
f242da41c7 Revert "move and fix logic to update unbacked bindings (#146115)"
This reverts commit 0144613e6ff6e018ca41085d1509dcceb80987f7.

Reverted https://github.com/pytorch/pytorch/pull/146115 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/146115#issuecomment-2635695958))
2025-02-05 04:51:39 +00:00
cyy
c6ea4425e5 Enable some tests on Windows (#146243)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146243
Approved by: https://github.com/albanD
2025-02-05 03:54:28 +00:00
f35e60b21c Revert "[cutlass backend] fix bug for accuminator dtype (#146356)"
This reverts commit 7c8ec84dab7dc10d4ef90afc93a49b97bbd04503.

Reverted https://github.com/pytorch/pytorch/pull/146356 on behalf of https://github.com/huydhn due to Sorry for reverting your change but some slow cutlass tests are failing ([comment](https://github.com/pytorch/pytorch/pull/146356#issuecomment-2635594712))
2025-02-05 03:01:50 +00:00
3c0d2bc262 Revert "[Testing] Reduce test_exp flakiness (#146436)"
This reverts commit 4c5a9a5f949ef3019fc3ef095034ccfc973ff13d.

Reverted https://github.com/pytorch/pytorch/pull/146436 on behalf of https://github.com/huydhn due to Some test_exp2 starts failing in trunk I think ([comment](https://github.com/pytorch/pytorch/pull/146436#issuecomment-2635591878))
2025-02-05 02:58:53 +00:00
aafaf4016f [MPS] Add error checking when dispatching kernel (#146458)
Check that the thread-group size does not exceed the maximum thread-group size.
Add a regression test to validate that.
This makes failures like https://github.com/pytorch/pytorch/issues/146430 much easier to detect.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146458
Approved by: https://github.com/dcci
2025-02-05 02:56:40 +00:00
9e45bc82e9 [aarch64] CUDA 12.8 aarch64 builds to nightly binaries (#146378)
https://github.com/pytorch/pytorch/issues/145570

Adding CUDA 12.8 and keeping 12.6 for the sbsa build; supported CUDA_ARCH: 9.0, 10.0, 12.0

Refactor the binaries matrix for cuda sbsa build. Previously cuda-aarch64 was hardcoded to cuda 12.6. Now reads 12.6 and 12.8, new build naming example [manywheel-py3_9-cuda-aarch64-12_8-build](https://github.com/pytorch/pytorch/actions/runs/13132625006/job/36640885079?pr=146378#logs)

TODO: once 12.8 is stable, remove 12.6 in sbsa

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146378
Approved by: https://github.com/atalman
2025-02-05 02:55:21 +00:00
001ad5bef5 [MPSInductor] Scope-down test_prod running in MPS (#146460)
Multi-stage reductions are not yet supported, and the original `test_prod` just returned 0 for large reductions, so failures were reported as flaky. Running the same test with `MTL_DEBUG_LAYER=1` makes the failure obvious:
```
2025-02-04 11:51:30.034 Python[16594:289093] Metal API Validation Enabled
test_prod (__main__.MPSBasicTests.test_prod) ... -[MTLDebugComputeCommandEncoder _validateThreadsPerThreadgroup:]:1266: failed assertion `(threadsPerThreadgroup.width(1) * threadsPerThreadgroup.height(2050) * threadsPerThreadgroup.depth(1))(2050) must be <= 1024. (device threadgroup size limit)'
```
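
The assertion in the log amounts to a product check on the thread-group dimensions. A minimal sketch of such a check in plain Python follows; `check_threadgroup` is a hypothetical helper, and the default limit of 1024 is taken from the log message above and is device dependent.

```python
def check_threadgroup(width, height, depth, max_threads=1024):
    """Return the total threads per thread group, raising if it exceeds the limit."""
    total = width * height * depth
    if total > max_threads:
        raise ValueError(
            f"threads per threadgroup ({width} * {height} * {depth} = {total}) "
            f"must be <= {max_threads}"
        )
    return total
```

For the failing case in the log, the dimensions 1 x 2050 x 1 give 2050 threads, which is above the limit of 1024.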

Fixes https://github.com/pytorch/pytorch/issues/146430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146460
Approved by: https://github.com/dcci
2025-02-05 01:47:01 +00:00
52aaadf379 [BE][Ez]: Enable ruff rule E731. use def instead of anonymous lambda (#146410)
Not sure why this isn't enabled, only 1 fix is needed and it supports autofixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146410
Approved by: https://github.com/aorenste, https://github.com/albanD
2025-02-05 01:44:41 +00:00
0e060342b6 [triton] Update pin to tip of 3.2 release (#145867)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145867
Approved by: https://github.com/Skylion007, https://github.com/htyu, https://github.com/exclamaforte, https://github.com/jansel
2025-02-05 01:42:33 +00:00
616ac94175 [Dynamo] Fix spammy optimizer warning (#146374)
Fixes https://discuss.pytorch.org/t/torch-compile-optimizer-step-generates-excessive-warning-messages/216067/7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146374
Approved by: https://github.com/anijain2305
2025-02-05 01:03:49 +00:00
8177fc4d33 Make regex error catching compatible with Python 3.12+. (#145945)
In Python 3.12, the error message has changed from "Can't pickle local object" to "Can't get local object".
The old regex would no longer catch the error.

This PR makes the regex compatible with Python 3.12 while remaining backward compatible.
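
A pattern that accepts both wordings could look like the sketch below. The name `LOCAL_OBJECT_ERROR` and the sample messages are illustrative assumptions; the regex actually used in the PR may differ.

```python
import re

# Matches the pre-3.12 wording ("pickle") and the 3.12+ wording ("get").
LOCAL_OBJECT_ERROR = re.compile(r"Can't (?:get|pickle) local object")

old_message = "Can't pickle local object 'make_fn.<locals>.inner'"
new_message = "Can't get local object 'make_fn.<locals>.inner'"

matches_old = LOCAL_OBJECT_ERROR.search(old_message) is not None
matches_new = LOCAL_OBJECT_ERROR.search(new_message) is not None
```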

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145945
Approved by: https://github.com/H-Huang
2025-02-05 00:57:36 +00:00
9d5bf38dec [cpp_builder] refactor to reduce libcudart_static logs (#146394)
Want to reduce logs from `log_msg = f'"libcudart_static.a" not found under {path}'`, which was added in https://github.com/pytorch/pytorch/pull/142175

Differential Revision: D69096354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146394
Approved by: https://github.com/benjaminglass1, https://github.com/chenyang78
2025-02-05 00:41:30 +00:00
658e22d495 Revert "add support for capturing provenance of unary operations (#146413)"
This reverts commit bc33d993acdff2637bc6aee5e604fb969b11fc13.

Reverted https://github.com/pytorch/pytorch/pull/146413 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but some export tests are failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/146413#issuecomment-2635440261))
2025-02-05 00:32:40 +00:00
6e03f4f90e [export] Include metadata in FlatArgsAdapter (#146107)
Summary:
With https://github.com/pytorch/pytorch/pull/145956, which introduces
storing a list of namedtuple field names when serializing, we now want to
expose this list to the args adapter so that APS can utilize this information
and remove extraneous inputs.

Test Plan: No-op

Differential Revision: D68928416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146107
Approved by: https://github.com/pianpwk
2025-02-05 00:29:58 +00:00
84ba9c6e78 [inductor] Pre-populate cache for simplify_with_ranges return value (#146373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146373
Approved by: https://github.com/yanboliang, https://github.com/shunting314
ghstack dependencies: #146225, #146226, #146235, #146252, #146254, #146255, #146257, #146282, #146297
2025-02-04 23:36:44 +00:00
7288950bcd [inductor] Refactor CaptureIndexing into global scope (#146297)
And inline SimplifyIndexing into CaptureIndexing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146297
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226, #146235, #146252, #146254, #146255, #146257, #146282
2025-02-04 23:36:44 +00:00
b8a529cca1 [inductor] Minor compile time optimizations in DefaultHandler (#146282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146282
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226, #146235, #146252, #146254, #146255, #146257
2025-02-04 23:36:34 +00:00
d3dd3eeb7f [inductor] Refactor op handlers part 5 (#146257)
This makes OpHandler just a normal class using inheritance, and removes the typing workarounds that were needed because it wasn't one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146257
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226, #146235, #146252, #146254, #146255
2025-02-04 23:36:25 +00:00
7aced455c5 [inductor] Refactor op handlers part 4 (#146255)
This replaces the `__getattr__()` pattern used in remaining OpHandlers with a `DefaultHandler` class defined in part 2.

Some compile time wins from this as well:
```
2025-02-02T19:46:32.2033010Z
2025-02-02T19:46:32.2036607Z WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 29633182927 is -1.71% lower than expected 30150000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2037575Z
2025-02-02T19:46:32.2037907Z please update all results that changed significantly, and not only the failed ones
2025-02-02T19:46:32.2039291Z PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 43986879172 -1.02% is within expected 44440000000 ±2.50%
2025-02-02T19:46:32.2040131Z
2025-02-02T19:46:32.2041180Z WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26246225695 is -1.85% lower than expected 26740000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2042188Z
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146255
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226, #146235, #146252, #146254
2025-02-04 23:36:17 +00:00
8e9bda8d89 [inductor] Refactor op handlers part 3 (#146254)
Fixes type errors that arise from typing `V.ops`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146254
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226, #146235, #146252
2025-02-04 23:36:09 +00:00
13f0436abd [inductor] Refactor op handlers part 2 (#146252)
This replaces the `__getattr__()` pattern used in (some) OpHandlers with a `DefaultHandler` class that has an implementation of every op that calls `self._default()`.
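
The difference between the two patterns can be shown with a toy example. The classes below are hypothetical and far smaller than Inductor's real handlers; they only illustrate why explicit methods that forward to `_default()` are easier to type-check and override than a catch-all `__getattr__()`.

```python
class GetattrHandler:
    """Old style: any attribute name is accepted and routed to _default at call time."""

    def __getattr__(self, name):
        def inner(*args, **kwargs):
            return self._default(name, args, kwargs)
        return inner

    def _default(self, name, args, kwargs):
        return f"{name}({', '.join(map(str, args))})"


class DefaultHandlerSketch:
    """New style: one real method per op, each forwarding to _default."""

    def _default(self, name, args, kwargs):
        raise NotImplementedError

    def add(self, a, b):
        return self._default("add", (a, b), {})

    def mul(self, a, b):
        return self._default("mul", (a, b), {})

    def neg(self, a):
        return self._default("neg", (a,), {})


class PrintingHandler(DefaultHandlerSketch):
    def _default(self, name, args, kwargs):
        return f"{name}({', '.join(map(str, args))})"

    def neg(self, a):
        # A subclass can still special-case individual ops.
        return f"-{a}"
```

With explicit methods, a misspelled op name is an ordinary AttributeError (and visible to a type checker), whereas the `__getattr__()` version accepts any name.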

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146252
Approved by: https://github.com/yanboliang
ghstack dependencies: #146225, #146226, #146235
2025-02-04 23:36:01 +00:00
67be5953fe [inductor] Refactor op handlers part 1 (#146235)
This enforces the invariant that every backend implements the same set of ops and removes a layer of indirection for BasicMathOps.

Interestingly this is a small compile time win:
```
...
WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 30151159301 is -6.13% lower than expected 32120000000 ±1.50% please update the expected results.

please update all results that changed significantly, and not only the failed ones
PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 44447549162 -1.69% is within expected 45210000000 ±2.50%

WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26743557195 is -2.25% lower than expected 27360000000 ±1.50% please update the expected results.

please update all results that changed significantly, and not only the failed ones
PASS: benchmark ('basic_modules_ListOfLinears_eager', 'compile_time_instruction_count') pass, actual result 945129734 +0.93% is within expected 936400000 ±1.50%

WIN: benchmark ('basic_modules_ListOfLinears_inductor', 'compile_time_instruction_count') failed, actual result 18984384503 is -3.19% lower than expected 19610000000 ±1.50% please update the expected results.

please update all results that changed significantly, and not only the failed ones
WIN: benchmark ('basic_modules_ListOfLinears_inductor_gpu_force_shape_pad', 'compile_time_instruction_count') failed, actual result 17258025389 is -1.94% lower than expected 17600000000 ±1.50% please update the expected results.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146235
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226
2025-02-04 23:35:53 +00:00
ed03f9ca10 [inductor] Refactor CSEProxy into global scope (#146226)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146226
Approved by: https://github.com/shunting314
ghstack dependencies: #146225
2025-02-04 23:35:43 +00:00
5cac550ddf [inductor] Finish typing common.py (#146225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146225
Approved by: https://github.com/Skylion007
2025-02-04 23:35:33 +00:00
7c8ec84dab [cutlass backend] fix bug for accuminator dtype (#146356)
Will add unit tests for accuracy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146356
Approved by: https://github.com/Chillee
2025-02-04 22:10:17 +00:00
13e17aa106 Make the CUTLASS swizzle options configurable and default to 2. (#146088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146088
Approved by: https://github.com/henrylhtsang, https://github.com/mlazos
2025-02-04 22:07:26 +00:00
aac0577796 [TEST][Sparse] Force CUTLASS backend in TestSparseSemiStructuredCUTLASS (#146398)
We noticed a discrepancy between the ways `test_sparse_semi_structured.py` can be invoked. Depending on the invocation, the test fails spuriously because it attempts to run on the wrong backend. This happens because `SparseSemiStructuredTensor._FORCE_CUTLASS = True` was never set in the setup of `TestSparseSemiStructuredCUTLASS`, as it was in its `TestSparseSemiStructuredCUSPARSELT` counterpart 8444fe019a/test/test_sparse_semi_structured.py (L1039-L1046)

When I run tests via pytest, just by sheer luck it calls `test_values_backend_cutlass_cuda`, which sets the backend to CUTLASS bb4bd5f00b/test/test_sparse_semi_structured.py (L475), before `test_conversions_all_patterns_cuda_*`:
```
test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUDA::test_values_backend_cutlass_cuda PASSED [0.0071s]                                                                                          [ 72%]
test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUTLASSCUDA::test_conversions_all_patterns_cuda_bfloat16 PASSED [0.0484s]                                                                        [ 73%]
test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUTLASSCUDA::test_conversions_all_patterns_cuda_float16 PASSED [0.0041s]                                                                         [ 73%]
test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUTLASSCUDA::test_conversions_all_patterns_cuda_int8 PASSED [0.0079s]                                                                            [ 73%]
```
In this scenario everything is good.

But when invoked as `python test/test_sparse_semi_structured.py -v -k cuda`, the order of the tests is different, and the cuSparseLt backend is set just before `test_conversions_all_patterns_cuda_*` runs, which causes failures:
```
test_cusparselt_backend_cuda (__main__.TestSparseSemiStructuredCUSPARSELTCUDA.test_cusparselt_backend_cuda) ... ok
...
test_conversions_all_patterns_cuda_bfloat16 (__main__.TestSparseSemiStructuredCUTLASSCUDA.test_conversions_all_patterns_cuda_bfloat16) ... FAIL
test_conversions_all_patterns_cuda_float16 (__main__.TestSparseSemiStructuredCUTLASSCUDA.test_conversions_all_patterns_cuda_float16) ... FAIL
test_conversions_all_patterns_cuda_int8 (__main__.TestSparseSemiStructuredCUTLASSCUDA.test_conversions_all_patterns_cuda_int8) ... ERROR
```
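The order dependence described above comes from a process-wide flag that one test sets and a later test silently inherits. A minimal sketch of the fix, using a hypothetical stand-in `BackendConfig` class in place of `SparseSemiStructuredTensor` (only the `_FORCE_CUTLASS` attribute name comes from this commit message; the class names and the `run_in_order` helper are assumptions for illustration):

```python
import io
import unittest


class BackendConfig:
    """Hypothetical stand-in for a class holding a process-wide backend flag."""

    _FORCE_CUTLASS = True


class CusparseltTests(unittest.TestCase):
    def setUp(self):
        # This suite explicitly selects the non-CUTLASS backend.
        BackendConfig._FORCE_CUTLASS = False

    def test_backend(self):
        self.assertFalse(BackendConfig._FORCE_CUTLASS)


class CutlassTests(unittest.TestCase):
    def setUp(self):
        # Without this line the suite would inherit whatever value the
        # previously executed test happened to leave behind.
        BackendConfig._FORCE_CUTLASS = True

    def test_backend(self):
        self.assertTrue(BackendConfig._FORCE_CUTLASS)


def run_in_order(*cases):
    """Run the given test cases in the given order and report success."""
    suite = unittest.TestSuite()
    loader = unittest.TestLoader()
    for case in cases:
        suite.addTests(loader.loadTestsFromTestCase(case))
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite).wasSuccessful()
```

Because each suite sets the flag in its own `setUp`, both `run_in_order(CusparseltTests, CutlassTests)` and the reverse order succeed, which is the property the pytest and `python ... -k cuda` invocations need to share.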

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146398
Approved by: https://github.com/Skylion007, https://github.com/jcaip, https://github.com/eqy
2025-02-04 22:07:12 +00:00
317dae95fa cpp_wrapper: fix CPU cpp_wrapper and max-autotune tests (#145683)
Both of these tests mostly failed due to incorrect assumptions about the generated code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145683
Approved by: https://github.com/desertfire
ghstack dependencies: #145095, #145654, #145655
2025-02-04 22:05:59 +00:00
e2a029054d cpp_wrapper: enable all CPU repro tests (#145655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145655
Approved by: https://github.com/desertfire
ghstack dependencies: #145095, #145654
2025-02-04 22:05:59 +00:00
9873319a42 cpp_wrapper: fix set_.source_Tensor lowering (#145654)
Adds a C-shim fallback for `set_.source_Tensor`, which is effectively required by `ir.SetSourceTensorKernel`. As a necessary prerequisite to use that IR node, updates `CppWrapperCpu` to handle in-place returns in C-shim ops (the arguments for those returns are silently dropped by `torchgen`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145654
Approved by: https://github.com/desertfire
ghstack dependencies: #145095
2025-02-04 22:05:59 +00:00
7c0fe7a045 cpp_wrapper/aot_inductor: handle conjugation and negation dispatch keys (#145095)
Handles conjugation and negation in the same way that runtime dispatch does: by on-the-fly cloning a tensor with either key applied.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145095
Approved by: https://github.com/desertfire
2025-02-04 22:05:58 +00:00
09b0dfdc90 [metal] Add a missing cast to make the call to copysign unambiguous. (#146422)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146422
Approved by: https://github.com/Skylion007, https://github.com/Samkm0084
2025-02-04 22:04:25 +00:00
clr
4e194bbfd6 dynamo: fsdp throw unimplemented vs attribute error (#146188)
Rather than throw a full exception for fsdp, instead just return unimplemented,
and respect the user options (i.e. fullgraph, vs graph break).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146188
Approved by: https://github.com/jansel
2025-02-04 21:45:55 +00:00
4c5a9a5f94 [Testing] Reduce test_exp flakiness (#146436)
By setting `reference_in_float` to false, as `exp(a + b)` could yield significantly different results than `exp(a.half()+b.half())`, as one can see in the following example (which happens to use the random values generated by the macOS RNG for this test):

```
>>> import torch
>>> x=torch.tensor(2.5599, dtype=torch.half)
>>> y=torch.tensor(0.6970, dtype=torch.half)
>>> (x + y).exp()
tensor(26., dtype=torch.float16)
>>> (x.float() + y.float()).exp()
tensor(25.9799)
```
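To see where the difference comes from without a GPU or even PyTorch, the same arithmetic can be emulated with the standard library. This is only a sketch: `to_half` is a hypothetical helper that rounds through the `struct` half-precision format, and the reference path here uses Python's double precision rather than float32.

```python
import math
import struct


def to_half(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


# The two inputs from the example above, rounded to half precision.
x = to_half(2.5599)
y = to_half(0.6970)

# Half-precision path: the sum is rounded to fp16 before exp, and so is the result.
half_result = to_half(math.exp(to_half(x + y)))

# Reference path: same fp16 inputs, but the arithmetic stays in higher precision.
float_result = math.exp(x + y)
```

Rounding the sum to half precision shifts the argument of `exp` by roughly 0.0005, and `exp` amplifies that shift into the gap between the two results.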

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146436
Approved by: https://github.com/dcci
2025-02-04 21:24:08 +00:00
bc33d993ac add support for capturing provenance of unary operations (#146413)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146413
Approved by: https://github.com/angelayi
ghstack dependencies: #145848
2025-02-04 21:16:15 +00:00
07b9fe0690 [Trace PyDispatcher] Add CustomFunctionHigherOrderOperatorVariable (#146272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146272
Approved by: https://github.com/zou3519
ghstack dependencies: #146270, #146271
2025-02-04 20:55:51 +00:00
d23e4f8109 use DTRACE_ENV_VAR as the trace logs directory if set (#146412)
```
(/home/bobren/local/a/pytorch-env) [7:47] devgpu035:/home/bobren/local/a/pytorch TORCH_DTRACE=/tmp/bb python r1.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146412
Approved by: https://github.com/angelayi
ghstack dependencies: #145848
2025-02-04 20:54:28 +00:00
7f65a20884 [BE]: Enable ruff SLOT checks (#146276)
This enables a check that a class which only inherits from immutable classes like str, tuple, and NamedTuple also defines `__slots__`, so that its instances don't allocate memory unnecessarily. This also ensures contributors think about how they define classes that subclass NamedTuple and str, of which we have many in our codebase.
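As a small illustration of what the check is after (the class names below are hypothetical and not taken from the codebase), a subclass of a `NamedTuple` without `__slots__` gets a per-instance `__dict__`, while one that declares empty `__slots__` does not:

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int


class LooseVersion(Version):
    """Subclass without __slots__: every instance gets its own __dict__."""

    def label(self) -> str:
        return f"{self.major}.{self.minor}"


class TightVersion(Version):
    """Subclass with empty __slots__: no per-instance __dict__ is allocated."""

    __slots__ = ()

    def label(self) -> str:
        return f"{self.major}.{self.minor}"
```

Checking `hasattr(instance, "__dict__")` on an instance of each class shows the difference; behavior is otherwise identical.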

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146276
Approved by: https://github.com/aorenste
2025-02-04 19:18:23 +00:00
3525b834f0 [MPSInductor] Implement argmax/argmin (#146429)
TODOs:
 - Find test with NaN
 - Report internal compiler error when running `test_argmax_argmin1` (which is actually not enough shared memory)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146429
Approved by: https://github.com/dcci
ghstack dependencies: #146423, #146428
2025-02-04 19:16:06 +00:00
c591ad0c03 dump partial fx graph to stderr when dynamo tracing fails with guard on data-dependent (#146296)
As discussed with @avikchaudhuri and @bdhirsh last week, this can be quite useful when debugging.

The following code produces a data-dependent error:

```
import torch
from torch import nn

# UserError: Could not guard on data-dependent expression Eq(507 - u0, 0) (unhinted: Eq(507 - u0, 0)).  (Size-like symbols: u0)
class Repro(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, cache, update, pos):
        _, _, max_seq_len, _ = cache.shape
        _, _, seqlen, _ = update.shape

        pos_item = pos[0].item() # u0
        torch._check(pos_item + seqlen <= max_seq_len) # u0 + 502 <= 507
        torch._check(pos_item >= 0)
        before = cache.narrow(2, 0, pos_item)

        # FAIL
        # Laith: why can't we make unbacked expressions size-like?
        after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen))

        # PASS
        end = torch.tensor(max_seq_len - pos_item - seqlen).item()
        after = cache.narrow(2, (pos_item + seqlen), end)

        return torch.cat([before, update, after], dim=2)

repro = Repro()

bsz = 1
n_heads = 4
max_seq_len = 512
head_dim = 64
seqlen = 5
pos_item = 1

cache = torch.zeros(bsz, n_heads, max_seq_len, head_dim)
update = torch.ones(bsz, n_heads, seqlen, head_dim)
pos = torch.tensor([pos_item])
example_inputs = (cache, update, pos)

torch.export.export(repro, example_inputs)
```

This is what it now prints out:

```
class GraphModule(torch.nn.Module):
    def forward(self, L_cache_: "f32[1, 4, 512, 64][131072, 32768, 64, 1]cpu", L_update_: "f32[1, 4, 5, 64][1280, 320, 64, 1]cpu", L_pos_: "i64[1][1]cpu"):
        l_cache_ = L_cache_
        l_update_ = L_update_
        l_pos_ = L_pos_

         # File: /data/users/bobren/a/pytorch/r1.py:14 in forward, code: pos_item = pos[0].item() # u0
        getitem: "i64[][]cpu" = l_pos_[0];  l_pos_ = None
        item: "Sym(u0)" = getitem.item();  getitem = None

         # File: /data/users/bobren/a/pytorch/r1.py:15 in forward, code: torch._check(pos_item + seqlen <= max_seq_len) # u0 + 502 <= 507
        add: "Sym(u0 + 5)" = item + 5
        le: "Sym(u0 + 5 <= 512)" = add <= 512;  add = None
        _check = torch._check(le);  le = _check = None

         # File: /data/users/bobren/a/pytorch/r1.py:16 in forward, code: torch._check(pos_item >= 0)
        ge: "Sym(u0 >= 0)" = item >= 0
        _check_1 = torch._check(ge);  ge = _check_1 = None

         # File: /data/users/bobren/a/pytorch/r1.py:17 in forward, code: before = cache.narrow(2, 0, pos_item)
        before: "f32[1, 4, u0, 64][131072, 32768, 64, 1]cpu" = l_cache_.narrow(2, 0, item);  before = None

         # File: /data/users/bobren/a/pytorch/r1.py:21 in forward, code: after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen))
        add_1: "Sym(u0 + 5)" = item + 5
        sub: "Sym(512 - u0)" = 512 - item;  item = None
        sub_1: "Sym(507 - u0)" = sub - 5;  sub = None
        narrow_1 = l_cache_.narrow(2, add_1, sub_1);  l_cache_ = add_1 = sub_1 = narrow_1 = None

Traceback (most recent call last):
  File "/data/users/bobren/a/pytorch/torch/_dynamo/utils.py", line 3075, in run_node
    return getattr(args[0], node.target)(*args[1:], **kwargs)
  File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1267, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1808, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1369, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 2282, in _dispatch_impl
    decomposition_table[func](*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_decomp/decompositions.py", line 759, in slice_forward
    return self.as_strided(sizes, strides, storage_offset)
  File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1267, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1808, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1370, in _cached_dispatch_impl
    entry = self._make_cache_entry(state, key, func, args, kwargs, output)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1640, in _make_cache_entry
    output_info = self._get_output_info_for_cache_entry(
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1583, in _get_output_info_for_cache_entry
    synth_output = self._output_from_cache_entry(
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1738, in _output_from_cache_entry
    return self._get_output_tensor_from_cache_entry(
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1709, in _get_output_tensor_from_cache_entry
    empty.set_(storage, storage_offset, shape, stride)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/sym_node.py", line 564, in guard_size_oblivious
    r = self.shape_env.evaluate_expr(
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/recording.py", line 263, in wrapper
    return retlog(fn(*args, **kwargs))
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6468, in evaluate_expr
    return self._evaluate_expr(
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6658, in _evaluate_expr
    raise self._make_data_dependent_error(
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Ne(507 - u0, 1) (unhinted: Ne(507 - u0, 1)).  (Size-like symbols: u0)

Caused by: after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen))  # r1.py:21 in forward (utils/_stats.py:27 in wrapper)
For more information, run with TORCH_LOGS="dynamic"
For extended logs when we create symbols, also add TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="u0"
If you suspect the guard was triggered from C++, add TORCHDYNAMO_EXTENDED_DEBUG_CPP=1
For more debugging help, see https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit?usp=sharing
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146296
Approved by: https://github.com/zou3519
ghstack dependencies: #146298
2025-02-04 19:12:39 +00:00
8f861a8dfb [experimental] filter logs by subgraph (#146047)
```
TORCH_LOGS="dynamo" TORCH_LOGS_TRACE_ID_FILTER="[1/0]" python r4.py
```

```
TORCH_LOGS="dynamo" TORCH_LOGS_TRACE_ID_FILTER="[0/0],[1/0_1]" python r4.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146047
Approved by: https://github.com/laithsakka
2025-02-04 19:11:44 +00:00
7d60235aa6 [Metal] Small speedup for sum/prod (#146428)
As they can not really be invoked over empty arrays
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146428
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #146423
2025-02-04 19:10:33 +00:00
b1663b31e1 [Metal][BE] Add #pragma once to all headers (#146423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146423
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-02-04 19:10:33 +00:00
292af3cc89 [BE][Ez]: ISC001 Auto concatenate implicit one line strings (#146408)
Apply ruff rule about implicit string concatenation, this autofixes strings that are all the same type and on the same line. These lines are broken up likely as the result of autoformatters in the past. All fixes are automated using the autofixes in ISC001.
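For illustration, with made-up strings rather than ones from the codebase, this is the kind of pattern the rule targets and the form the autofix produces:

```python
# Before: two adjacent literals on one line, typically left behind when an
# autoformatter joined a string that used to be split across lines.
before = "expected a tensor " "but got a list"

# After: the single literal that replaces them.
after = "expected a tensor but got a list"
```

Both spellings evaluate to the same string, so the fix is purely cosmetic.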

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146408
Approved by: https://github.com/justinchuby, https://github.com/janeyx99
2025-02-04 19:07:04 +00:00
f38a2ea0d4 [Dynamo] Better unsupported message for Fake Tensor Exception (#146357)
I cannot repro this. But this line shows up in internal logs, and I want
to know what the exception is and the context inside it. All of the
exceptions_allowed_to_be_fallback are dataclasses, so they should print
nicely.

Test Plan:
- code reading

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146357
Approved by: https://github.com/williamwen42
2025-02-04 18:52:11 +00:00
b0fe975521 [hop][inductor] track the dependency on unbacked symbols correctly with constant_args for hops (#143456)
Before the PR, we're getting an undefined symbol error for output code when an unbacked symint is **only** used in the hop because we didn't correctly record the dependency of the unbacked symbols for hops and it gets DCEed accidentally.

This PR adds the symbol arguments to `constant_args`, where the dependencies can be correctly constructed when `get_unbacked_symbol_uses` is called to check constant_args.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143456
Approved by: https://github.com/desertfire
2025-02-04 18:47:34 +00:00
157d81c201 Move get accelerator to use build time flags when possible (#146098)
This PR does two main things (they are in a single PR to show how the newly added APIs are used).

- Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See inline doc for their exact semantics
- Use the newly added isBuilt for accelerator check to ensure it does not poison fork

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-02-04 18:23:24 +00:00
23fffb54d5 Use OrderedSet in _functorch/partitioners (#146102)
In an attempt to make partitioning more deterministic, change all sets in partitioners.py to OrderedSets. Note that this change does not fix the non-determinism we're seeing in the internal model. But let's at least eliminate this potential source of non-determinism before investigating any changes to the mincut approach?
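The motivation can be sketched in plain Python. The iteration order of a builtin `set` of strings can differ between processes because of hash randomization, whereas a container backed by `dict` iterates in insertion order. The class below is a hypothetical minimal stand-in, not the `OrderedSet` used in partitioners.py:

```python
class InsertionOrderedSet:
    """Minimal set-like container that iterates in insertion order."""

    def __init__(self, items=()):
        # dict preserves insertion order (guaranteed since Python 3.7).
        self._items = dict.fromkeys(items)

    def add(self, item):
        self._items[item] = None

    def __contains__(self, item):
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


# Duplicates are dropped, and the first insertion fixes an element's position.
nodes = InsertionOrderedSet(["mul", "add", "relu", "add"])
nodes.add("sum")
ordered = list(nodes)
```

Any algorithm that iterates over `nodes` then visits elements in the same order on every run, which removes one possible source of run-to-run variation.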
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146102
Approved by: https://github.com/oulgen
2025-02-04 17:43:07 +00:00
53759ccca8 [AOTI] Fix an unaligned memory access issue in mm_template (#146293)
Summary: Fixes a corner case in the Triton MM template where the dimension M (dynamic size) being smaller than BLOCK_M (and similarly for the N dimension) can trigger an unaligned memory access error.

Differential Revision: D69034578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146293
Approved by: https://github.com/chenyang78, https://github.com/jansel
2025-02-04 17:12:04 +00:00
87a63a9886 Add @nikitaved to torch.linalg CODEOWNERS/persons_of_interest (#141803)
As per title. I hope there is no objection :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141803
Approved by: https://github.com/albanD
2025-02-04 16:11:31 +00:00
e9f6e273e7 [inductor] Add typing to common.CSE (#145993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145993
Approved by: https://github.com/yanboliang
ghstack dependencies: #145916
2025-02-04 16:05:39 +00:00
7a5239afd7 [inductor] Add typing to common.KernelArgs (#145916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145916
Approved by: https://github.com/yanboliang
2025-02-04 16:05:39 +00:00
5d81bc3696 [MPSInductor] Implement prod reduction (#146396)
Mostly reusing `sum` reduction logic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146396
Approved by: https://github.com/dcci
ghstack dependencies: #146369, #146370, #146380, #146389
2025-02-04 14:08:04 +00:00
bbe95341d9 [MPSInductor] Implement min and max reductions (#146389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146389
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #146369, #146370, #146380
2025-02-04 14:04:10 +00:00
106acf0eec Revert "[aoti] Assign proxy call args by name, and support default values. (#146263)"
This reverts commit 11f69808c64a65c68a4452250ba7719dcff27c78.

Reverted https://github.com/pytorch/pytorch/pull/146263 on behalf of https://github.com/atalman due to multiple build failures, please see associated diff ([comment](https://github.com/pytorch/pytorch/pull/146263#issuecomment-2633828689))
2025-02-04 12:57:55 +00:00
e0f22e54e8 [ROCm][TunableOp] Support leading dimensions in TunableOp signature. (#146358)
This is a feature enhancement that:
- May improve performance by distinguishing GEMMs with different leading dimensions.
- Fix correctness issues reported by users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146358
Approved by: https://github.com/jeffdaily
2025-02-04 10:27:43 +00:00
cyy
3f63f2bced Use std::string_view in tests (#146120)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146120
Approved by: https://github.com/albanD
2025-02-04 09:51:36 +00:00
8444fe019a [export] Fix requires_grad deserialization (#146351)
Test Plan: CI

Differential Revision: D69072095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146351
Approved by: https://github.com/zhxchen17
2025-02-04 08:02:38 +00:00
bb4bd5f00b [Metal][BE] Fix the arguments of polygamma (#146382)
In the public API, order comes before input, while here they're
 reversed. Match for consistency (and make this less error prone).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146382
Approved by: https://github.com/jansel, https://github.com/malfet
2025-02-04 06:40:34 +00:00
54ceb7c565 [MPSInductor] Add support for sum reduction (#146380)
- Add `threadgroup_sum` template to `c10/metal/reduction_utils.h` that so far uses a barrier to compute the reductions

TODOs:
 - Implement efficient reduction using cooperative functions such as `simd_shuffle_down`
 - Figure out how to merge several sum reductions together
 - Implement `reduction_store` that will only write results from the first thread
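
The first TODO above refers to reductions built from `simd_shuffle_down`. As a rough illustration of the access pattern only, here is a pure-Python simulation (this is not Metal code; the function name `shuffle_down_sum` and the power-of-two lane count are assumptions made for this sketch):

```python
def shuffle_down_sum(lanes):
    """Simulate a shuffle-down tree reduction over one SIMD group.

    `lanes` holds one value per lane; its length must be a power of two.
    """
    vals = list(lanes)
    n = len(vals)
    assert n > 0 and n & (n - 1) == 0, "lane count must be a power of two"
    offset = n // 2
    while offset > 0:
        # Every lane i adds the value currently held by lane i + offset.
        # Lanes whose partner would be out of range keep their value.
        vals = [
            vals[i] + vals[i + offset] if i + offset < n else vals[i]
            for i in range(n)
        ]
        offset //= 2
    # After log2(n) steps, lane 0 holds the sum of all lanes.
    return vals[0]


total = shuffle_down_sum(range(32))
```

Each step halves the distance between partners, so the reduction takes log2(n) steps without a threadgroup barrier between them in the real hardware primitive.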

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146380
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #146369, #146370
2025-02-04 06:23:44 +00:00
cyy
1c16cf70c3 Apply ruff fixes to tests (#146140)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146140
Approved by: https://github.com/albanD
2025-02-04 05:41:01 +00:00
cyy
71e3575525 Remove unactivated test (#146233)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146233
Approved by: https://github.com/rec, https://github.com/albanD
2025-02-04 05:26:04 +00:00
e68f5087d8 update _unsafe_set_version_counter to accept lists of tensors (#137921)
See the comment [here](https://github.com/pytorch/pytorch/issues/132014#issuecomment-2379547400) (cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @XilunWu @rec) - this PR updates `_unsafe_set_version_counter` to accept a list of tensors, for overhead-sensitive users (e.g. distributed) who need to hide VC bumps from autograd on a large list of tensors without wanting to suffer the overhead of going from python->C++ separately for every tensor in the list.

I left the binding in pybind and used a `std::vector`. If we **really** need to optimize overhead even further, we could write a manual cpython binding.

I use this updated API in the next PR to fix FSDP2, so that it properly hides the VC of all `all_gather_buffer` tensors in its call to `split_with_sizes_copy.out(all_gather_buffers)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137921
Approved by: https://github.com/awgu, https://github.com/albanD
2025-02-04 04:51:11 +00:00
425aca40a4 Fix random crash in PyPer (#146327)
Summary: PyPer saw random crashes when writing to the ET file. This diff checks that the output file is in good condition before writing to it, and catches the exception if something goes wrong, instead of crashing.

Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA

Differential Revision: D69065509

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146327
Approved by: https://github.com/sraikund16
2025-02-04 04:50:40 +00:00
0c37c332da [export] Additionally save pytree namedtuple field names (#145956)
If a user passes in a namedtuple as an input, currently the input TreeSpec looks like: `TreeSpec(type=namedtuple, context="class_fqn", children_spec=[*, *])`

The user then saves the program containing this input TreeSpec. But what happens if they load it in a new environment where `class_fqn` now contains an additional field?

This means that the exported program is now expected to take in another input. But since those fields were not used in the original program, users should be able to just drop those additional fields and the program will run successfully. This is needed/used in APS where they use unflattener's adapter to adapt the inputs based on the previously saved treespecs.

There are a couple of [solutions](https://docs.google.com/document/d/1V4ZSdy-8PUISWc8RqvGu3DU01BVegJhHHPWqa1Io7Eg/edit?tab=t.0) for how we can address this, but eventually we settled on saving a side table mapping namedtuple types to their list of field names, which can then be accessed by the adapter.
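
A minimal sketch of the side-table idea, under stated assumptions: the table here is keyed by the bare type name for brevity, and `adapt_input`, `PointV1`, and `PointV2` are hypothetical names that do not appear in the PR.

```python
from collections import namedtuple

# The type as it looked when the program was exported.
PointV1 = namedtuple("Point", ["x", "y"])

# Side table saved next to the program: type name -> field names at export time.
saved_fields = {"Point": list(PointV1._fields)}

# The same type in a newer environment, now with an extra field.
PointV2 = namedtuple("Point", ["x", "y", "z"])


def adapt_input(value, field_table):
    """Keep only the fields recorded at export time, in their saved order."""
    names = field_table[type(value).__name__]
    return tuple(getattr(value, name) for name in names)


adapted = adapt_input(PointV2(x=1, y=2, z=3), saved_fields)
```

Because the saved field names are looked up by name rather than by position, the extra field is dropped regardless of where it was added in the new definition.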
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145956
Approved by: https://github.com/zhxchen17
2025-02-04 04:42:30 +00:00
487400f47f [dynamo] Support functools.partial variables through inspect.signature (#146339)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146339
Approved by: https://github.com/jansel
ghstack dependencies: #146322, #146116
2025-02-04 04:39:39 +00:00
9756c7d788 [benchmark] Remove ONNX (#146325)
The ONNX exporter experiments in benchmark are obsolete and unmaintained. This PR removes them to unblock https://github.com/pytorch/pytorch/pull/146003

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146325
Approved by: https://github.com/titaiwangms
2025-02-04 04:02:47 +00:00
a79d8f8ba4 [ROCm] Tune 3d tensor sums when not using fastest dimension (#146170)
Tune 3d tensor sums when not using fastest dimension.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146170
Approved by: https://github.com/jeffdaily
2025-02-04 04:02:16 +00:00
7997ecf809 [BE] reduce log spew from test_triton_kernels.py (#145895)
One of the tests in this file was setting `self._logging.set_logs(output_code=True)` - which would cause logs to be printed for the rest of the tests in this file.

This PR puts the log-setting in a context manager so that the old behavior is restored afterwards.
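
The pattern can be sketched with the standard `logging` module. This is an illustration of scoped log configuration in general; `temporary_log_level` is a hypothetical helper, not the API used in the test file.

```python
import logging
from contextlib import contextmanager


@contextmanager
def temporary_log_level(logger: logging.Logger, level: int):
    """Set `level` on `logger` for the duration of the block, then restore it."""
    previous = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        # Runs even if the test body raises, so later tests see the old level.
        logger.setLevel(previous)


log = logging.getLogger("example.output_code")
log.setLevel(logging.WARNING)
with temporary_log_level(log, logging.DEBUG):
    level_inside = log.level
level_after = log.level
```

Restoring in a `finally` clause is the important part: a failing test would otherwise leave verbose logging enabled for every test that runs after it.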

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145895
Approved by: https://github.com/nmacchioni
2025-02-04 03:44:23 +00:00
5f53889850 [dynamo][builtin-skipfiles-cleanup] Remove inspect (#146116)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146116
Approved by: https://github.com/williamwen42, https://github.com/zou3519, https://github.com/jansel
ghstack dependencies: #146322
2025-02-04 03:36:07 +00:00
762a05b3b3 [DCP] Remove all-gather of state dict keys (#145998)
The original `_all_gather_keys` call was for a safety check, but could be costly as things scale, and it blocks CPU.

Instead, we make it clear in the documentation that the `state_dict` passed to the `load` API should have the same set of keys on every rank, otherwise the API may hang.

In addition, we move the check to a utility function: `utils.assert_same_keys`. Users who are uncertain whether their state dict keys match can optionally call this API to check.

Resolves #145965 (as a workaround).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145998
Approved by: https://github.com/mhorowitz, https://github.com/fegin
2025-02-04 03:16:13 +00:00
7f796eb8b7 Revert "[inductor] Add typing to common.KernelArgs (#145916)"
This reverts commit 68cf36d5ab6165372160f65eb84e13d0f8dbc5dc.

Reverted https://github.com/pytorch/pytorch/pull/145916 on behalf of https://github.com/atalman due to Failing internally, please see associated diff ([comment](https://github.com/pytorch/pytorch/pull/145916#issuecomment-2632715678))
2025-02-04 03:07:12 +00:00
d3c7e4bb9c Revert "[inductor] Add typing to common.CSE (#145993)"
This reverts commit 8c657ae4be55c6133307ad278c1740af5db133a7.

Reverted https://github.com/pytorch/pytorch/pull/145993 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/145916 ([comment](https://github.com/pytorch/pytorch/pull/145993#issuecomment-2632712384))
2025-02-04 03:04:01 +00:00
ecbc725fad Revert "[inductor] Finish typing common.py (#146225)"
This reverts commit 3a67c0e48d29578aeeaa872275e730020bb5cbc2.

Reverted https://github.com/pytorch/pytorch/pull/146225 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/145916 ([comment](https://github.com/pytorch/pytorch/pull/146225#issuecomment-2632709707))
2025-02-04 03:01:36 +00:00
0061eb5b70 Revert "[inductor] Refactor CSEProxy into global scope (#146226)"
This reverts commit 18380ab877711f2e651c69c78675f0d0b31d2ceb.

Reverted https://github.com/pytorch/pytorch/pull/146226 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/145916 ([comment](https://github.com/pytorch/pytorch/pull/146226#issuecomment-2632707618))
2025-02-04 02:58:50 +00:00
cyy
f397c72697 Remove NOLINTNEXTLINE (#146238)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146238
Approved by: https://github.com/albanD
2025-02-04 02:45:32 +00:00
5451c9b7c9 [MPSInductor] Add support for any reduction (#146370)
- Add `_new_accvar` function that creates a threadgroup variable
- As threadgroup variables can not be initialized in place, add explicit initialization for reduction var

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146370
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #146369
2025-02-04 02:45:03 +00:00
71179772cd [MPSInductor] Prep change for reduction support (#146369)
Add `group_pos` parameter as well as set `group_size` when invoking reduction kernels
Separate loads and stores and insert a threadgroup barrier if reduction is in place

Should be a no-op right now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146369
Approved by: https://github.com/dcci, https://github.com/jansel
2025-02-04 02:38:07 +00:00
3dcbd04d1d [cutlass backend] Add instantiation level for generating configs (#146230)
Passing through instantiation level to generate more configs.

I do see some C++ compilation errors, but running is fine. Using 2222 generates 1k+ configs.

Differential Revision: D68989194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146230
Approved by: https://github.com/Chillee, https://github.com/mlazos
2025-02-04 02:36:04 +00:00
0e49f35e3d Integrate sympy expression provenance logging with structured logs (#145848)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145848
Approved by: https://github.com/angelayi
2025-02-04 01:21:37 +00:00
4168982dad PEP585: .github release triggers (#145708)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145708
Approved by: https://github.com/malfet
2025-02-04 01:02:46 +00:00
cf6c5b8fa8 [mps/inductor] Adjust more tests that expect float64 as input. (#146366)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146366
Approved by: https://github.com/malfet
2025-02-04 00:48:02 +00:00
2f40f789da Revert "[inductor] Refactor op handlers part 1 (#146235)"
This reverts commit 204be4e0a2e4509bd2457bfb295c429dd92c241f.

Reverted https://github.com/pytorch/pytorch/pull/146235 on behalf of https://github.com/atalman due to Breaks lint, sorry: Definition of polygamma in base class MetalOverrides is incompatible with definition in base class OpsHandler. Please rebase fix lint and reland ([comment](https://github.com/pytorch/pytorch/pull/146235#issuecomment-2632444514))
2025-02-04 00:00:08 +00:00
3aeccf2a28 DeepSpeed github repo move sync (#146320)
DeepSpeed has moved to a new repo on github https://github.com/deepspeedai/DeepSpeed

This PR updates this repo to use the new URL.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146320
Approved by: https://github.com/awgu
2025-02-03 23:20:49 +00:00
204be4e0a2 [inductor] Refactor op handlers part 1 (#146235)
This enforces the invariant that every backend implements the same set of ops and removes a layer of indirection for BasicMathOps.

Interestingly this is a small compile time win:
```
...
WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 30151159301 is -6.13% lower than expected 32120000000 ±1.50% please update the expected results.

please update all results that changed significantly, and not only the failed ones
PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 44447549162 -1.69% is within expected 45210000000 ±2.50%

WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26743557195 is -2.25% lower than expected 27360000000 ±1.50% please update the expected results.

please update all results that changed significantly, and not only the failed ones
PASS: benchmark ('basic_modules_ListOfLinears_eager', 'compile_time_instruction_count') pass, actual result 945129734 +0.93% is within expected 936400000 ±1.50%

WIN: benchmark ('basic_modules_ListOfLinears_inductor', 'compile_time_instruction_count') failed, actual result 18984384503 is -3.19% lower than expected 19610000000 ±1.50% please update the expected results.

please update all results that changed significantly, and not only the failed ones
WIN: benchmark ('basic_modules_ListOfLinears_inductor_gpu_force_shape_pad', 'compile_time_instruction_count') failed, actual result 17258025389 is -1.94% lower than expected 17600000000 ±1.50% please update the expected results.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146235
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226
2025-02-03 23:15:13 +00:00
18380ab877 [inductor] Refactor CSEProxy into global scope (#146226)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146226
Approved by: https://github.com/shunting314
ghstack dependencies: #146225
2025-02-03 23:15:13 +00:00
0bc036a9e9 use copy2d in h2d/d2h copy when possible (#146256)
A rewrite of #138964
In addition to rewriting the conditions for using copy2d, this PR fixes a few other problems with #138964:
1) gpu-gpu copies when peer access is disabled shouldn't rely on copy2d
2) copy2d should record even for the host pinned memory, like the regular copy does
3) copy2d shouldn't pretend that it's synchronizing (for the purposes of cuda sanitizer tracer) when it's non-blocking

In this PR, copy2d behaves exactly as copy does with respect to those additional syncs, except that it makes a different underlying CUDA call.

Tests are added for multiple cases that go through copy2d and for cases that avoid the copy2d pattern because its conditions are not satisfied.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146256
Approved by: https://github.com/eqy, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-03 23:07:54 +00:00
35af193408 [easy] Add type annotation for autotune_num_choices_displayed (#146323)
Test Plan: ci

Differential Revision: D69064447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146323
Approved by: https://github.com/ColinPeppler
2025-02-03 23:04:21 +00:00
0463cb6ca5 [mps/inductor] Add support for digamma(). (#146292)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146292
Approved by: https://github.com/malfet, https://github.com/jansel
2025-02-03 22:48:13 +00:00
178531c95e [ONNX] torch.onnx.export(dynamo=True) changes optimization to default (#146187)
Fixes #145897
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146187
Approved by: https://github.com/justinchuby
2025-02-03 22:44:54 +00:00
d69c181d77 log out partial fx graph when guard on data dependent during non-strict tracing (#146298)
As discussed with @avikchaudhuri and @bdhirsh last week, this can be quite useful when debugging.

The following code produces a data dependent error

```
import torch
from torch import nn

# UserError: Could not guard on data-dependent expression Eq(507 - u0, 0) (unhinted: Eq(507 - u0, 0)).  (Size-like symbols: u0)
class Repro(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, cache, update, pos):
        _, _, max_seq_len, _ = cache.shape
        _, _, seqlen, _ = update.shape

        pos_item = pos[0].item() # u0
        torch._check(pos_item + seqlen <= max_seq_len) # u0 + 502 <= 507
        torch._check(pos_item >= 0)
        before = cache.narrow(2, 0, pos_item)

        # FAIL
        # Laith: why can't we make unbacked expressions size-like?
        after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen))

        # PASS
        end = torch.tensor(max_seq_len - pos_item - seqlen).item()
        after = cache.narrow(2, (pos_item + seqlen), end)

        return torch.cat([before, update, after], dim=2)

repro = Repro()

bsz = 1
n_heads = 4
max_seq_len = 512
head_dim = 64
seqlen = 5
pos_item = 1

cache = torch.zeros(bsz, n_heads, max_seq_len, head_dim)
update = torch.ones(bsz, n_heads, seqlen, head_dim)
pos = torch.tensor([pos_item])
example_inputs = (cache, update, pos)

torch.export.export(repro, example_inputs, strict=False)
```

This is what it now prints out

```
class GraphModule(torch.nn.Module):
    def forward(self, arg0_1: "f32[1, 4, 512, 64][131072, 32768, 64, 1]cpu", arg1_1: "f32[1, 4, 5, 64][1280, 320, 64, 1]cpu", arg2_1: "i64[1][1]cpu"):
         # File: /data/users/bobren/a/pytorch/r1.py:14 in forward, code: pos_item = pos[0].item() # u0
        select: "i64[][]cpu" = torch.ops.aten.select.int(arg2_1, 0, 0);  arg2_1 = None
        item: "Sym(u0)" = torch.ops.aten.item.default(select);  select = None

         # File: /data/users/bobren/a/pytorch/r1.py:15 in forward, code: torch._check(pos_item + seqlen <= max_seq_len) # u0 + 502 <= 507
        add: "Sym(u0 + 5)" = item + 5
        le: "Sym(u0 + 5 <= 512)" = add <= 512;  add = le = None

         # File: /data/users/bobren/a/pytorch/r1.py:16 in forward, code: torch._check(pos_item >= 0)
        ge: "Sym(u0 >= 0)" = item >= 0;  ge = None

         # File: /data/users/bobren/a/pytorch/r1.py:17 in forward, code: before = cache.narrow(2, 0, pos_item)
        narrow: "f32[1, 4, u0, 64][131072, 32768, 64, 1]cpu" = torch.ops.aten.narrow.default(arg0_1, 2, 0, item);  narrow = None

         # File: /data/users/bobren/a/pytorch/r1.py:21 in forward, code: after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen))
        add_1: "Sym(u0 + 5)" = item + 5
        sub: "Sym(512 - u0)" = 512 - item;  item = None
        sub_1: "Sym(507 - u0)" = sub - 5;  sub = None
        narrow_1 = torch.ops.aten.narrow.default(arg0_1, 2, add_1, sub_1);  arg0_1 = add_1 = sub_1 = narrow_1 = None

Traceback (most recent call last):
  File "/data/users/bobren/a/pytorch/r1.py", line 45, in <module>
    torch.export.export(repro, example_inputs, strict=False)
  File "/data/users/bobren/a/pytorch/torch/export/__init__.py", line 368, in export
    return _export(
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1044, in wrapper
    raise e
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1017, in wrapper
    ep = fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/export/exported_program.py", line 117, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 2079, in _export
    return _export_for_training(
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1044, in wrapper
    raise e
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1017, in wrapper
    ep = fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/export/exported_program.py", line 117, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1944, in _export_for_training
    export_artifact = export_func(  # type: ignore[operator]
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1879, in _non_strict_export
    aten_export_artifact = _to_aten_func(  # type: ignore[operator]
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1665, in _export_to_aten_ir_make_fx
    gm, graph_signature = transform(_make_fx_helper)(
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1809, in _aot_export_non_strict
    gm, sig = aot_export(wrapped_mod, args, kwargs=kwargs, **flags)
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1585, in _make_fx_helper
    gm = make_fx(
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 2194, in wrapped
    return make_fx_tracer.trace(f, *args)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 2132, in trace
    return self._trace_inner(f, *args)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 2103, in _trace_inner
    t = dispatch_trace(
  File "/data/users/bobren/a/pytorch/torch/_compile.py", line 51, in inner
    return disable_fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_dynamo/eval_frame.py", line 749, in _fn
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1136, in dispatch_trace
    graph = tracer.trace(root, concrete_args)  # type: ignore[arg-type]
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1692, in trace
    res = super().trace(root, concrete_args)
  File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 834, in trace
    (self.create_arg(fn(*args)),),
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1191, in wrapped
    out = f(*tensors)  # type:ignore[call-arg]
  File "<string>", line 1, in <lambda>
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1488, in wrapped_fn
    return tuple(flat_fn(*args))
  File "/data/users/bobren/a/pytorch/torch/_functorch/_aot_autograd/utils.py", line 184, in flat_fn
    tree_out = fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 879, in functional_call
    out = mod(*args[params_len:], **kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 811, in module_call_wrapper
    return self.call_module(mod, forward, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1762, in call_module
    return Tracer.call_module(self, m, forward, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 529, in call_module
    ret_val = forward(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 804, in forward
    return _orig_module_call(mod, *args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/nn/modules/module.py", line 1749, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/nn/modules/module.py", line 1760, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1793, in forward
    tree_out = mod(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 811, in module_call_wrapper
    return self.call_module(mod, forward, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1762, in call_module
    return Tracer.call_module(self, m, forward, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 529, in call_module
    ret_val = forward(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 804, in forward
    return _orig_module_call(mod, *args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/nn/modules/module.py", line 1749, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/nn/modules/module.py", line 1760, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/r1.py", line 21, in forward
    after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen))
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1239, in __torch_function__
    return func(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1286, in __torch_function__
    return func(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_export/non_strict_utils.py", line 654, in __torch_function__
    return func(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_ops.py", line 866, in handler
    return torch._library.utils.handle_dispatch_mode(
  File "/data/users/bobren/a/pytorch/torch/_library/utils.py", line 296, in handle_dispatch_mode
    return curr_mode.__torch_dispatch__(op_overload, overload_types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1341, in __torch_dispatch__
    return proxy_call(self, func, self.pre_dispatch, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 910, in proxy_call
    out = func(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_ops.py", line 749, in __call__
    return self._op(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1267, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1808, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1369, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 2282, in _dispatch_impl
    decomposition_table[func](*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_decomp/decompositions.py", line 759, in slice_forward
    return self.as_strided(sizes, strides, storage_offset)
  File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1267, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1808, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1370, in _cached_dispatch_impl
    entry = self._make_cache_entry(state, key, func, args, kwargs, output)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1640, in _make_cache_entry
    output_info = self._get_output_info_for_cache_entry(
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1583, in _get_output_info_for_cache_entry
    synth_output = self._output_from_cache_entry(
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1738, in _output_from_cache_entry
    return self._get_output_tensor_from_cache_entry(
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1709, in _get_output_tensor_from_cache_entry
    empty.set_(storage, storage_offset, shape, stride)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/sym_node.py", line 564, in guard_size_oblivious
    r = self.shape_env.evaluate_expr(
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/recording.py", line 263, in wrapper
    return retlog(fn(*args, **kwargs))
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6468, in evaluate_expr
    return self._evaluate_expr(
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6658, in _evaluate_expr
    raise self._make_data_dependent_error(
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Ne(507 - u0, 1) (unhinted: Ne(507 - u0, 1)).  (Size-like symbols: u0)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146298
Approved by: https://github.com/bdhirsh
2025-02-03 22:16:03 +00:00
0da07a6d1d [dynamo][skip-function] Add missing unimplemented line (#146322)
This is a missing line from the merged PR in the stack below. Let's try to get this in quickly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146322
Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/mlazos
2025-02-03 22:11:55 +00:00
00dc5b10f6 Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211)"
This reverts commit 2fd1b6b3610eb84cd615360a8fd23756a7f2e743.

Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/atalman due to Breaks executorch tests ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2632202864))
2025-02-03 22:04:28 +00:00
15e12d5ec3 [Trace PyDispatcher] Support temporarily_pop_interpreter_stack ctx manager (#146271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146271
Approved by: https://github.com/zou3519
ghstack dependencies: #146270
2025-02-03 21:47:54 +00:00
bd8d7b1b74 [Dynamo][Trace PyDispatcher] Remove disable from HigherOrderOperator.__call__ (#146270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146270
Approved by: https://github.com/zou3519
2025-02-03 21:47:54 +00:00
fd73ae2068 [Utilization] Convert timestamp to str for datetime64 (#145985)
Convert all float timestamps to int timestamps in the data pipeline for the DB type datetime64.
Float values do not work when inserting into ClickHouse using jsonExtract.
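
A minimal sketch of the conversion described above, assuming records are plain dicts that are serialized to JSON before insertion. The helper name `normalize_timestamps` and the `timestamp` key are hypothetical and not taken from the PR.

```python
import json


def normalize_timestamps(record, keys=("timestamp",)):
    """Return a copy of `record` with float timestamps truncated to int.

    Hypothetical helper; `keys` lists the fields treated as timestamps.
    """
    out = dict(record)
    for key in keys:
        if isinstance(out.get(key), float):
            out[key] = int(out[key])
    return out


row = normalize_timestamps({"timestamp": 1738616718.25, "cpu": 12.5})
payload = json.dumps(row)  # the timestamp is now serialized as an integer
print(payload)
```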

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145985
Approved by: https://github.com/huydhn
2025-02-03 21:05:18 +00:00
1d4adf4e1f [dynamo] log recompile reason to dynamo_compile (#146117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146117
Approved by: https://github.com/bobrenjc93
2025-02-03 21:04:04 +00:00
11f69808c6 [aoti] Assign proxy call args by name, and support default values. (#146263)
Fixes the error shown in the traceback below, which occurs when compiling this program:
```
                window = torch.hann_window(N_FFT).to(x.device)
                stft = torch.stft(
                    x, N_FFT, HOP_LENGTH, window=window, return_complex=True
                )
                magnitudes = stft[..., :-1].abs() ** 2
                return magnitudes
```
```
Traceback (most recent call last):
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
    yield
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 623, in run
    self._callTestMethod(testMethod)
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/zhxchen17/pytorch/torch/testing/_internal/common_utils.py", line 3120, in wrapper
    method(*args, **kwargs)
  File "/home/zhxchen17/pytorch/test/inductor/test_torchinductor.py", line 12356, in new_test
    return value(self)
           ^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor.py", line 4334, in test_stft
    self.check_model(model, example_inputs)
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 185, in check_model
    actual = AOTIRunnerUtil.run(
             ^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 137, in run
    optimized = AOTIRunnerUtil.load(device, so_path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 119, in load
    return torch._export.aot_load(so_path, device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/torch/_export/__init__.py", line 165, in aot_load
    runner = torch._C._aoti.AOTIModelContainerRunnerCuda(so_path, 1, device)  # type: ignore[assignment, call-arg]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected extern kernel aten::hann_window to have serialized argument type as_scalar_type for argument 1 but got as_device
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146263
Approved by: https://github.com/angelayi
2025-02-03 20:15:59 +00:00
e67ce67498 [cutlass backend] update try_import_cutlass to accommodate for pip install (#145891)
The goal of this PR is to provide 3 ways for people to try out CUTLASS backend:
1. fbcode / internal
2. pip install torch (nightly) and pip install nvidia-cutlass
3. build from source

The sections below go into more detail on the combinations of building from source and installing via pip, for both torch and cutlass.

repro:
```
import torch
import torch.nn as nn

import torch._inductor.config as config

config.force_disable_caches = True
config.max_autotune = True
config.max_autotune_gemm_backends = "CUTLASS"
# the following is only needed if you use a custom cutlass library
# config.cuda.cutlass_dir = "/data/users/henrylhtsang/cutlass"

class TestModule(nn.Module):
    def forward(self, A, B):
        return A @ B

model = TestModule().cuda()
M, K, N = 2048, 2048, 2048
A = torch.randn(M, K).cuda().half()
B = torch.randn(K, N).cuda().half()

C = torch.compile(model, fullgraph=True)(A, B)
```

## pre-requisite
This assumes you have the right CUDA toolkit installed; 12.4 is recommended. Make sure PATH, LD_LIBRARY_PATH and CUDA_NVCC_EXECUTABLE are set correctly.

## combo 1: pip install torch + pip install nvidia-cutlass
Check https://pytorch.org/get-started/locally/ for **nightly** install command.
```
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124
pip install nvidia-cutlass
```
Then try running the script above. It should work.

## combo 2: build torch from source + pip install nvidia-cutlass
This is going to be pretty straightforward. Just keep in mind that even though pytorch/third_party/cutlass exists, the one that will be used is the pip package, so be mindful of version differences.

## combo 3: build torch from source + use pytorch/third_party/cutlass
This is how most pytorch devs would do it. Just make sure you don't have a cutlass pip package installed, i.e., make sure `import cutlass_library` would fail on its own.
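
One way to check this without actually importing the package is sketched below. The helper name `module_importable` is hypothetical; `importlib.util.find_spec` is part of the standard library and returns `None` for a top-level module that cannot be found.

```python
import importlib.util


def module_importable(name):
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


# For combo 3 this should print False, meaning no pip cutlass package is visible.
print(module_importable("cutlass_library"))
```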

## combo 4: any torch version + cutlass library from somewhere else
This is probably the only case where you need to pass in cutlass_dir. Just set cutlass_dir to the root of the cutlass repo. The expectation is that cutlass_dir is the directory that contains include, tool, and python/cutlass_library.
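
A small sanity check for that layout might look like the sketch below. The helper `missing_subdirs` is hypothetical, and the subdirectory names are copied verbatim from the sentence above, so adjust them if your checkout names them differently.

```python
from pathlib import Path

# Names taken as written above; this is an assumption about the layout.
EXPECTED_SUBDIRS = ("include", "tool", "python/cutlass_library")


def missing_subdirs(cutlass_dir, expected=EXPECTED_SUBDIRS):
    """Return the expected subdirectories that do not exist under cutlass_dir."""
    root = Path(cutlass_dir)
    return [sub for sub in expected if not (root / sub).is_dir()]


# Example usage (the path is a placeholder):
# print(missing_subdirs("/path/to/cutlass"))
```

An empty list means every expected subdirectory was found.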

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145891
Approved by: https://github.com/Chillee, https://github.com/ColinPeppler
2025-02-03 20:05:41 +00:00
f237172768 Fix not inlining functions used in metal files (#146316)
Fixes issue when building PyTorch with Xcode installed after https://github.com/pytorch/pytorch/pull/146231
```
FAILED: caffe2/aten/src/ATen/kernels_basic.metallib /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen/kernels_basic.metallib
cd /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen && xcrun metallib -o kernels_basic.metallib BinaryKernel_30.air Bucketization_30.air CrossKernel_30.air FusedOptimizerOps_30.air Gamma_30.air HistogramKernel_30.air Im2Col_30.air Indexing_30.air LinearAlgebra_30.air Quantized_30.air RMSNorm_30.air RenormKernel_30.air Repeat_30.air SpecialOps_30.air TriangularOps_30.air UnaryKernel_30.air UnfoldBackward_30.air UpSample_30.air
LLVM ERROR: multiple symbols ('_ZN3c105metal4zetaEff')!
[3835/5420] Building CXX object c10/test/CMakeFiles/c10_small_vector_test.dir/util/small_vector_test.cpp.o
ninja: build stopped: subcommand failed.
```

AI to @malfet: Add linter that ensures that `c10/metal/` headers do not have any functions there, only templates
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146316
Approved by: https://github.com/malfet, https://github.com/atalman
2025-02-03 19:33:52 +00:00
674e0b668a Add non-strict export while_loop test back (#146195)
This is fixed by https://github.com/pytorch/pytorch/pull/145762

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146195
Approved by: https://github.com/zou3519
ghstack dependencies: #146194
2025-02-03 19:28:22 +00:00
1138d0c4f6 [hop] enable while_loop return torch.ones with unbacked symbol expression. (#146194)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146194
Approved by: https://github.com/zou3519
2025-02-03 19:28:22 +00:00
57b1fc35f6 [dynamo] Disable compiling on elementwise_type_promotion_wrapper (#146219)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146219
Approved by: https://github.com/zou3519
ghstack dependencies: #146075, #146283
2025-02-03 18:02:48 +00:00
64fc9ff09c Revert "[ONNX] Create deprecation warning on dynamo_export (#146003)"
This reverts commit e6c39d37e90242692cf25ea849abd47d11932cd7.

Reverted https://github.com/pytorch/pytorch/pull/146003 on behalf of https://github.com/atalman due to Broke internally ([comment](https://github.com/pytorch/pytorch/pull/146003#issuecomment-2631599314))
2025-02-03 17:17:14 +00:00
041e08f9dc Add buffers to parameterization rule (#145991)
Differential Revision: [D68959513](https://our.internmc.facebook.com/intern/diff/D68959513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145991
Approved by: https://github.com/bdhirsh
2025-02-03 16:49:03 +00:00
c0979d72b5 Revert "[hop][inductor] track the dependency on unbacked symbols correctly with constant_args for hops (#143456)"
This reverts commit 68a363548409a3ff17965770304ee5e12fe718d9.

Reverted https://github.com/pytorch/pytorch/pull/143456 on behalf of https://github.com/atalman due to New tests are failing internally ([comment](https://github.com/pytorch/pytorch/pull/143456#issuecomment-2631475900))
2025-02-03 16:25:58 +00:00
01554c7b5a fix incorrect literal strings / accidental tuples (#146037)
* `expr,` is short for `(expr,)`
* literal strings over multiple lines need to escape the newline `\` or use `(...)`.
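
Both pitfalls can be reproduced in a few lines. This is a standalone illustration with made-up names, not code from the PR.

```python
# Pitfall 1: a trailing comma turns the right-hand side into a 1-tuple.
timeout = 30,
timeout_fixed = 30


# Pitfall 2: the second literal is a separate no-op expression statement,
# so it never becomes part of `text`.
def broken_message():
    text = "line one, "
    "line two"
    return text


# Fix A: parentheses make the implicit concatenation span several lines.
message_paren = (
    "line one, "
    "line two"
)

# Fix B: escape the newline with a backslash.
message_backslash = "line one, " \
    "line two"

print(timeout, timeout_fixed)
print(repr(broken_message()), repr(message_paren))
```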

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146037
Approved by: https://github.com/Skylion007
2025-02-03 15:08:11 +00:00
550441a87b Update slow tests (#146301)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146301
Approved by: https://github.com/pytorchbot
2025-02-03 11:37:16 +00:00
08b14936ae Disable has_relational_guards check for dict_tag optimization for now (#146232)
has_relational_guards evaluates to true almost always, and leads to a
slowdown in guards runtime

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146232
Approved by: https://github.com/anijain2305
2025-02-03 07:56:06 +00:00
e3643e1e0e [MPS] Add linalg det and fix lu factor for non contiguous tensors (#146279)
Requested in #77764

This PR adds support for linalg.det on MPS and fixes lu factor for non-contiguous tensors. The current implementation crashed on any kind of non-contiguous tensor with the error:
```
-[AGXG13XFamilyCommandBuffer blitCommandEncoderCommon:]:833: failed assertion `A command encoder is already encoding to this command buffer'
zsh: abort      python det.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146279
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-03 06:06:43 +00:00
1580f47bf4 [export][ez] Fix generated header file. (#146208)
Summary: as title.

Test Plan: CI

Differential Revision: D68978788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146208
Approved by: https://github.com/yiming0416
2025-02-03 06:01:05 +00:00
cyy
7b512095ef Enable some tests on MacOS (#146268)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146268
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-02-03 05:04:24 +00:00
fa48757180 [dynamo] misc fixes for inspect (#146283)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146283
Approved by: https://github.com/jansel
ghstack dependencies: #146075
2025-02-03 04:26:10 +00:00
cyy
6ac8bc0cd2 Remove unused import in tests (#146266)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146266
Approved by: https://github.com/Skylion007
2025-02-03 03:40:18 +00:00
d80eef7c6d [inductor] Guard a member variable with a define. (#146278)
It's unused otherwise, and when running MPS tests, I get a bunch of warnings of this kind:

/Users/davidino/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:412:10: warning: private field 'blob_size_' is not used [-Wunused-private-field]
  412 |   size_t blob_size_;
      |          ^
1 warning generated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146278
Approved by: https://github.com/Skylion007, https://github.com/jansel
2025-02-03 02:20:08 +00:00
c0ec2e0a0d [dynamo][functions] Improve getattr on functions (#146075)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146075
Approved by: https://github.com/jansel
2025-02-03 02:01:57 +00:00
d28fe3ed47 [metal] Move digamma to special_math.h (#146284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146284
Approved by: https://github.com/jansel
2025-02-03 01:29:14 +00:00
1f21f699ba [metal] Refactor digamma in preparation for moving it. (#146281)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146281
Approved by: https://github.com/jansel
2025-02-02 23:54:45 +00:00
511d0dd558 [Dynamo][Trace PyDispatcher] Support calling id function over class (#146269)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146269
Approved by: https://github.com/anijain2305
2025-02-02 22:29:30 +00:00
02fd4868d6 Fix unreachable code (#146262)
Fixes #146261

Removed unreachable code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146262
Approved by: https://github.com/Skylion007
2025-02-02 21:35:26 +00:00
5d55a6585d [MPS] lu factor ex implementation (#144651)
Implements `torch.linalg.lu_factor_ex`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144651
Approved by: https://github.com/malfet
2025-02-02 15:09:49 +00:00
0144613e6f move and fix logic to update unbacked bindings (#146115)
Summary:
Previously we were touching up unbacked bindings between Dynamo and AOTAutograd in strict export, but the logic had a bug: if an unbacked symint gets substituted by a backed symint, we would put the backed symint in the unbacked bindings (the check `is_symbol` was not enough here).

This PR fixes this logic, and moreover, moves it into the serializer instead, because we don't need this adjustment outside serde.

Test Plan: added test

Differential Revision: D68880766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146115
Approved by: https://github.com/pianpwk
2025-02-02 10:43:55 +00:00
a44a8a7d3a [audio hash update] update the pinned audio hash (#145988)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145988
Approved by: https://github.com/pytorchbot
2025-02-02 04:19:29 +00:00
cyy
8543d8395b [2/N] Enable ruff F841 on distributed tests (#146132)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146132
Approved by: https://github.com/Skylion007, https://github.com/rec
2025-02-02 03:44:48 +00:00
cef856faa9 [dynamo][enum] Trace through enum.py for enum construction (#146070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146070
Approved by: https://github.com/jansel
ghstack dependencies: #146062, #146198, #146258, #146214
2025-02-02 03:12:36 +00:00
31fb691782 [dynamo] Graph break on tensor.retain_grad (#146214)
Fixes https://github.com/pytorch/pytorch/issues/146212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146214
Approved by: https://github.com/jansel
ghstack dependencies: #146062, #146198, #146258
2025-02-02 03:12:36 +00:00
529eb8d558 [dynamo] Add return to python_type (#146258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146258
Approved by: https://github.com/jansel
ghstack dependencies: #146062, #146198
2025-02-02 03:12:36 +00:00
7854299b27 [mps/inductor] Implement support for polygamma(). (#146259)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146259
Approved by: https://github.com/jansel
2025-02-02 01:54:23 +00:00
d89c7ea401 add WaitCounter type interface and get rid of type errors (#146175)
Summary: as titled

Differential Revision: D68960123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146175
Approved by: https://github.com/andriigrynenko, https://github.com/Skylion007
2025-02-01 23:24:52 +00:00
3a67c0e48d [inductor] Finish typing common.py (#146225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146225
Approved by: https://github.com/Skylion007
2025-02-01 22:53:35 +00:00
dca5cc0255 [mps] Move polygamma to special_math.h. (#146253)
In preparation to implement it in inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146253
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-02-01 21:45:23 +00:00
07dbd539b4 [BE][Ez]: Make c10/special arrays constexpr (#146246)
No reason to have array creation overhead for these constexpr arrays. This is better because it guarantees the array is not duplicated across templates or translation units unless necessary and allows the compiler to do static compile time bounds checking (even in loop based accesses)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146246
Approved by: https://github.com/dcci, https://github.com/malfet
2025-02-01 21:03:18 +00:00
d4ad7b91ad [mps] Move zeta() to special_math.h. (#146231)
In preparation for implementing digamma/polygamma

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146231
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-02-01 19:22:59 +00:00
f97307f463 [Docs] Add clarification for target types in CrossEntropyLoss doc (#145444)
The CrossEntropyLoss function requires that targets given as class indices be of type long and that targets given as class probabilities be of a float datatype.

The CrossEntropyLoss function distinguishes the two scenarios (indices and probabilities) by comparing shapes. When the input and target shapes are the same, the target is treated as class probabilities; otherwise it is treated as class indices, as already covered in the doc. The related code is here:
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/LossNLL.cpp#L624
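
As a rough illustration of the shape-based rule described above, here is a minimal pure-Python sketch. The helper `classify_target` is hypothetical and not part of PyTorch; it only mirrors the described check (same shape means probabilities, anything else means class indices) along with the expected target dtype.

```python
def classify_target(input_shape, target_shape):
    """Hypothetical helper: mirrors the shape comparison described above."""
    if tuple(input_shape) == tuple(target_shape):
        # Same shape as the input: target holds per-class probabilities.
        return ("class probabilities", "float")
    # Any other shape: target holds one class index per sample.
    return ("class indices", "long")


# Input of shape (batch=4, classes=3) with two kinds of target.
kind_same = classify_target((4, 3), (4, 3))
kind_diff = classify_target((4, 3), (4,))
print(kind_same)
print(kind_diff)
```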

I think the current documentation is great, but it seems it can confuse users about types, as reported in the issue, so this PR adds a bit more clarification.

Fixes #137188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145444
Approved by: https://github.com/mikaylagawarecki
2025-02-01 18:55:58 +00:00
5ed5793016 Temp disable MKL in DistributionKernels.cpp (#146174)
Until https://github.com/pytorch/pytorch/issues/132395 is addressed

Test plan: Add test based on the script below (taken from https://discuss.pytorch.org/t/bug-in-torch-multinomial-generated-distribution-is-modestly-incorrect-edit-this-is-a-regression-and-appears-to-be-due-to-an-analogous-bug-in-tensor-exponential )
```python
import torch

high_bits_for_seed = 16000000000000000000           # to use "good quality" seed
_ = torch.manual_seed (high_bits_for_seed + 2024)

prob = torch.ones (26)
dups_mult = 0
perm_counts_mult = {}
for _ in range (1_000_000):
    p = tuple (torch.multinomial (prob, prob.numel(), replacement=False).tolist())
    if  p in perm_counts_mult:
        dups_mult += 1
        perm_counts_mult[p] += 1
    else:
        perm_counts_mult[p] = 1

print ('duplicate multinomial perms: ', dups_mult)
print ('multiple multinomial perms:  ', (torch.tensor (list (perm_counts_mult.values())) > 1).sum().item())
print ('max of perm_counts_mult:     ', torch.tensor (list (perm_counts_mult.values())).max().item())
print ('len (perm_counts_mult):      ', len (perm_counts_mult))
```

This is a reland of https://github.com/pytorch/pytorch/pull/132532 but excluding internal builds that already have some hardcoded values

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146174
Approved by: https://github.com/ngimel
2025-02-01 18:53:11 +00:00
e56dcf2772 [CPUInductor] Fix SVE256 detection (#146207)
This PR removes `torch.cpu._is_arm_sve_supported()` and replaces it with the stable `torch.backends.cpu.get_cpu_capability()`

I should have reviewed https://github.com/pytorch/pytorch/pull/134672 more thoroughly, because it introduced a duplicate, but slightly different, API for detecting CPU architectures, which resulted in runtime crashes on systems that support SVE128 rather than SVE256

Fixes https://github.com/pytorch/pytorch/issues/145441

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146207
Approved by: https://github.com/angelayi
2025-02-01 18:51:34 +00:00
8c657ae4be [inductor] Add typing to common.CSE (#145993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145993
Approved by: https://github.com/yanboliang
ghstack dependencies: #145913, #145914, #145915, #145916
2025-02-01 16:34:18 +00:00
68cf36d5ab [inductor] Add typing to common.KernelArgs (#145916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145916
Approved by: https://github.com/yanboliang
ghstack dependencies: #145913, #145914, #145915
2025-02-01 16:34:18 +00:00
8e56d713c9 [inductor] Add typing to common.OpDecompositions (#145915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145915
Approved by: https://github.com/yanboliang
ghstack dependencies: #145913, #145914
2025-02-01 16:34:11 +00:00
79f9f62e3a [inductor] Combine regexp checks in OpOverrides.paren (#145914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145914
Approved by: https://github.com/Skylion007
ghstack dependencies: #145913
2025-02-01 16:34:03 +00:00
4c004caa76 [inductor] Add types to DeviceOpOverrides (#145913)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145913
Approved by: https://github.com/Skylion007
2025-02-01 16:33:49 +00:00
0f768c7866 Barebones flat_apply HOP (#146060)
This PR:
- adds pytree.register_constant for registering a class to be treated as
  a constant by torch.compile/torch.fx
- adds a very barebones flat_apply HOP. This should be sufficient to get
  mark_traceable working. A lot more work is necessary to get the custom
  operator case working (when make_fx sees a custom operator with PyTree
  arg types, it needs to emit a call to the flat_apply HOP).
- I expect the flat_apply HOP to change a lot, I want to ship this in
  the current state to unblock the mark_traceable and custom ops
  work.

Test Plan:
- It's kind of difficult to test the barebones flat_apply HOP "works" so
  I added a really simple test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146060
Approved by: https://github.com/StrongerXi, https://github.com/yanboliang
ghstack dependencies: #146059
2025-02-01 16:17:48 +00:00
373606928b Add torch.utils._pytree.register_dataclass (#146059)
This is an API that registers a dataclass as a pytree node.
It directly calls torch.export.register_dataclass, but we should
eventually inline that implementation here. I want to use this API for
something in compile and feel weird calling
torch.export.register_dataclass.

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146059
Approved by: https://github.com/StrongerXi, https://github.com/angelayi, https://github.com/yanboliang
2025-02-01 16:17:48 +00:00
cyy
2fd1b6b361 [Environment Variable][7/N] Use thread-safe getenv functions (#140211)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211
Approved by: https://github.com/ezyang, https://github.com/eqy
2025-02-01 12:33:41 +00:00
2b00d211f0 Build RowwiseScaledMM.cu for SM89 (#145676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145676
Approved by: https://github.com/drisspg, https://github.com/malfet, https://github.com/eqy
2025-02-01 11:44:58 +00:00
f40e013787 Fix aten.to when input is a tensor constant (#146220)
Summary:
Fix aten.to when input is a tensor constant.

In this case, `args_unwrapped` could just be a constant, so not a functional tensor.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  tensor_constant_aten_to
```

Differential Revision: D68984244

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146220
Approved by: https://github.com/JacobSzwejbka
2025-02-01 11:07:33 +00:00
30f091da44 add speculation log divergence test (#145659)
Followup from a SEV. Confirmed that this breaks when stacked on top of https://github.com/pytorch/pytorch/pull/145660 (offending PR that caused the SEV)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145659
Approved by: https://github.com/laithsakka
2025-02-01 09:39:22 +00:00
a4e4368157 add node mapping processing (#146103)
Summary:
Add `node_mapping = create_node_mapping(pre_grad_graph_id, inductor_post_to_pre_grad_nodes, debug_info)`, to produce an `inductor_provenance_tracking_node_mappings.json` file. This file will be used by the provenance tracking highlighter tool to create provenance visualization.

`inductor_triton_kernel_to_post_grad_nodes.json` and `inductor_provenance_tracking_node_mappings.json` files are not dumped if they are both empty, so they are removed from some of the `test_structured_trace` tests.

Test Plan:
CI
```
buck run mode/dev-nosan  fbcode//caffe2/test:fx -- -r graph_provenance

buck run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing

python test/dynamo/test_structured_trace.py
```

Differential Revision: D68190173

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146103
Approved by: https://github.com/chenyang78
2025-02-01 08:29:29 +00:00
f38d5b4a74 Update TorchBench commit to main (#145455)
I'm adding sam2 to TorchBench https://github.com/pytorch/benchmark/issues/2566, so, as part of that, I'm updating PyTorch CI to use latest TorchBench commit.

The corresponding change from TorchBench is https://github.com/pytorch/benchmark/pull/2584

The main thing to call out is that the newer transformers version added by https://github.com/pytorch/benchmark/pull/2488 is regressing several models. This needs to be investigated further, so I pin the version to unblock this change.

* `hf_Roberta_base` is a new model added by https://github.com/pytorch/benchmark/pull/2279, not sure why it fails accuracy on A10G, but it works fine on A100
* `speech_transformer` failures are pre-existing trunk failures, i.e. https://github.com/pytorch/pytorch/actions/runs/13040114684/job/36380989702#step:22:2408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145455
Approved by: https://github.com/kit1980
2025-02-01 06:44:26 +00:00
a97a906dd9 Add "//caffe2:libtorch" to minifier TARGET file (#146203)
Summary: as title. To avoid errors like "undefined symbol: aoti_torch_device_type_cpu" when compiling minifier_launcher.py

Test Plan: CI

Differential Revision: D68978430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146203
Approved by: https://github.com/desertfire
2025-02-01 05:37:23 +00:00
bcd0ba0f69 Adding the best autotuner config (#146121)
Summary: Adding logs to log the best config for autotune configs

Test Plan:
Testing in Mast : aps-omnifmv1-5_32_test_with_best_config-c5e9ceccf8

 {F1974838864}

Reviewed By: oulgen

Differential Revision: D68931164

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146121
Approved by: https://github.com/oulgen
2025-02-01 03:43:33 +00:00
549e230c33 [draft_export] Clear pending unbacked symbols when overriding mismatched fake kernels (#146089)
Summary:
When encountering a mismatched fake kernel that also creates unbacked symbols, draft export will fail with `PendingUnbackedSymbolNotFound` error.

Clearing `shape_env.pending_fresh_unbacked_symbols` fixes this issue.

Test Plan:
```
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_override_mismatched_fake_kernel_with_unbacked_symbols
```

Differential Revision: D68920990

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146089
Approved by: https://github.com/pianpwk
2025-02-01 03:32:50 +00:00
cyy
4d2056efb5 Enable ruff F841 on numpy tests (#146126)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146126
Approved by: https://github.com/rec, https://github.com/albanD
2025-02-01 03:07:28 +00:00
cyy
985a78e9df Enable ruff F841 on distributed tests (#146131)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146131
Approved by: https://github.com/rec, https://github.com/albanD
2025-02-01 03:06:16 +00:00
1de41e6918 [dynamo][exceptions][3.10] Clean symbolic stack on exception handling (#146198)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146198
Approved by: https://github.com/williamwen42
ghstack dependencies: #146062
2025-02-01 02:51:44 +00:00
6023684311 [export] Fix symfloat serialization (#146112)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146112
Approved by: https://github.com/pianpwk
2025-02-01 02:28:44 +00:00
8326d27093 [inductor][5/N] triton support post-#5512, fix 1 and None handling (#145515)
This fixes handling for "1" and "None" args with new Triton versions. TL;DR: triton_meta["constants"] (which is passed to ASTSource) should be a map of {"kwarg_name": constant_value} for values which are tl.constexpr, or have a value of 1 or None (i.e. "specialized" constants). For constant args, triton_meta["signature"][arg_name] should be "constexpr" (even for specialized constants).
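
The rule in the TL;DR above can be sketched in plain Python. This is only an illustration under assumptions: `build_triton_meta` is a hypothetical helper, not Inductor's actual code, and the argument names, type strings, and values are made up for the example.

```python
def build_triton_meta(args, constexpr_names):
    """Hypothetical sketch: args maps arg name -> (type string, value)."""
    constants = {}
    signature = {}
    for name, (type_str, value) in args.items():
        # "Specialized" constants: the integer 1 or None.
        is_specialized = value is None or (type(value) is int and value == 1)
        if name in constexpr_names or is_specialized:
            constants[name] = value
            # Constant args are marked "constexpr", even specialized ones.
            signature[name] = "constexpr"
        else:
            signature[name] = type_str
    return {"constants": constants, "signature": signature}


meta = build_triton_meta(
    {
        "in_ptr": ("*fp32", "buf0"),  # ordinary pointer argument
        "xnumel": ("i32", 1),         # specialized because the value is 1
        "bias": ("*fp32", None),      # specialized because the value is None
        "XBLOCK": ("i32", 128),       # declared tl.constexpr
    },
    constexpr_names={"XBLOCK"},
)
print(meta)
```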

Note: This adds support for Triton versions after 5512; but not for versions in between 5220 and 5512 (i.e. `TritonAttrsDescriptorVersion.V3_BACKENDS_TUPLE`). There's a completely different format for constants/signature in the commit range in between.

To test: I ran `test_torchinductor.py` and `test_triton_kernels.py` with the main branch of triton (~jan 27). The only failing tests are aoti-related tests (which need to be fixed as a follow-up), and test_mutable_custom_op_fixed_layout2_cuda (which is failing with or without the new triton version on my machine); additionally, the split-scan/split-reduction kernels rely on https://github.com/triton-lang/triton/pull/5723.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145515
Approved by: https://github.com/SamGinzburg
2025-02-01 02:11:48 +00:00
6e734bab93 execution trace export supports gzip format (#146179)
As above, allows Chakra Execution Trace observer to support compressing files.
Usage is straightforward: just add a ".gz" suffix to the output file name
```
et = ExecutionTraceObserver()
et.register_callback("my_trace.json.gz")
```
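
A gzip-compressed JSON file like this can be read back with the standard library alone. The sketch below is an assumption-laden illustration: it writes a made-up placeholder payload rather than a real execution trace, and the file location is a temporary directory chosen for the example.

```python
import gzip
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "my_trace.json.gz")

# Placeholder payload standing in for a real execution trace.
with gzip.open(path, "wt", encoding="utf-8") as f:
    json.dump({"nodes": []}, f)

# Reading is symmetric: open in text mode and parse as JSON.
with gzip.open(path, "rt", encoding="utf-8") as f:
    trace = json.load(f)

print(trace)
```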

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146179
Approved by: https://github.com/shengfukevin, https://github.com/davidberard98, https://github.com/sraikund16
2025-02-01 01:25:25 +00:00
57c45340e7 include entire GraphModule instead of current node when erroring inside of fx interpreter (#146197)
This seems like it would make it easier to diagnose PT2 issues where the user cannot easily repro, and we need more info in the backtrace, e.g. in https://github.com/pytorch/pytorch/issues/134182#issuecomment-2628076114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146197
Approved by: https://github.com/jamesjwu
2025-02-01 01:09:27 +00:00
73d90d66a4 Cap size of thread pool in select_algorithm to cpu count (#146071)
Summary: With changes from https://github.com/pytorch/pytorch/pull/144829, we can see more autotune configs and the size of the pool can get outta hand when using the cutlass backend.

See internal discussion at: https://fburl.com/workplace/7g4vz0zy

Test Plan: `python test/inductor/test_cutlass_backend.py`
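
The idea of capping a pool at the CPU count can be sketched with the standard library. This is not the code in select_algorithm; `make_capped_pool` and the task count are hypothetical names and values used only for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def make_capped_pool(num_tasks):
    """Hypothetical helper: never create more workers than there are CPUs."""
    cpu_count = os.cpu_count() or 1
    workers = max(1, min(num_tasks, cpu_count))
    return ThreadPoolExecutor(max_workers=workers), workers


# Even with many pending tasks, the pool size stays bounded.
pool, workers = make_capped_pool(num_tasks=10_000)
with pool:
    results = list(pool.map(lambda x: x * x, range(8)))

print(workers, results)
```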

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146071
Approved by: https://github.com/Chillee
2025-02-01 00:41:36 +00:00
cde5ddfd14 fix internal error with reorder submodules (#146181)
Test Plan: hard to isolate as small repro

Differential Revision: D68963033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146181
Approved by: https://github.com/angelayi
2025-02-01 00:30:42 +00:00
35f113e2a0 torch/nn/utils/rnn.py: docs: improvements (#138628)
Fix constants highlighting in generated documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138628
Approved by: https://github.com/mikaylagawarecki
2025-02-01 00:10:30 +00:00
a78c796f0b [AOTI] Support composed dynamic shape constraint (#146044)
Summary: Fixes https://github.com/pytorch/pytorch/issues/145500. When export takes a dynamic shape constraint as an expression containing a symbol, we should be able to solve the symbol at run time.
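
To make "solving the symbol at run time" concrete, here is a minimal sketch for the simplest case, a linear expression `scale * s + offset`. It is an assumption that this form is representative; `solve_linear_symbol` is a hypothetical helper and does not reflect the code AOTI actually generates.

```python
def solve_linear_symbol(observed_size, scale, offset):
    """Solve observed_size == scale * s + offset for an integer s."""
    if scale == 0:
        raise ValueError("scale must be non-zero")
    remainder = observed_size - offset
    if remainder % scale != 0:
        # The observed size cannot be produced by any integer s.
        raise ValueError("observed size violates the shape constraint")
    return remainder // scale


# A dimension constrained to 2 * s, observed as 10 at run time.
s_even = solve_linear_symbol(observed_size=10, scale=2, offset=0)
# A dimension constrained to 2 * s + 1, observed as 7 at run time.
s_odd = solve_linear_symbol(observed_size=7, scale=2, offset=1)
print(s_even, s_odd)
```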

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146044
Approved by: https://github.com/angelayi
ghstack dependencies: #146043
2025-02-01 00:02:12 +00:00
43372e70c2 enhance logging statically known by adding size_oblivious(..) (#145354)
after the diff
```
[0/0_1] eval size_oblivious(Eq(s1, 1)) == False [statically known]
[0/0_1] eval size_oblivious(Eq(u0, 1)) == False [statically known]
[0/0_1] eval size_oblivious(Eq(s0, 1)) == False [statically known]
[0/0_1] eval size_oblivious(Eq(s0*s1*u0, 0)) == False [statically known]
```
before
```
[0/0_1] eval (Eq(s1, 1)) == False [statically known]
[0/0_1] eval (Eq(u0, 1)) == False [statically known]
[0/0_1] eval (Eq(s0, 1)) == False [statically known]
[0/0_1] eval (Eq(s0*s1*u0, 0)) == False [statically known]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145354
Approved by: https://github.com/ezyang
2025-01-31 23:26:37 +00:00
f25f1163dc [dynamo] Support frozenset({..}).__contains__ (#146062)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146062
Approved by: https://github.com/Skylion007, https://github.com/jansel
2025-01-31 23:22:58 +00:00
eb029fba13 [c10d][NCCL] Implement ncclCommInitRankScalable (merging #136789) (#144794)
Try to land https://github.com/pytorch/pytorch/pull/136789/files on our end and fix any remaining issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144794
Approved by: https://github.com/kwen2501, https://github.com/eqy, https://github.com/atalman
2025-01-31 22:39:56 +00:00
af2a39849d [AOTI] Refactor codegen_input_symbol_assignment (#146043)
Summary: Extract the common logic for size and stride symbol generation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146043
Approved by: https://github.com/angelayi
2025-01-31 21:55:18 +00:00
c39c679813 Revert "Tensor .cuda() very slow with specific array sizes (#138964)"
This reverts commit 98f87edd233ea69cee5f3e73e9eb4b5ab77aa744.

Reverted https://github.com/pytorch/pytorch/pull/138964 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but some slow test start failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/138964#issuecomment-2628455198))
2025-01-31 21:48:51 +00:00
a7cc6d3e84 Manylinux 2.28 migration - remove pre-cxx11 abi libtorch builds (#146200)
Related to: https://github.com/pytorch/pytorch/issues/123649
Removing pre-cxx11 abi builds.
As per announcement : https://dev-discuss.pytorch.org/t/pytorch-linux-wheels-switching-to-new-wheel-build-platform-manylinux-2-28-on-november-12-2024/2581
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146200
Approved by: https://github.com/kit1980, https://github.com/huydhn
2025-01-31 21:43:12 +00:00
8203894eff Resolve affine quantization namespace collision with torchao (#145941)
Summary:
https://github.com/pytorch/pytorch/pull/141421
duplicated affine quantization custom ops from torchao into
the PT2E quantization flow, but these ops are registered under
the same namespace with the same name, causing "Duplicate
registration" errors for the new ops for use cases that import
from both repos. This commit fixes this by moving the PT2E
versions of the ops to a new namespace. In the long term,
we expect to migrate PT2E into torchao so users can migrate
back to the old namespace if they wish to.

Test Plan: python test/test_quantization.py -k test_channel_group_quantization

Differential Revision: D68838437

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145941
Approved by: https://github.com/cccclai
2025-01-31 21:29:47 +00:00
781aceee9c [dynamo] Revert abc change due to internal failures (#146177)
xref - https://www.internalfb.com/tasks/?t=191383874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146177
Approved by: https://github.com/StrongerXi
ghstack dependencies: #146141
2025-01-31 21:28:06 +00:00
a0d1393b1a [MTIA][FSDP2] Enable MTIA device in FSDP2 library code (#145842)
Differential Revision: D68560256

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145842
Approved by: https://github.com/chaos5958, https://github.com/nautsimon
2025-01-31 21:21:00 +00:00
06850e624a [ca][hop] test CA on all HOPs (#145429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145429
Approved by: https://github.com/zou3519
ghstack dependencies: #145422
2025-01-31 20:45:22 +00:00
2e197c8a2d [dynamo][hop] test torch.compiling all HOPs (#145422)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145422
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2025-01-31 20:45:22 +00:00
5b1abdbf5d [dynamo] remove always-failing eval_frame.c debug check (#145982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145982
Approved by: https://github.com/StrongerXi, https://github.com/jansel
ghstack dependencies: #145981
2025-01-31 20:40:59 +00:00
49df8de8be [dynamo] disable eval_frame callback in _TorchDynamoContext __enter__/__exit__ (#145981)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145981
Approved by: https://github.com/jansel
2025-01-31 20:40:59 +00:00
3a4e7a589b [CI][Distributed] Fix edge case: One rank case (Rank 0) should get [False, False] (#146099)
To match the expected tensor (i.e. 2nd element in the array). Making rank0 receive [False, False]

Fixes one of the issues reported in #146094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146099
Approved by: https://github.com/eqy
2025-01-31 20:31:13 +00:00
8b8c596503 Remove trivial dispatch_key_allowlist_check function (#146169)
Hmmm...this _is_ removing a public function from a public C++ file. But the GH counts for this function total 83, seemingly all copying pytorch: https://github.com/search?q=dispatch_key_allowlist_check&type=code&p=1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146169
Approved by: https://github.com/albanD, https://github.com/zou3519
2025-01-31 19:59:40 +00:00
ec2522e200 [MPS] optimize cholesky (#145722)
Followup to #145701

Optimizes the SYRK and TRSM kernels of Cholesky decomposition on MPS. The SYRK kernel now does matmuls with Apple's simdgroup matrices instead of a tiled implementation, and the TRSM kernel uses vectorized loads. This PR also puts the command encoder inside the stream queue dispatch (as discussed on the last PR).

Script to collect perf
```
import torch
import numpy as np
import time
import csv

matrix_sizes = [512, 1024, 2048, 4096]
batch_sizes = [1, 2, 4, 8, 16]
num_runs = 10
warmup_runs = 3

def create_spd_matrix(n, batch_size):
    torch.manual_seed(42)
    A = torch.randn(batch_size, n, n, dtype=torch.float32)
    return A @ A.transpose(-2, -1) + n * torch.eye(n).expand(batch_size, -1, -1)

def run_cholesky_mps(A):
    torch.mps.synchronize()
    start = time.perf_counter()
    b = torch.linalg.cholesky(A, upper=False)
    torch.mps.synchronize()
    end = time.perf_counter()
    return b, end - start

results = {
    'N': [],
    'batch_size': [],
    'mean_time': [],
    'std_time': []
}

for n in matrix_sizes:
    for batch_size in batch_sizes:
        print(f"\nBenchmarking N={n}, batch_size={batch_size}")

        try:
            A_cpu = create_spd_matrix(n, batch_size)
            A_mps = A_cpu.to("mps")

            for _ in range(warmup_runs):
                _, _ = run_cholesky_mps(A_mps)

            times = []
            for _ in range(num_runs):
                _, t = run_cholesky_mps(A_mps)
                times.append(t)

            mean_time = np.mean(times)
            std_time = np.std(times)

            results['N'].append(n)
            results['batch_size'].append(batch_size)
            results['mean_time'].append(mean_time)
            results['std_time'].append(std_time)

            print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

        except RuntimeError as e:
            print(f"Error for N={n}, batch_size={batch_size}: {e}")
            continue

with open('cholesky_benchmark_times.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'batch_size', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['batch_size'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])
```

Observed speedups on M1 Pro
![cholesky_speedup](https://github.com/user-attachments/assets/be3edb1a-8b4a-4039-9d7f-9b9a10f1c83a)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145722
Approved by: https://github.com/malfet
2025-01-31 19:52:31 +00:00
6a0138fcc1 Torch device backend autoload fix (#145611)
This causes an import failure if an external backend imports a module that uses `torch._as_tensor_fullprec` when it is being loaded.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145611
Approved by: https://github.com/albanD
2025-01-31 19:27:42 +00:00
cyy
18380836eb Remove outdated test skipif conditions for Python3.9 (#146144)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146144
Approved by: https://github.com/albanD
2025-01-31 19:01:04 +00:00
68a3635484 [hop][inductor] track the dependency on unbacked symbols correctly with constant_args for hops (#143456)
Before the PR, we're getting an undefined symbol error for output code when an unbacked symint is **only** used in the hop because we didn't correctly record the dependency of the unbacked symbols for hops and it gets DCEed accidentally.

This PR adds the symbol arguments to `constant_args`, where the dependencies can be correctly constructed when `get_unbacked_symbol_uses` is called to check constant_args.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143456
Approved by: https://github.com/desertfire
2025-01-31 18:29:27 +00:00
aad9f44b2e [export] Sync model container types to schema.py (#145959)
Summary: Synced from D68840230

Test Plan: No behavior changes to existing API. Will be tested internally.

Differential Revision: D68846532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145959
Approved by: https://github.com/yiming0416
2025-01-31 18:17:56 +00:00
16f44fee25 Revert "[inductor/profiler] add kernel kwargs instrumentation (#145573)"
This reverts commit 720b8d0d8dac98f89499bc6b251d1f34dbf68dfe.

Reverted https://github.com/pytorch/pytorch/pull/145573 on behalf of https://github.com/ZainRizvi due to Sorry, but this is failing internally. It's a bit weird since this PR doesn't really appear related at first glance, but despite retries it fails pretty consistently. Please see D68930742 for details ([comment](https://github.com/pytorch/pytorch/pull/145573#issuecomment-2628013872))
2025-01-31 18:13:23 +00:00
67ed47d886 Binary upload checksum (#144887)
Equivalent to https://github.com/pytorch/test-infra/pull/6172 but for pytorch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144887
Approved by: https://github.com/atalman
2025-01-31 17:51:27 +00:00
d0748566b4 s390x ci: ensure CI starts correctly if token pipe is not removed (#145840)
Mark stop actions as "may fail".
The container is expected to stop on its own in the normal case.

Remove "may fail" mark from token generation steps.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145840
Approved by: https://github.com/huydhn
2025-01-31 17:46:09 +00:00
44ecbcbd5a s390x: disable test_model_exports_to_core_aten.py test (#145835)
It often gets killed by OOM.
Disable it while investigating.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145835
Approved by: https://github.com/huydhn
2025-01-31 17:45:10 +00:00
667b94d1c2 [hotfix][dynamo] Skip linecache due to a flaky issue (#146141)
A large number of jit + dynamo wrapped tests fail in linecache tracing.
We need further debugging. Skipping for now to stem the bleeding.

https://github.com/pytorch/pytorch/issues/146076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146141
Approved by: https://github.com/StrongerXi
2025-01-31 17:45:06 +00:00
c3f71eb61b Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit e2917245fb0c0b6aab216e7a0a254b80e7a9e78f.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/ZainRizvi due to Sorry but this still fails internally with the same error.  @Chillee or @malfet, can you please help the change get tested? (See D68783351) ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2627886999))
2025-01-31 17:43:09 +00:00
f5a61ba0a3 Revert "inductor: Don't throw an internal error when a nn.module is missing a attribute (#145122)"
This reverts commit d100e9ae744322a74d9fd05d0851caaf36f19c24.

Reverted https://github.com/pytorch/pytorch/pull/145122 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. See D68924977 for details ([comment](https://github.com/pytorch/pytorch/pull/145122#issuecomment-2627880860))
2025-01-31 17:39:23 +00:00
eb5a0718c2 S390x nightly builds timeouts (#146041)
Sometimes the build times out at the end.
An increased timeout should fix this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146041
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-01-31 17:29:11 +00:00
001e355a56 Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880)
## Background

This PR adds `torch.utils.serialization.config.load.calculate_storage_offsets`. This option relies on the previous PR in this stack, where storage order was changed to non-lexicographical. A `.format_version` entry was added to the zipfile, and `calculate_storage_offsets` will only work on checkpoints with `.format_version`.

When this is turned on, for `torch.load(mmap=True)`, the offset of each storage record (other than the 0th storage) will be calculated instead of relying on `miniz` APIs to determine it.

The existing APIs will issue multiple random reads (reading the end of central directory record, then reading the zipfile header for the record) to determine the storage offset where the record starts. This can greatly degrade `torch.load(mmap=True)` performance for non-filesystem cases.

6aaae9d78f/caffe2/serialize/inline_container.cc (L589-L605)

## How does this work

The format for the checkpoint is as such

```
archive_name/
|_ data.pkl
|_.format_version
|_byteorder
|_data/
  |_ 0
  |_ 1
  |_ 2
  |_ ...
|_
```

Each `data/i` record represents a storage, where storages are written in the order that the Pickler encounters them.

For each storage, our `persistent_load` logic saves the following metadata to the pickle file `dtype, numel, key, location` where `numel` is the number of bytes in the storage.

Note that we always use the `miniz` writer in zip64 mode per [here](7796e308d0/caffe2/serialize/inline_container.cc (L701)). A zipfile record written by miniz looks as such

```
 ---------------- ----------------- ------------------ ---------------- --------- --------------------------------
| 30 byte header | n byte filename | zip64_extra_data | m byte padding | storage | 16 or 24 byte local dir footer |
 ---------------- ----------------- ------------------ ---------------- --------- --------------------------------
```

- The header size (30) is given by [`MZ_ZIP_LOCAL_DIR_HEADER_SIZE`](https://github.com/pytorch/pytorch/blob/main/third_party/miniz-3.0.2/miniz.c?fbclid=IwZXh0bgNhZW0CMTEAAR2O8Vysd--UoSCxW70gabXIS1dbz733oHwuUQ5_Ff1hY2WU6PL2i6CSH4A_aem_J9oaU2HpDeWtJKOU9EnVqw#L3290)
- filename will be `"{archive_name}/{filepath}"`

- `zip64_extra_data` is determined by [`mz_zip_writer_create_zip64_extra_data`](7796e308d0/third_party/miniz-3.0.2/miniz.c (L6202)). Note that [we only create zip64_extra_data if storage_size >= 0xFFFFFFFF or the offset of the start of the header >= 0xFFFFFFFF](7796e308d0/third_party/miniz-3.0.2/miniz.c (L6519-L6524))
- `m` is determined by [`getPadding`](7796e308d0/caffe2/serialize/inline_container.cc (L254)), which accounts for the filename and `zip64_extra_data` to determine `m` such that the start of `storage` is aligned to 64 bytes. The `m` bytes will always start with `F B padding_size` as the first 4 bytes.
- The local dir footer size is determined based on [this snippet ](7796e308d0/third_party/miniz-3.0.2/miniz.c (L6610-L6632)): if the buffer size is 0 it is skipped. If the zip64_extra_data was created, it is 24, otherwise it is 16.

When `torch.utils.serialization.config.load.calculate_storage_offsets` is set, we do the following:
- We keep track of where the "cursor" is in the file using `current_offset`. After each `persistent_load` call, it will be at the offset where the header for the next record starts
- for the 0th storage, "data/0", we use the regular get_record_offset to determine the start of the storage
- for any other storage, (where the storages will be in order encountered by the unpickler, 0, 1, 2, 3, ...) we use `get_record_offset_no_read`, which re-uses the `getPadding` logic to determine the offset of the storage
- Note that `load_tensor` will only ever be called again with the same key if the storage's `._data_ptr()` is 0 [[pointer1](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L1917-L1918)][[pointer2](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L1936-L1937)], so we cache the offsets for this edge case
- After each storage, if the storage is non-zero, we account for the local dir footer based on the logic described above
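
The cursor arithmetic above can be sketched in a few lines of Python. This is a simplified model, not the actual `getPadding` / `get_record_offset_no_read` code: the function names are hypothetical, the zip64 extra-data length is passed in by the caller rather than derived, and the 4-byte minimum padding is an assumption based on the 4-byte padding prefix described above.

```python
LOCAL_HEADER_SIZE = 30  # MZ_ZIP_LOCAL_DIR_HEADER_SIZE, as stated above
ALIGNMENT = 64          # storages start on a 64-byte boundary
MIN_PADDING = 4         # assumption: the padding field always holds its 4-byte prefix


def storage_start(header_offset, filename, zip64_extra_len=0):
    """Offset of the storage bytes for a record whose header starts at header_offset."""
    unpadded = (
        header_offset
        + LOCAL_HEADER_SIZE
        + len(filename.encode())
        + zip64_extra_len
        + MIN_PADDING
    )
    # Round up to the next multiple of ALIGNMENT.
    return -(-unpadded // ALIGNMENT) * ALIGNMENT


def footer_size(storage_nbytes, has_zip64_extra):
    """Local dir footer: skipped for empty storages, 24 bytes with zip64 data, else 16."""
    if storage_nbytes == 0:
        return 0
    return 24 if has_zip64_extra else 16


def next_header_offset(header_offset, filename, storage_nbytes, zip64_extra_len=0):
    """Where the cursor ends up after this record, i.e. the next record's header."""
    start = storage_start(header_offset, filename, zip64_extra_len)
    return start + storage_nbytes + footer_size(storage_nbytes, zip64_extra_len > 0)
```

Under these assumptions, a 128-byte storage named `archive/data/0` whose header starts at offset 0 begins at offset 64, and the header of the following record starts at offset 208.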

## Testing strategy

The agreed upon testing strategy was as follows:
- Add debug code gated by an environment flag `TORCH_SERIALIZATION_DEBUG` that will run this offset calculation logic and verify it against getRecordOffset for each storage (when mmap=False)
- This flag is set throughout CI, which means that every time `torch.load` is called, the offset calculation logic is implicitly being tested.
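
A minimal sketch of this kind of environment-gated cross-check is shown below. Only the flag name `TORCH_SERIALIZATION_DEBUG` comes from the description above; the helper name `resolve_offset`, the `read_offset` callback and the choice of `"1"` as the enabling value are assumptions made for illustration.

```python
import os


def resolve_offset(calculated_offset, read_offset):
    """Return the calculated offset, verifying it against a slow read when debugging.

    read_offset is a zero-argument callable standing in for the slower lookup
    that actually reads the zip metadata.
    """
    # Assumption: the flag is considered enabled when set to "1".
    if os.environ.get("TORCH_SERIALIZATION_DEBUG") == "1":
        expected = read_offset()
        if expected != calculated_offset:
            raise AssertionError(
                f"calculated offset {calculated_offset} != read offset {expected}"
            )
    return calculated_offset
```

With the flag unset the slow path is never invoked, so the check adds no reads outside of debug runs.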

Differential Revision: [D67673026](https://our.internmc.facebook.com/intern/diff/D67673026)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143880
Approved by: https://github.com/albanD
ghstack dependencies: #143879
2025-01-31 17:09:20 +00:00
98f87edd23 Tensor .cuda() very slow with specific array sizes (#138964)
### **Pull Request: Optimized Non-Contiguous Tensor Copy for CPU to GPU in PyTorch**

#### **Summary**
This PR addresses the performance issue identified in [#111570](https://github.com/pytorch/pytorch/issues/111570), where non-contiguous tensors took significantly longer to transfer from CPU to GPU. Through detailed tracing of the call flow, we identified that PyTorch was creating temporary contiguous buffers for non-contiguous tensor transfers, which introduced unnecessary overhead.

#### **Tracing the Issue**
To pinpoint the cause of the slowdown, we followed the call flow from Python’s `tensor.cuda()` method through PyTorch’s backend, ultimately identifying `copy_kernel_cuda` as the key function responsible for CPU-to-GPU tensor transfers. Here’s a summary of the tracing process:

1. **Python Call: `tensor.cuda()`**
   - Starting from Python, the `cuda()` method initiates the tensor transfer to the GPU.

2. **`TensorBody.h: cuda()`**
   - The `cuda()` method calls `to()`, specifying the target device as CUDA.

3. **`Tensor.cpp: TensorBase::to()`**
   - The `to()` function prepares device and data type options before invoking `_ops::to_dtype_layout::call()`.

4. **Operator Call: `_ops::to_dtype_layout::call()`**
   - This operator dispatches the request to the backend-specific function responsible for managing the transfer.

5. **`Copy.cpp: copy_()`**
   - The `copy_()` function performs preliminary checks (e.g., zero-tensor immutability) and proceeds to call `copy_impl()`.

6. **`Copy.cpp: copy_impl()`**
   - This function sets up a tensor iterator and dispatches the copy operation to the appropriate backend through `copy_stub`.

7. **Dispatch to CUDA: `copy_stub`**
   - The dispatch mechanism routes the call to the CUDA-specific function, `copy_kernel_cuda`.

8. **`Copy.cu: copy_kernel_cuda()`**
   - Here, we identified that PyTorch was creating temporary contiguous buffers for 1D and 2D non-contiguous tensors, which slowed down the copy process. This behavior is managed by the `copy_requires_temporaries()` function.

#### **Solution**
To address this, we modified `copy_kernel_cuda` to handle non-contiguous 1D and 2D tensors directly by using `cudaMemcpy2DAsync`, which allows efficient, stride-aware memory transfers without temporary buffers. Here’s why this approach improves performance:

- **Efficiency of `cudaMemcpy2DAsync`**: This CUDA function is optimized for pitched (stride-based) memory transfers, allowing it to handle non-contiguous data layouts effectively by specifying memory strides for source and destination tensors.
- **Reduction of Overhead**: By directly copying non-contiguous tensors without intermediate buffers, we eliminate extra memory allocation and achieve faster CPU-to-GPU transfers.
- **Asynchronous Execution**: `cudaMemcpy2DAsync` enables asynchronous transfer on the CUDA stream, further improving performance by taking advantage of CUDA's optimized memory handling for non-contiguous layouts.
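
To make the pitched-copy idea concrete, here is a pure-Python emulation of what a 2D strided copy does with its pitch, width and height arguments. It is not the CUDA call itself and does not reflect the actual changes in `copy_kernel_cuda`; the function `memcpy_2d` and its `src_offset` parameter are hypothetical, while `dpitch`, `spitch`, `width` and `height` mirror the parameter names of `cudaMemcpy2DAsync`.

```python
def memcpy_2d(dst, dpitch, src, spitch, width, height, src_offset=0):
    """Copy `height` rows of `width` bytes each.

    Consecutive rows are `spitch` bytes apart in src and `dpitch` bytes apart
    in dst, so a strided source can be packed without an intermediate buffer.
    """
    for row in range(height):
        s = src_offset + row * spitch
        d = row * dpitch
        dst[d:d + width] = src[s:s + width]


# A 4 x 6 byte matrix stored row-major; copy columns 1..3 of every row.
rows, cols = 4, 6
src = bytes(range(rows * cols))
dst = bytearray(rows * 3)
memcpy_2d(dst, dpitch=3, src=src, spitch=cols, width=3, height=rows, src_offset=1)
```

Each row is copied as one contiguous run, and only the row-to-row distance differs between source and destination.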

#### **Performance Results**

In my testing, I created tensors of size `327680 x 2000` and used slices for transfer performance measurements. The tests show that the average time for transferring a non-contiguous slice (e.g., rows 10,000 to 50,000) from CPU to GPU now closely matches the contiguous case. This improvement indicates that the updated implementation effectively addresses the performance discrepancy. Below are the measured times and validation checks:

```plaintext
Average time for contiguous slice (rows 10,000-50,000): 66 ms
Average time for non-contiguous slice (rows 10,000-50,000): 66 ms

Validation of contiguous and non-contiguous tensor copies:
 PASS: Tensor shapes match.
 PASS: Tensor contiguity matches.
 PASS: Tensor contents match.
 PASS: Tensor data types match.

 Success: Both contiguous and non-contiguous tensors were copied correctly to the GPU.
```

#### **Conclusion**
This PR resolves the identified performance issue by eliminating the need for temporary buffers in non-contiguous 1D and 2D tensor transfers, ensuring faster and more efficient copies from CPU to GPU. Future optimizations could further enhance performance for higher-dimensional non-contiguous tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138964
Approved by: https://github.com/jeffdaily

Co-authored-by: Natalia Gimelshein <ngimel@gmail.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-31 17:05:02 +00:00
2d6f6637d3 Remove lexicographical sorting of storage keys in torch.save (#143879)
Currently the order is lexicographical (i.e. 0, 10, 11, ...19, 2, ....) instead of 0, 1, 2, 3, 4, 5 (the order that storage metadata is actually pickled in). Since PyTorch will never be used with Python < 3.7, we can be assured that the keys will be read in the order of insertion (numerically sorted).

This makes the order in which storages are written the same as the pickling/unpickling order, so we can calculate their offsets with fewer random reads.
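
The difference between the two orders is easy to reproduce with plain Python. The snippet below is only an illustration with made-up keys, not the `torch.save` code.

```python
# Storage keys are strings, inserted in the order the pickler meets them.
serialized_storages = {str(i): f"storage-{i}" for i in range(12)}

# Sorting string keys compares them character by character.
lexicographic = sorted(serialized_storages)

# Dicts preserve insertion order (guaranteed since Python 3.7).
insertion = list(serialized_storages)

print(lexicographic[:5])
print(insertion[:5])
```

Sorting puts `"10"` and `"11"` before `"2"`, while iterating the dict directly keeps the numeric order in which the keys were inserted.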

* __->__ #143879
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143879
Approved by: https://github.com/albanD
2025-01-31 17:00:23 +00:00
9232355bb0 Add CUDA 12.8 manywheel x86 Builds to Binaries Matrix (#145792)
https://github.com/pytorch/pytorch/issues/145570

Adding cuda 12.8.0 x86 builds first
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145792
Approved by: https://github.com/nWEIdia, https://github.com/malfet, https://github.com/atalman
2025-01-31 16:12:02 +00:00
a7c2d85c18 Add overloads to diagonal docs (#144214)
Fixes #126827. Refactored doc to demonstrate when none of the optional values are passed in. Added another example so that all overloads of the function are covered.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144214
Approved by: https://github.com/albanD
2025-01-31 15:53:59 +00:00
2af876707b [AOTI] Fix a memory leak in package boxed_run (#146100)
Summary: AOTIModelPackageLoaderPybind::boxed_run missed a decref when constructing the returned py::list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146100
Approved by: https://github.com/cpuhrsch
2025-01-31 13:32:28 +00:00
7b07415aaa [export] nested terms in nn_module_stack deserialization (#145901)
Summary: accounting for terms like "getattr(getattr(a[0], b), c)".

Test Plan: test_serialize

Differential Revision: D68784736

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145901
Approved by: https://github.com/angelayi
2025-01-31 10:00:13 +00:00
1f1a9965d5 fix a small typo in comments (#145323)
A minor typo fix.
The description was confusing with the typo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145323
Approved by: https://github.com/Skylion007
2025-01-31 06:45:44 +00:00
c55af2b567 [CMake] Delete Caffe2 inspect_gpu binary (#146105)
It is unbuildable right now because the headers it depends on are gone.

Fixes https://github.com/pytorch/pytorch/issues/146042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146105
Approved by: https://github.com/atalman, https://github.com/seemethere
2025-01-31 06:42:52 +00:00
e84bf88dde [ATen][CUDA] Implement 128 bit vectorization v2 (#145746)
This is a rebase of my previous PR, #141959.

Description from the original PR:

This PR implements 128-bit vectorization. It improves the performance of contiguous elementwise ops by 4-10% on Hopper H100.

<details>

<summary>The benchmark code used </summary>

```Python

import time
import torch
from torch.profiler import profile, ProfilerActivity

def benchmark(function, dtype=torch.float32, check_numerics=True, print_profile=False):
    device = torch.device("cuda")

    shapes = []
    for p in range(24, 30):
        shape = 1<<p
        shapes.append(shape)

    for shape in shapes:
        for _ in range(6):
            x = torch.randn(shape, device=device, dtype=dtype)
            y = function(x)

        if print_profile:
            x = torch.randn(shape, device=device, dtype=dtype)
            with profile(activities=[ProfilerActivity.CUDA], record_shapes=True) as prof:
                y = function(x)
            print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

        x = torch.randn(shape, device=device, dtype=dtype)
        torch.cuda.synchronize()
        t1 = time.perf_counter()
        for _ in range(6):
            y = function(x)
        torch.cuda.synchronize()
        t2 = time.perf_counter()
        perf_time = (t2 - t1) / 6

        print(f"{function.__name__}, {dtype}, {shape}, {perf_time}")
        if check_numerics:
            x_cpu = x.cpu()
            y_cpu = function(x_cpu).cuda()
            try:
                torch.testing.assert_allclose(y_cpu, y)
            except AssertionError as error:
                print("An exception occurred:", error)

def main():
    ops = [
            torch.relu,
            torch.sigmoid,
            torch.tanh,
            torch.nn.functional.gelu,
            torch.sin,
            torch.exp,
    ]

    dtypes = [
            torch.float16,
            torch.bfloat16,
            torch.float32,
    ]

    for op in ops:
        for dtype in dtypes:
            benchmark(op, dtype=dtype)
            torch.cuda.empty_cache()

if __name__ == "__main__":
    main()
```

</details>

<details>

<summary> Results </summary>

| op | dtype | size | time after (s) | time before (s) | % improvement |
| ---- | ---- | ---- | ---- | ---- | ---- |
| relu | torch.float16 | 33554432 | 4.84E-05 | 5.06E-05 | 4.66296539127052 |
| relu | torch.float16 | 67108864 | 9.22E-05 | 9.64E-05 | 4.56491432752297 |
| relu | torch.float16 | 134217728 | 0.000180343495837102 | 0.000187981834945579 | 4.23543919508829 |
| relu | torch.float16 | 268435456 | 0.000355071155354381 | 0.000370856161074092 | 4.44558942107169 |
| relu | torch.float16 | 536870912 | 0.000704489842367669 | 0.000736006341564159 | 4.47366268483987 |
| relu | torch.bfloat16 | 16777216 | 3.03E-05 | 3.04E-05 | 0.166504085842689 |
| relu | torch.bfloat16 | 33554432 | 4.89E-05 | 5.06E-05 | 3.45848238875716 |
| relu | torch.bfloat16 | 67108864 | 9.32E-05 | 9.65E-05 | 3.56122651631445 |
| relu | torch.bfloat16 | 134217728 | 0.000180805509444326 | 0.000187998676362137 | 3.97840029317567 |
| relu | torch.bfloat16 | 268435456 | 0.000356242332297067 | 0.000371279485989362 | 4.22104627356745 |
| relu | torch.bfloat16 | 536870912 | 0.000708114336399982 | 0.000736773828975856 | 4.04729732229083 |
| relu | torch.float32 | 16777216 | 5.61E-05 | 5.61E-05 | 0.0442587268354941 |
| relu | torch.float32 | 33554432 | 9.33E-05 | 9.30E-05 | -0.259070913799022 |
| relu | torch.float32 | 67108864 | 0.000181321326332788 | 0.000181289506144822 | -0.0175490597877115 |
| relu | torch.float32 | 134217728 | 0.000356896334172537 | 0.000356570177245885 | -0.0913870206618981 |
| relu | torch.float32 | 268435456 | 0.000709421835684528 | 0.000707465515006334 | -0.275762681635911 |
| relu | torch.float32 | 536870912 | 0.00141372415237129 | 0.00141036518228551 | -0.237597276678471 |
| sigmoid | torch.float16 | 16777216 | 3.10E-05 | 3.16E-05 | 2.10012593866895 |
| sigmoid | torch.float16 | 33554432 | 4.91E-05 | 5.23E-05 | 6.37710600666122 |
| sigmoid | torch.float16 | 67108864 | 9.30E-05 | 0.000100057009452333 | 7.61866144555331 |
| sigmoid | torch.float16 | 134217728 | 0.000180928347011407 | 0.000194982004662355 | 7.76752669390248 |
| sigmoid | torch.float16 | 268435456 | 0.000355658994521946 | 0.00038468533117945 | 8.16128288742412 |
| sigmoid | torch.float16 | 536870912 | 0.000705982849467546 | 0.000764021339515845 | 8.22094900634937 |
| sigmoid | torch.bfloat16 | 16777216 | 3.08E-05 | 3.17E-05 | 2.90965915673149 |
| sigmoid | torch.bfloat16 | 33554432 | 4.87E-05 | 5.24E-05 | 7.63503884668234 |
| sigmoid | torch.bfloat16 | 67108864 | 9.33E-05 | 0.000100019678939134 | 7.21238137428013 |
| sigmoid | torch.bfloat16 | 134217728 | 0.000180786165098349 | 0.000194868014659733 | 7.78922964250206 |
| sigmoid | torch.bfloat16 | 268435456 | 0.000355564659306159 | 0.000384909333661199 | 8.25297835063321 |
| sigmoid | torch.bfloat16 | 536870912 | 0.000705831005082776 | 0.000764102345177283 | 8.2557070566308 |
| sigmoid | torch.float32 | 16777216 | 4.93E-05 | 5.65E-05 | 14.5314136197766 |
| sigmoid | torch.float32 | 33554432 | 9.32E-05 | 9.31E-05 | -0.120169865610833 |
| sigmoid | torch.float32 | 67108864 | 0.000181328505277634 | 0.000180455681402236 | -0.481349512069855 |
| sigmoid | torch.float32 | 134217728 | 0.000357362829769651 | 0.000356093340087682 | -0.35523831137877 |
| sigmoid | torch.float32 | 268435456 | 0.000708921831877281 | 0.000707052337626616 | -0.263709504574663 |
| sigmoid | torch.float32 | 536870912 | 0.00141358317341656 | 0.0014090768333214 | -0.318788464654745 |
| tanh | torch.float16 | 16777216 | 3.03E-05 | 3.03E-05 | -0.0912564658661808 |
| tanh | torch.float16 | 33554432 | 4.90E-05 | 5.07E-05 | 3.46644442974484 |
| tanh | torch.float16 | 67108864 | 9.30E-05 | 9.68E-05 | 3.99871369815531 |
| tanh | torch.float16 | 134217728 | 0.00018052199933057 | 0.000188717152923346 | 4.53969799978138 |
| tanh | torch.float16 | 268435456 | 0.000355684508879979 | 0.000373026006855071 | 4.8755280430115 |
| tanh | torch.float16 | 536870912 | 0.000706660988119741 | 0.000740105014604827 | 4.73268328765002 |
| tanh | torch.bfloat16 | 16777216 | 2.99E-05 | 3.03E-05 | 1.21049563135981 |
| tanh | torch.bfloat16 | 33554432 | 4.89E-05 | 5.06E-05 | 3.48836101041744 |
| tanh | torch.bfloat16 | 67108864 | 9.28E-05 | 9.69E-05 | 4.39944918036626 |
| tanh | torch.bfloat16 | 134217728 | 0.000180710999605556 | 0.000189167990659674 | 4.67984299382829 |
| tanh | torch.bfloat16 | 268435456 | 0.000356062994493792 | 0.000372666652159144 | 4.66312363882606 |
| tanh | torch.bfloat16 | 536870912 | 0.000707100164921333 | 0.000740134331863374 | 4.67178040408393 |
| tanh | torch.float32 | 16777216 | 5.61E-05 | 5.64E-05 | 0.439595755746353 |
| tanh | torch.float32 | 33554432 | 9.31E-05 | 9.31E-05 | 0.00287633090228212 |
| tanh | torch.float32 | 67108864 | 0.000181465332085888 | 0.000180895323865116 | -0.31411411437098 |
| tanh | torch.float32 | 134217728 | 0.000356963835656643 | 0.000356073161431899 | -0.249513854283251 |
| tanh | torch.float32 | 268435456 | 0.000709201170442005 | 0.00070707315656667 | -0.300057862849997 |
| tanh | torch.float32 | 536870912 | 0.00141367283261692 | 0.00141030051357423 | -0.238550176877922 |
| gelu | torch.float16 | 16777216 | 2.73E-05 | 3.17E-05 | 15.921079070745 |
| gelu | torch.float16 | 33554432 | 5.06E-05 | 5.55E-05 | 9.76345374333098 |
| gelu | torch.float16 | 67108864 | 9.65E-05 | 0.000106600326641152 | 10.4308039074712 |
| gelu | torch.float16 | 134217728 | 0.000187776672343413 | 0.000208565829476962 | 11.0712139447915 |
| gelu | torch.float16 | 268435456 | 0.000370216167842348 | 0.000412251994324227 | 11.3544005187205 |
| gelu | torch.float16 | 536870912 | 0.000737301345604161 | 0.000819394170927505 | 11.1342296895002 |
| gelu | torch.bfloat16 | 16777216 | 3.02E-05 | 3.08E-05 | 1.78405479367653 |
| gelu | torch.bfloat16 | 33554432 | 5.13E-05 | 5.69E-05 | 10.9929393318302 |
| gelu | torch.bfloat16 | 67108864 | 9.76E-05 | 0.00010968199543034 | 12.3420807512356 |
| gelu | torch.bfloat16 | 134217728 | 0.000189661824454864 | 0.000214487663470209 | 13.0895287371091 |
| gelu | torch.bfloat16 | 268435456 | 0.000374197009174774 | 0.000423670164309442 | 13.2211519391275 |
| gelu | torch.bfloat16 | 536870912 | 0.000743675006863972 | 0.000842577001700799 | 13.299088166737 |
| gelu | torch.float32 | 16777216 | 5.06E-05 | 5.04E-05 | -0.413385894716413 |
| gelu | torch.float32 | 33554432 | 9.31E-05 | 9.32E-05 | 0.134157041722546 |
| gelu | torch.float32 | 67108864 | 0.000181480175039421 | 0.000180836669945469 | -0.354586992112075 |
| gelu | torch.float32 | 134217728 | 0.000356874331676712 | 0.000356305002545317 | -0.159532104402047 |
| gelu | torch.float32 | 268435456 | 0.000708909006789327 | 0.000706991491218408 | -0.270488250615287 |
| gelu | torch.float32 | 536870912 | 0.00141321367118508 | 0.00140937082081412 | -0.271922813181618 |
| sin | torch.float16 | 16777216 | 3.04E-05 | 3.11E-05 | 2.21834939018859 |
| sin | torch.float16 | 33554432 | 4.85E-05 | 5.23E-05 | 7.72165512511596 |
| sin | torch.float16 | 67108864 | 9.31E-05 | 9.98E-05 | 7.24947099480072 |
| sin | torch.float16 | 134217728 | 0.000180371008658161 | 0.000194791161144773 | 7.99471744039613 |
| sin | torch.float16 | 268435456 | 0.000355454161763191 | 0.000384903668115536 | 8.28503630574026 |
| sin | torch.float16 | 536870912 | 0.000705183832906187 | 0.000764360166310022 | 8.39161799270973 |
| sin | torch.bfloat16 | 16777216 | 3.11E-05 | 3.10E-05 | -0.257677954940036 |
| sin | torch.bfloat16 | 33554432 | 4.89E-05 | 5.24E-05 | 7.34808420323539 |
| sin | torch.bfloat16 | 67108864 | 9.26E-05 | 0.000100248667877167 | 8.22347488801205 |
| sin | torch.bfloat16 | 134217728 | 0.000180674154156198 | 0.00019567032965521 | 8.30012215584937 |
| sin | torch.bfloat16 | 268435456 | 0.000355360486234228 | 0.000386023331278314 | 8.62865913118873 |
| sin | torch.bfloat16 | 536870912 | 0.00070483615854755 | 0.000766805159704139 | 8.79197248964745 |
| sin | torch.float32 | 16777216 | 5.67E-05 | 5.64E-05 | -0.441348534920039 |
| sin | torch.float32 | 33554432 | 9.34E-05 | 9.30E-05 | -0.496458540364117 |
| sin | torch.float32 | 67108864 | 0.000181706990891447 | 0.000180556671693921 | -0.633062708199702 |
| sin | torch.float32 | 134217728 | 0.000356894995396336 | 0.000356046327700218 | -0.237791985616354 |
| sin | torch.float32 | 268435456 | 0.000708777321657787 | 0.000707602652255446 | -0.165731798471427 |
| sin | torch.float32 | 536870912 | 0.00141263716310884 | 0.00140912582476934 | -0.248566187496451 |
| exp | torch.float16 | 16777216 | 3.00E-05 | 3.04E-05 | 1.40099098901014 |
| exp | torch.float16 | 33554432 | 4.86E-05 | 5.03E-05 | 3.44611943643906 |
| exp | torch.float16 | 67108864 | 9.37E-05 | 9.55E-05 | 1.96412400380129 |
| exp | torch.float16 | 134217728 | 0.000180913504057874 | 0.000187193179347863 | 3.47109262113439 |
| exp | torch.float16 | 268435456 | 0.00035607748820136 | 0.000369079003576189 | 3.65131630210701 |
| exp | torch.float16 | 536870912 | 0.000707551507124056 | 0.000732363162872692 | 3.50669251620789 |
| exp | torch.bfloat16 | 16777216 | 2.98E-05 | 3.04E-05 | 1.74345594341654 |
| exp | torch.bfloat16 | 33554432 | 4.88E-05 | 5.04E-05 | 3.40217856534821 |
| exp | torch.bfloat16 | 67108864 | 9.32E-05 | 9.62E-05 | 3.29219958210226 |
| exp | torch.bfloat16 | 134217728 | 0.000180999826019009 | 0.000187239318620414 | 3.44723679499521 |
| exp | torch.bfloat16 | 268435456 | 0.000355944503098726 | 0.000369370992605885 | 3.77207384585864 |
| exp | torch.bfloat16 | 536870912 | 0.000707135167128096 | 0.000733066000975668 | 3.66702648277075 |
| exp | torch.float32 | 16777216 | 4.89E-05 | 5.63E-05 | 15.1245314346532 |
| exp | torch.float32 | 33554432 | 9.34E-05 | 9.31E-05 | -0.259945454477446 |
| exp | torch.float32 | 67108864 | 0.000181152504713585 | 0.000180474346658836 | -0.374357536939058 |
| exp | torch.float32 | 134217728 | 0.000356771342922002 | 0.000355627329554409 | -0.3206573034212 |
| exp | torch.float32 | 268435456 | 0.000708404501589636 | 0.00070713268360123 | -0.179532736671163 |
| exp | torch.float32 | 536870912 | 0.00141283582585553 | 0.00140944866385932 | -0.23974208002295 |

</details>
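
The `% improvement` column in the results appears to be the relative reduction in time measured against the "after" value. The helper below is a hypothetical reconstruction checked against one full-precision row of the table, not code from the PR.

```python
def percent_improvement(time_before, time_after):
    """Relative speedup in percent, using the 'after' time as the baseline."""
    return (time_before - time_after) / time_after * 100.0


# relu, torch.float16, size 134217728 (values copied from the results table)
after = 0.000180343495837102
before = 0.000187981834945579
print(f"{percent_improvement(before, after):.4f}")
```

Rows whose times are shown rounded (for example `4.84E-05`) will not reproduce the listed percentage exactly, since the percentage was presumably computed from unrounded timings.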

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145746
Approved by: https://github.com/eqy, https://github.com/ngimel

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-01-31 06:42:08 +00:00
eeb5e1bf20 [AOTI] Cache treespec_loads calculation (#145815)
Summary: Treespec can be reused instead of calculated from str every AOTI module call. Using cached result saves 0.2ms for each module call.
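
The caching idea can be sketched with `functools.lru_cache`. Here `json.loads` is only a stand-in for the real `treespec_loads`, and `load_spec` and `parse_calls` are hypothetical names; the actual change may cache the result differently.

```python
import functools
import json

parse_calls = 0  # counts how often the serialized spec is actually parsed


@functools.lru_cache(maxsize=None)
def load_spec(serialized):
    """Parse a serialized spec once per distinct string and reuse the result."""
    global parse_calls
    parse_calls += 1
    return json.loads(serialized)


spec_str = '{"type": "tuple", "children": 2}'
first = load_spec(spec_str)
second = load_spec(spec_str)  # served from the cache, no re-parsing
```

Because the cached object is shared between calls, this pattern is only safe when callers treat the returned spec as read-only.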

Test Plan:
Before:
{F1974751578}

After:
 {F1974751667}

Differential Revision: D68749539

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145815
Approved by: https://github.com/henrylhtsang
2025-01-31 06:38:21 +00:00
57d8278ab9 pickler for GraphModule (#141659)
Pickling GraphModule needs some special handling for wrapping things that normally can't be pickled - but async compile needs to pass them across a wire so we need to be able to serialize it - add some helpers to enable that.

Differential Revision: [D68921318](https://our.internmc.facebook.com/intern/diff/D68921318)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141659
Approved by: https://github.com/jamesjwu
2025-01-31 05:34:28 +00:00
f9227e7c33 Expose ToIValueAllowNumbersAsTensors to TORCH_PYTHON_API so we can use it in monarch (#146087)
Summary: TSIA

Test Plan: Tested up the stack but existing unittests

Reviewed By: suo

Differential Revision: D68917233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146087
Approved by: https://github.com/suo
2025-01-31 05:08:11 +00:00
cf2de4e230 Introduce aoti_call_delegate HOP (#145630)
Summary:
Previously, the AOTI compile node was represented as a kernel-less custom op in the exported program. The node was not eager-runnable, even though running nodes eagerly is a common practice for numerical validation during lowering.

I introduce a new HOP to address this.

The schema is as follows:
```
aoti_call_delegate(lower_moduel: AOTInductorEPModule, original_gm: fx.GraphModule, weights: List[Tensor], inputs: List[Tensor])
```

There are a few problems exposed by the HOP:
- AOTI expects an FX graph with weights as getattr nodes, i.e. a stateful graph. HOPs expect graph_module arguments to be stateless. The export serializer also expects a stateless graph. Currently, to make AOTI happy, I am making `original_gm` stateful and bypassing serialization for `original_gm`.
- As a result, the HOP is not re-traceable, as functionalization on stateful graph module argument will fail.

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test

Reviewed By: zhxchen17

Differential Revision: D68359391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145630
Approved by: https://github.com/zou3519
2025-01-31 04:57:36 +00:00
f358d4d004 [ONNX] Migrate test_torch_export_with_onnxruntime.py to test_small_models_e2e.py (#146095)
With [the deprecation of torch.onnx.dynamo_export](https://github.com/pytorch/pytorch/pull/146003), this PR migrates the torch.export-related tests to torch.onnx.export(..., dynamo=True) and places them in test_small_models_e2e.py.

NOTE: test_exported_program_as_input_from_file and test_onnx_program_supports_retraced_graph are not kept, because they mostly test whether an exported program stays the same after save/load and retrace. In torch.onnx.export(..., dynamo=True), however, we focus on the export from nn.Module to ONNX proto.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146095
Approved by: https://github.com/justinchuby
2025-01-31 03:40:26 +00:00
27e35de6c2 [export] Add distributed test (#146050)
Reland https://github.com/pytorch/pytorch/pull/145886
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146050
Approved by: https://github.com/avikchaudhuri
2025-01-31 02:56:42 +00:00
ffb424eab6 [dynamo/export] call local_scalar_dense when full() value is scalar tensor (#144999)
Fixes https://github.com/pytorch/pytorch/issues/144907
```
        class Foo(torch.nn.Module):
            def forward(self, val):
                return torch.full((80, 2), val, dtype=torch.float32)

        export(Foo(), args=(torch.tensor(1),))
```

When we have a `torch.full` call like above, where the fill value is a scalar Tensor and not a scalar value, the FX graph from `_dynamo.export()` contains a single node: the full op. We run into a `PendingUnbackedSymbolNotFound` error, because the `item()` call is implicit; the UnbackedSymInt is extracted but goes directly into the data of the output tensor value, and we're then unable to locate it when we try to compute unbacked bindings.

On the other hand, non-strict export doesn't face this, because an explicit `item()`, or `local_scalar_dense` node is inserted, and the unbacked binding is directly the example value of that node.

This adds a dynamo handler to imitate what happens in non-strict.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144999
Approved by: https://github.com/angelayi
2025-01-31 02:45:43 +00:00
e01c898e51 [Customized Optimus] Add select cat aten pass (#145918)
Summary: This is follow-up work to D68695717. We can further reduce the number of cat kernels in the backward pass by adding a new pass at the aten level.

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_select_cat_post_grad
```

Buck UI: https://www.internalfb.com/buck2/6943087f-91be-4dbd-9693-df0a11a50b73
Test UI: https://www.internalfb.com/intern/testinfra/testrun/11821949087998233
Network: Up: 101KiB  Down: 132KiB  (reSessionID-60e898af-f366-4247-a9f7-d8d7cd129fe0)
Analyzing targets. Remaining      0/78148
Executing actions. Remaining      0/476147
Command: test.     Finished 2 local
Tests finished: Pass 3. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

### how to add the config

```
        post_grad_fusion_options: {
          "normalization_aten_pass": {},
          "split_cat_aten_pass": {},
          "select_cat_aten_pass": {},
        }
```

{F1974778773}

baseline:

aps-recgpt_ranking_1115_pt2_optimus-e52c1f277e

proposal

aps-recgpt_ranking_1115_pt2_optimus-1b0047ee0e

Differential Revision: D68803384

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145918
Approved by: https://github.com/Yuzhen11
2025-01-31 02:35:10 +00:00
08d88127fe Use Magma-cuda 12.8 for libtorch (#146019)
https://github.com/pytorch/pytorch/issues/145570

Build failure for libtorch wheel
`CUDAContext.cpp:(.text+0x157): additional relocation overflows omitted from the output
/usr/bin/ld: failed to convert GOTPCREL relocation; relink with --no-relax
collect2: error: ld returned 1 exit status`

Unsure if this is related, fixing as a start
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146019
Approved by: https://github.com/eqy
2025-01-31 02:19:23 +00:00
2811f33d12 Fix code cache + freezing compile-time regression (#145868)
Summary: The current implementation introduces a compile-time regression due to the overhead of hashing large constants. To support freezing+caching, we consider only the tensor metadata of frozen params, but we neglect to do the same for any constants created as a result of folding frozen params. This PR explicitly marks the constants created during freezing (and constant folding during freezing) and uses that info in the inductor cache to determine when to hash a tensor value+metadata vs. metadata only.
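
The "metadata only vs. value+metadata" decision can be sketched as follows. This is not the inductor cache implementation: `ConstantInfo`, its `created_by_freezing` flag and `cache_key` are hypothetical names used to illustrate the idea with plain bytes instead of tensors.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ConstantInfo:
    """Hypothetical description of a constant captured by the compiler."""

    shape: tuple
    dtype: str
    data: bytes
    created_by_freezing: bool = False  # set for frozen params and folded constants


def cache_key(const):
    """Hash metadata always; hash the (possibly large) value only when needed."""
    h = hashlib.sha256()
    h.update(repr((const.shape, const.dtype)).encode())
    if not const.created_by_freezing:
        h.update(const.data)  # the expensive part for large constants
    return h.hexdigest()


frozen_a = ConstantInfo((2,), "float32", b"\x00" * 8, created_by_freezing=True)
frozen_b = ConstantInfo((2,), "float32", b"\x01" * 8, created_by_freezing=True)
plain_a = ConstantInfo((2,), "float32", b"\x00" * 8)
plain_b = ConstantInfo((2,), "float32", b"\x01" * 8)
```

In this sketch two marked constants with the same shape and dtype share a key regardless of their contents, while unmarked constants with different contents do not.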

Test Plan: `python benchmarks/dynamo/torchbench.py --backend inductor --device cuda --only alexnet --bfloat16 --cold-start-latency --print-compilation-time --inference --performance --freezing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145868
Approved by: https://github.com/eellison
2025-01-31 02:04:15 +00:00
bf9d053fb8 [Break XPU] Fix Inductor cuda bias UT (#145934)
# Motivation
[Break XPU] inductor ut: `inductor/test_inplace_padding.py::InplacePaddingTest::test_pad_non_zero - RuntimeError: Expected to find "empty_strided_cuda((2048, 2048), (2048, 1), torch.float32).as_strided((2048, 2047), (2048, 1))" but did not find it`

With this PR, `test_pad_non_zero` will pass on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145934
Approved by: https://github.com/jansel, https://github.com/shunting314, https://github.com/desertfire
2025-01-31 01:39:39 +00:00
ccd27e8129 Turn on fx graph cache and automatic dynamic pgo local caches in fbcode (#146065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146065
Approved by: https://github.com/jamesjwu
2025-01-31 01:11:48 +00:00
3fae5c8509 torchgen: support exception boundary for ExecuTorch functions (#144341)
Needed for ExecuTorch diff D67904052.

Differential Revision: [D67906411](https://our.internmc.facebook.com/intern/diff/D67906411/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144341
Approved by: https://github.com/Jack-Khuu
2025-01-31 01:05:21 +00:00
cyy
d94d816d96 Simplify handling of max jobs in CMake builds (#145820)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145820
Approved by: https://github.com/malfet
2025-01-31 00:55:39 +00:00
c70362fac8 [AsyncMM] re-enable and adapt to cutlass 3.6.0 (#144011)
[D68734067](https://our.internmc.facebook.com/intern/diff/D68734067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144011
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-01-31 00:48:51 +00:00
1e3d1738a4 [dynamo][polyfills]Support getrecursionlimit (#145989)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145989
Approved by: https://github.com/StrongerXi, https://github.com/jansel
ghstack dependencies: #145986, #145987, #145994
2025-01-31 00:47:31 +00:00
e7bb608d02 [dynamo][dicts] Support construction of types.MappingProxyType (#145994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145994
Approved by: https://github.com/StrongerXi, https://github.com/jansel
ghstack dependencies: #145986, #145987
2025-01-31 00:47:31 +00:00
4665bc2cc0 [dynamo][functions] Support id on function (#145987)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145987
Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #145986
2025-01-31 00:47:23 +00:00
56307dc370 [dynamo][dicts] Raise exception on pop (#145986)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145986
Approved by: https://github.com/Skylion007, https://github.com/williamwen42, https://github.com/StrongerXi, https://github.com/jansel
2025-01-31 00:47:13 +00:00
e6704a2447 Allow replacing unbacked with very large upperbound by returning no-op for FloorToInt(int) (#146001)
* Let's say x is an integer beyond 2^53 where Python floats lose precision i.e. can't increment by 1.
* Therefore, float(x) will lose precision and won't retain the exact value of x even though it's an integer.
* That means `FloorToInt(very_large_number)` will lose precision if we cast it to float
```
>>> int(float(1000000007999999992))
1000000008000000000
```
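
The rounding above can be reproduced with plain Python integers and floats. This is a minimal sketch using only built-ins; `survives_float_round_trip` is an illustrative helper name and is not part of PyTorch:

```python
# Integers above 2**53 are not all exactly representable as IEEE-754 doubles,
# so an int -> float -> int round trip may change the value.
def survives_float_round_trip(n: int) -> bool:
    """Return True if converting n to float and back preserves its value."""
    return int(float(n)) == n


upper_bound = 1000000007999999992  # the upper bound from the example above

print(survives_float_round_trip(2**53))        # True
print(survives_float_round_trip(2**53 + 1))    # False
print(survives_float_round_trip(upper_bound))  # False
print(int(float(upper_bound)) - upper_bound)   # 8
```

The difference of 8 is exactly the gap between the two value ranges in the assertion failure shown further down.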

This means when we try to do this in set_replacement():
32bb6f83d5/torch/fx/experimental/symbolic_shapes.py (L6011-L6019)

We run into this:
```
TORCH_LOGS="+torch.fx.experimental.symbolic_shapes" pytest -s test_export.py -k test_replace_unbacked_with_very_large_upperbound

  File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6258, in _maybe_guard_rel
    self._set_replacement(rhs, self._find(lhs), "trivial_rhs")
  File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6039, in _set_replacement
    assert tgt_bound.issubset(
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in function add>(*(FakeTensor(..., size=(2*s0,)), FakeTensor(..., size=(u0,))), **{}):
tgt_bound=VR[4, 1000000008000000000] not a subset of src_bound=VR[4, 1000000007999999992]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146001
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #145898
2025-01-31 00:25:20 +00:00
c72b536420 Add manual override flag for core ATen op detection during bc check (#146052)
Fixes https://github.com/pytorch/pytorch/issues/146049

Today the bc detection logic ignores allow_list for core ATen ops (A PR landed 4 months ago to enable this). The problem is that if I have a PR that removes an op, the script can no longer check whether that op is core ATen op (today we just error out).

With my fix, the script (1) conservatively assumes the op is a core ATen op in such cases, and (2) allows the user to specify in their ALLOW_LIST entry that their op is not a core ATen op.

Test plan:
- This is tested 2 PRs above

016bdafdcb/test/forward_backward_compatibility/check_forward_backward_compatibility.py (L129-L137)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146052
Approved by: https://github.com/albanD
2025-01-30 23:57:01 +00:00
720b8d0d8d [inductor/profiler] add kernel kwargs instrumentation (#145573)
## About

As above, record the kernel launch kwargs. These tend to be constexpr arguments to Triton kernels, such as block size.

## Test program

Note, install triton before proceeding (pip install triton)

triton_test.py>>>
```
import torch
from torch.profiler import profile, ProfilerActivity

def foo(x, y):
    a = torch.sin(x)
    b = torch.cos(y)
    return a + b

def main():
    x = torch.randn(10, 10).cuda()
    y = torch.randn(10, 10).cuda()
    opt_foo = torch.compile(foo)
    z = opt_foo(x, y)

    # Profile the kernel function on the GPU
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True
    ) as prof:
        z = opt_foo(x, y)

    # Export the trace to a file
    prof.export_chrome_trace("my_kernel_trace.json")

if __name__ == "__main__":
    main()
```

Run it and we should get a trace file my_kernel_trace.json

Output has triton event with the kernel_kwargs attribute.
```
  {
    "ph": "X", "cat": "cpu_op", "name": "triton_poi_fused_add_cos_sin_0", "pid": 2480815, "tid": 2480815,
    "ts": 2045246693014.959, "dur": 75.662,
    "args": {
      ...
      "kernel_backend": "triton",
      "num_warps": 4,
      "kernel_kwargs": "XBLOCK=128", "num_stages": 1, "grid": "grid(100,)",
      "kernel_file": "/tmp/torchinductor_bcoutinho/ow/cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor.py",
      "kernel_hash": "cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor"
    }
  },
```
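
To check the exported trace programmatically, something like the following sketch could be used. It assumes the file follows the usual Chrome trace layout with a top-level `traceEvents` list; `find_kernel_kwargs` is a hypothetical helper, and the sample below is a trimmed stand-in for `my_kernel_trace.json` (the `aten::sin` event is made up for contrast):

```python
import json


def find_kernel_kwargs(trace: dict) -> dict:
    """Map event name -> kernel_kwargs for events that carry that attribute."""
    found = {}
    # Assumption: events live under the top-level "traceEvents" key.
    for event in trace.get("traceEvents", []):
        args = event.get("args", {})
        if "kernel_kwargs" in args:
            found[event["name"]] = args["kernel_kwargs"]
    return found


# Trimmed stand-in for the contents of my_kernel_trace.json.
sample_json = """
{"traceEvents": [
  {"ph": "X", "cat": "cpu_op", "name": "triton_poi_fused_add_cos_sin_0",
   "args": {"kernel_backend": "triton", "num_warps": 4, "kernel_kwargs": "XBLOCK=128"}},
  {"ph": "X", "cat": "cpu_op", "name": "aten::sin", "args": {}}
]}
"""

print(find_kernel_kwargs(json.loads(sample_json)))
# {'triton_poi_fused_add_cos_sin_0': 'XBLOCK=128'}
```

For a real trace, replace `json.loads(sample_json)` with `json.load` on the opened file.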

## Unit Test
Updated unit test:
```
pytest test/inductor/test_profiler.py -k test_pt2_triton_attributes
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145573
Approved by: https://github.com/davidberard98, https://github.com/jansel
2025-01-30 23:51:44 +00:00
8117656162 nonzero_static with symint size (#146006)
Summary: Previously `nonzero_static` would force specialization on the `size` argument. This PR enables it to be used with a dynamic `size` argument.

Test Plan: added test

Differential Revision: D68874784

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146006
Approved by: https://github.com/angelayi
2025-01-30 23:42:42 +00:00
9fdc20809a [PGNCCL] Simplify support macro definition (#145964)
- Promotes usage of `NCCL_VERSION_CODE >= NCCL_VERSION(X, Y, Z)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145964
Approved by: https://github.com/fduwjj, https://github.com/shuqiangzhang
ghstack dependencies: #145893
2025-01-30 23:26:32 +00:00
4280232f21 Revert "Advance past fc window for stft center (#145437)"
This reverts commit 3ef1551f5a745c1d37ff421eb4678814ef4483e4.

Reverted https://github.com/pytorch/pytorch/pull/145437 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks some slow trunk tests ([comment](https://github.com/pytorch/pytorch/pull/145437#issuecomment-2625840742))
2025-01-30 23:14:16 +00:00
f85e4c1360 Enable C++ API parity tests on AArch64 (#145370)
Re-enables C++ API parity tests on AArch64 which now pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145370
Approved by: https://github.com/albanD
2025-01-30 22:42:49 +00:00
2f60f12f8b [Torch] Extract arange_out resizing logic into a helper function that can be used by other devices (#145747)
Summary: We want to use the resizing implementation for arange_out in other devices (in this case MTIA), to make sure that the computations match and to avoid off-by-one-errors.

Test Plan: Existing CI tests pass.

Differential Revision: D68694489

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145747
Approved by: https://github.com/mortzur
2025-01-30 22:37:00 +00:00
99a0940991 [MPS] Fix regression in con-contig bitwise ops (#146085)
Caused by https://github.com/pytorch/pytorch/pull/128393, which changed the semantics of `needsGather` and resulted in silent correctness errors on MacOS-15+ if the output tensor is non-contiguous

Fixes https://github.com/pytorch/pytorch/issues/145203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146085
Approved by: https://github.com/dcci
2025-01-30 22:36:56 +00:00
e2917245fb [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee, https://github.com/malfet
2025-01-30 22:33:50 +00:00
7391cea857 Revert "[triton] Update pin to tip of 3.2 release (#145867)"
This reverts commit 5e5da9bd9afdbb51da3dcc39947347279ccd9130.

Reverted https://github.com/pytorch/pytorch/pull/145867 on behalf of https://github.com/ZainRizvi due to Sorry, this PR may have been written correctly, but something is clearly broken with the infra that's making CI very unhappy with this new triton version.  Since this has been blocking viable/strict upgrades for a couple days now, I'm reverting this PR.  I'll sync with @atalman on how we should fix this. ([comment](https://github.com/pytorch/pytorch/pull/145867#issuecomment-2625720817))
2025-01-30 22:24:09 +00:00
23695ea002 Fix dynamo use of list[int] in graph break (#145554)
This reintroduces the change backed out by #145393 and fixes the underlying problem.

Although using a BuiltinVariable was better than nothing when we saw a GenericAlias, it had problems if there was a graph break and we had to reconstruct the original Python code: BuiltinVariable reconstructed it as a plain `list` instead of `list[int]`.

This changes it to use a TypingVariable instead and then teaches TypingVariable how to reconstruct.

Original commit changeset: 77b9193acb23

python test/dynamo/test_repros.py ReproTests.test_graph_break_on_jit_isinstance

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145554
Approved by: https://github.com/anijain2305
ghstack dependencies: #145551, #145552, #145553
2025-01-30 22:21:40 +00:00
fbb076cc45 Fix call to create_load_global (#145553)
There is no version of create_load_global() that takes three parameters - any use of this function will fail. I think this is probably the correct fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145553
Approved by: https://github.com/anijain2305
ghstack dependencies: #145551, #145552
2025-01-30 22:21:40 +00:00
ccbbc88bbb Turn on mypy for _dynamo/variables/builtin.py (#145552)
The fact that mypy errors were ignored was hiding several bugs in builtin.py (for example the previous diff's incorrect override and use of `call_getattr`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145552
Approved by: https://github.com/anijain2305, https://github.com/Skylion007
ghstack dependencies: #145551
2025-01-30 22:21:32 +00:00
f3120f6d26 Remove incorrect BuiltinVariable.call_hasattr() (#145551)
BuiltinVariable.call_hasattr() overrides the base class - but actually behaves differently. The base is `obj.call_hasattr(tx, attr)` but BuiltinVariable's version is `<unused>.call_hasattr(tx, obj, attr)`.

The BuiltinVariable version is used as a pattern from `call_self_handler()` for `BuiltinVariable(hasattr)`. I think the other version is just used for internal `hasattr(obj, name)` so I renamed that one to `call_obj_hasattr`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145551
Approved by: https://github.com/anijain2305
2025-01-30 22:21:19 +00:00
clr
d100e9ae74 inductor: Don't throw an internal error when a nn.module is missing a attribute (#145122)
If an nn.module getattr call throws, we should make sure that we don't crash with an internal error.

Note that I couldn't figure out how to test this, so advice would be awesome.  I have my best case attempt at  https://github.com/pytorch/pytorch/pull/145799, but it doesn't seem to reproduce the crash.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145122
Approved by: https://github.com/jansel
2025-01-30 21:55:29 +00:00
08ff11e9d0 initialize device when pinning memory on this device, short circuit i… (#145752)
…s_pinned if device is not initialized
Do not land
RFC
potential fix for #144687

Now `.is_pinned(device="cuda")` does not initialize device and thus doesn't poison the fork (but it complains about `device` arg being deprecated). To not need `device=` arg we'd need to fix get_accelerator to not initialize device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145752
Approved by: https://github.com/albanD

Co-authored-by: albanD <albandes@fb.com>
2025-01-30 21:37:29 +00:00
1252c1933d Update to remind users to use torch.compile template (#145960)
Users have been submitting fuzzer issues without meeting the requirements outlined in the torch.compile issue template. This updates the note to remind users to use the torch.compile template for torch.compile bugs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145960
Approved by: https://github.com/eellison
2025-01-30 21:34:40 +00:00
d14046b58d Update fuzzer guidance to include rng (#145962)
Add another condition to fuzzer issue guidance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145962
Approved by: https://github.com/eellison
2025-01-30 21:33:57 +00:00
7e7341bddd [hop] fix unbacked_bindings meta for while_loop (#143559)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143559
Approved by: https://github.com/zou3519
2025-01-30 21:33:09 +00:00
9f9904172d [scan] scan dim handling in user-facing scan() (#145179)
This PR introduces the capability that the scan dim is handled in the user facing scan() call. Internally, the scan dim is always shifted to dim 0 and then the scan is performed over that dim.

This is a follow-up PR from https://github.com/bohnstingl/pytorch/pull/3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145179
Approved by: https://github.com/ydwu4
2025-01-30 21:09:07 +00:00
70f6aaa786 [OSS] Add kwargs to fsspec reader/writer (#145845)
Summary: Add kwargs to fsspec reader/writer. This will be used when reading/writing from huggingface because it needs a token to access the repositories

Test Plan: https://fburl.com/anp/agkrlas1 ability to read write to hf with fsspec

Differential Revision: D68738777

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145845
Approved by: https://github.com/mhorowitz
2025-01-30 21:00:58 +00:00
e6c39d37e9 [ONNX] Create deprecation warning on dynamo_export (#146003)
Deprecation of `torch.onnx.dynamo_export`:

* [`torch/onnx/_internal/_exporter_legacy.py`](diffhunk://#diff-4d1eb96fe68ea904dcd1f8211318b9ff882dbfe4c3cb725ffc164b6c5a58b74cR83-R86): Added deprecation warnings to the `OnnxRegistry`, `ExportOptions`, `ONNXRuntimeOptions`, and `dynamo_export` functions, indicating that `torch.onnx.dynamo_export` is deprecated since version 2.6.0 and should be replaced with `torch.onnx.export(..., dynamo=True)`. [[1]](diffhunk://#diff-4d1eb96fe68ea904dcd1f8211318b9ff882dbfe4c3cb725ffc164b6c5a58b74cR83-R86) [[2]](diffhunk://#diff-4d1eb96fe68ea904dcd1f8211318b9ff882dbfe4c3cb725ffc164b6c5a58b74cR231-R234) [[3]](diffhunk://#diff-4d1eb96fe68ea904dcd1f8211318b9ff882dbfe4c3cb725ffc164b6c5a58b74cR442-R445) [[4]](diffhunk://#diff-4d1eb96fe68ea904dcd1f8211318b9ff882dbfe4c3cb725ffc164b6c5a58b74cR700-R703)

This PR also removed the `**_` kwarg on onnx.export so that users get an error when they supply an unexpected argument.

Updated to emit deprecation warning because it is more appropriate: https://docs.python.org/3/library/exceptions.html#DeprecationWarning
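
As a rough illustration of the two behaviours described above (a `DeprecationWarning` on the old entry point, and an error on unexpected keyword arguments once a catch-all `**_` is gone), here is a minimal standard-library sketch. `legacy_export` and `new_export` are hypothetical stand-ins, not the actual `torch.onnx` functions:

```python
import warnings


def new_export(model, *, dynamo=False):
    """Hypothetical replacement entry point; it has no catch-all **kwargs."""
    return ("exported", model, dynamo)


def legacy_export(model):
    """Hypothetical deprecated entry point that forwards to the new one."""
    warnings.warn(
        "legacy_export is deprecated; use new_export(..., dynamo=True) instead.",
        category=DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return new_export(model, dynamo=True)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # DeprecationWarning is often filtered by default
    result = legacy_export("model")

print(caught[0].category.__name__)  # DeprecationWarning

try:
    new_export("model", unexpected=1)
except TypeError as exc:
    print(type(exc).__name__)  # TypeError
```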
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146003
Approved by: https://github.com/titaiwangms
2025-01-30 20:13:32 +00:00
1fdb4d65c0 [MPS] Extend torch.mm/torch.bmm to integral types (#145809)
Uses the `naive_mm` kernel, making sure that accumulation is done in int32 for smaller int types (and in float for half and bfloat), and adds `naive_bmm`, which follows the same pattern.
Remove stale restriction on `torch.dot` (which works fine on MacOS-14/15)
This also enables integer op flavors for:
- `addmv`
- `einsum`
- `inner`
- `linalg.multi_dot`
- `matmul`
- `mv`
- `tensordot`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145809
Approved by: https://github.com/dcci
2025-01-30 19:35:25 +00:00
3ef1551f5a Advance past fc window for stft center (#145437)
Long overdue follow-up on https://github.com/pytorch/pytorch/pull/73432/files#diff-5f3d4caa0693a716fc46fd7f6339312f1b5f0bf89e3a3ff58e9dc13a9486b17aR719

Onnx stft doesn't support centering, [and all of the existing tests are for center = False](https://github.com/pytorch/pytorch/blob/main/test/onnx/test_pytorch_onnx_onnxruntime.py#L8026). I will open a follow-up issue to address this, this is just a nice-to-have.

Pr chain:
- -> [Advance past fc window for stft center #145437](https://github.com/pytorch/pytorch/pull/145437)
- [Add stft option to align window for center = false #145324](https://github.com/pytorch/pytorch/pull/145324)
- [Add istft option to align window for center = false](https://github.com/pytorch/pytorch/pull/145510)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145437
Approved by: https://github.com/justinchuby, https://github.com/iseeyuan
2025-01-30 19:09:18 +00:00
a3698ebd5c [while_loop] specialize when cond_fn return constants (#144515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144515
Approved by: https://github.com/zou3519
2025-01-30 19:02:34 +00:00
16420a78eb [AOTI] Remove AOTI_USE_CREATE_TENSOR_FROM_BLOB_V1 (#146039)
Summary: The AOTI_USE_CREATE_TENSOR_FROM_BLOB_V1 macro was used to solve a FC issue and it can be removed now.

Test Plan: CI

Differential Revision: D68871245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146039
Approved by: https://github.com/yushangdi, https://github.com/hl475
2025-01-30 19:01:19 +00:00
d1143c4b37 [export] fix non-strict pre_dispatch exporting while_loop (#145762)
fix https://github.com/pytorch/pytorch/issues/145737.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145762
Approved by: https://github.com/tugsbayasgalan, https://github.com/zou3519, https://github.com/avikchaudhuri
2025-01-30 18:58:34 +00:00
clr
f746bb6311 config: Don't spam warnings about reference type configs (#145800)
Summary:
https://github.com/pytorch/pytorch/issues/145755

The is_dynamic check for reference types was subtly broken, causing log spam
after it was accessed

Added an explicit type for is_default for reference types to make sure this
behaviour is correct
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145800
Approved by: https://github.com/eellison
2025-01-30 18:57:16 +00:00
5a527fa5ee Make sure not using cpp wrapper when setting nvtx training annotation (#145538)
Longer term would be good to add as a feature to cpp_wrapper, but this makes sure it doesn't fail on main.

Not sure if this needs a test because it's not meant to compose, but will add one if necessary.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145538
Approved by: https://github.com/desertfire
2025-01-30 18:34:22 +00:00
3ee655e4d4 [async-TP] Fix scheduling in matmul+reduce-scatter for 2 ranks (#145846)
A sleep is issued in order to "nudge" CUDA into making the right scheduling decision, but it is issued on iteration number 2. When the world size is 2, we never reach that iteration, which led to suboptimal scheduling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145846
Approved by: https://github.com/yifuwang
2025-01-30 18:26:34 +00:00
51ee9b154e [c10d] Add NCCL memory allocator (#145675)
This PR implements a small UI improvement over #133603.

It prepares an NCCL memory allocator in torch C++ and then exposes it through pybind, so that users can use it directly.

UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab
2025-01-30 18:19:00 +00:00
7796e308d0 Record inputs at time of tracing, constrain to them for triton fn (#145448)
Record input fake tensors at time of tracing and store them in the node meta. Inductor passes have the possibility of changing strides, so it is safer to record the strides of the inputs at tracing. See, https://github.com/pytorch/pytorch/issues/137979 for more context.

We can also extend this to custom ops, and user-visible outputs. If this ends up being compilation time sensitive we can just record strides (and maybe storage offset, per @zou3519) instead of the complete fake tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145448
Approved by: https://github.com/zou3519
ghstack dependencies: #145953
2025-01-30 16:54:08 +00:00
967cf85f3a Revert "Update mi300 labels to account for multiple clusters. (#145923)"
This reverts commit 3e135993bd0fa08cbff565ae76bb15cb08e1d6d0.

Reverted https://github.com/pytorch/pytorch/pull/145923 on behalf of https://github.com/atalman due to reverting back to one cluster ([comment](https://github.com/pytorch/pytorch/pull/145923#issuecomment-2625022826))
2025-01-30 16:45:50 +00:00
1c3df9ca8c Fix signif_strides_equal for symints, dedupe (#145953)
Previous impl would take a size hint, which was failing internally with a
```
strides1 = [V.graph.sizevars.size_hint(strides1[i]) for i in non_1_indices]
  File "/dev/shm/uid-30083/6f57b5f9-seed-nspid4026541609_cgpid284393-ns-4026541967/torch/_inductor/sizevars.py", line 554, in size_hint
    return int(out)
  File "/dev/shm/uid-30083/6f57b5f9-seed-nspid4026541609_cgpid284393-ns-4026541967/sympy/core/expr.py", line 307, in __int__
    raise TypeError("Cannot convert symbols to int")
```

There are unbacked tests in test_triton which should exercise this, as well as other tests for these functions when they were added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145953
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2025-01-30 16:44:32 +00:00
aaddfc5a7f Add TORCHINDUCTOR_VEC_ISA_OK env var for vec_isa_ok (#134667)
Adds a `TORCHINDUCTOR_VEC_ISA_OK` env var for `vec_isa_ok` for A/B testing purposes. The setup is similar to `fx_graph_remote_cache`, allowing a default of `None`.

No tests were present for any other config settings here, nor for `vec_isa_ok` so I didn't add any.

Motivation:
PyTorch uses filelock with a timeout to determine if the CPU supports particular intrinsics: pytorch/torch/_inductor/cpu_vec_isa.py
Therefore, if two processes are running and each encounters the HAS_CPU test, a process that cannot acquire the lock for checking vec_isa_ok has its main thread put to sleep. Hence there is a bias towards non-sleeping processes (i.e. newly spawned processes) in acquiring the lock.

To avoid this, use an env variable so that each process knows the result without going through the check.
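
A tri-state setting of this kind (true, false, or "not set, fall back to the normal check") can be sketched as below. The helper name `optional_bool_from_env` and the exact `"1"`/`"0"` parsing are assumptions for illustration, not the code added in this PR:

```python
import os
from typing import Optional


def optional_bool_from_env(name: str) -> Optional[bool]:
    """Return True for "1", False for "0", and None otherwise (including unset)."""
    value = os.environ.get(name)
    if value == "1":
        return True
    if value == "0":
        return False
    return None  # caller falls back to the regular (lock-protected) check


os.environ["TORCHINDUCTOR_VEC_ISA_OK"] = "1"
print(optional_bool_from_env("TORCHINDUCTOR_VEC_ISA_OK"))  # True

os.environ.pop("TORCHINDUCTOR_VEC_ISA_OK")
print(optional_bool_from_env("TORCHINDUCTOR_VEC_ISA_OK"))  # None
```

Returning `None` rather than `False` when the variable is unset keeps the default behaviour unchanged for processes that do not opt in.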

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134667
Approved by: https://github.com/eellison
2025-01-30 16:22:48 +00:00
5fa28bbe40 Revert "[c10d] Add NCCL memory allocator (#145675)"
This reverts commit 18a7a04c4adecda3be17dd364d48d484fd1dcdba.

Reverted https://github.com/pytorch/pytorch/pull/145675 on behalf of https://github.com/ZainRizvi due to Sorry but this still fails internally. See D68866823 for details ([comment](https://github.com/pytorch/pytorch/pull/145675#issuecomment-2624900562))
2025-01-30 16:01:52 +00:00
50086ab537 [ONNX] Delete rename_dynamic_shapes_with_model_inputs (#146002)
Basically, this function brings more cons than pros.

It was nice to have automation that helps users convert the top-level keys of dynamic shapes to arg names. However, this function has a bug when the number of model inputs happens to match the number of entries in dynamic_shapes:

```python
input_names
# 'input_ids', 'past_key_values.0.key', 'past_key_values.0.value', 'past_key_values.1.key', 'past_key_values.1.value', 'past_key_values.2.key', 'past_key_values.2.value', 'past_key_values.3.key', 'past_key_values.3.value', 'past_key_values.4.key', 'past_key_values.4.value', 'attention_mask', 'position_ids'

inspect.signature(model.forward).parameters
# mappingproxy(OrderedDict([('input_ids', <Parameter "input_ids: Optional[torch.LongTensor] = None">), ('past_key_values', <Parameter "past_key_values: Union[transformers.cache_utils.Cache, Tuple[Tuple[torch.Tensor]], NoneType] = None">), ('attention_mask', <Parameter "attention_mask: Optional[torch.FloatTensor] = None">), ('token_type_ids', <Parameter "token_type_ids: Optional[torch.LongTensor] = None">), ('position_ids', <Parameter "position_ids: Optional[torch.LongTensor] = None">), ('head_mask', <Parameter "head_mask: Optional[torch.FloatTensor] = None">), ('inputs_embeds', <Parameter "inputs_embeds: Optional[torch.FloatTensor] = None">), ('labels', <Parameter "labels: Optional[torch.LongTensor] = None">), ('use_cache', <Parameter "use_cache: Optional[bool] = None">), ('output_attentions', <Parameter "output_attentions: Optional[bool] = None">), ('output_hidden_states', <Parameter "output_hidden_states: Optional[bool] = None">), ('return_dict', <Parameter "return_dict: Optional[bool] = None">), ('cache_position', <Parameter "cache_position: Optional[torch.LongTensor] = None">)]))
```

In the above case, the given input_names follow the ONNX graph, yet they have the same length as the parameters of the torch model's forward call. Cases like this are difficult to detect and to handle automatically for users.

On the other hand, the error message from torch.export.export is informative enough that I believe users will know how to proceed from there:

```python

import torch

class Model(torch.nn.Module):
    def forward(self, x=None, y=None):
        return x + y

dim = torch.export.Dim("x", min=1, max=6)
onnx_program = torch.export.export(
    Model(),
    (),
    kwargs={"x": torch.randn(2, 3), "y": torch.randn(2, 3)},
    dynamic_shapes={"custom_input_x": {0: dim}, "custom_input_y": {0: dim}},
)

# torch._dynamo.exc.UserError: When `dynamic_shapes` is specified as a dict, its top-level keys must be the arg names ['x', 'y'] of `inputs`, but here they are ['custom_input_x', 'custom_input_y']. Alternatively, you could also ignore arg names entirely and specify `dynamic_shapes` as a list/tuple matching `inputs`. For more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#dynamic-shapes-validation
```
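
The mismatch that the error message reports can also be spotted ahead of time with `inspect.signature`. This is only a sketch of the idea, not how `torch.export` performs its validation; `unknown_dynamic_shape_keys` is a hypothetical helper and the `"dim"` strings stand in for real `Dim` objects:

```python
import inspect


def unknown_dynamic_shape_keys(fn, dynamic_shapes: dict) -> list:
    """Return top-level keys of dynamic_shapes that are not parameter names of fn."""
    params = inspect.signature(fn).parameters
    return [key for key in dynamic_shapes if key not in params]


def forward(x=None, y=None):
    return x + y


bad = {"custom_input_x": {0: "dim"}, "custom_input_y": {0: "dim"}}
good = {"x": {0: "dim"}, "y": {0: "dim"}}

print(unknown_dynamic_shape_keys(forward, bad))   # ['custom_input_x', 'custom_input_y']
print(unknown_dynamic_shape_keys(forward, good))  # []
```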

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146002
Approved by: https://github.com/justinchuby
2025-01-30 16:01:38 +00:00
894ef8c1e3 [torchbench] Inductor freezing bfloat16 conv folding needs high tolerance (#145623)
Issue:
https://github.com/pytorch/pytorch/issues/144888

The Torchbench run of the timm lcnet_050 model fails the accuracy check with `--freezing` `--inference` `--bfloat16`:
`res_error==0.12`
With inductor convolution constant folding turned off: `res_error==0.016`

`float16 error ~ 0.00669`
`float16 without conv folding ~ 0.0018`

Convolution folding increases the error by almost an order of magnitude.

I think we should revisit this and try to improve the accuracy of conv folding, for example by doing conv folding at compilation time in float64.

At the moment I am adding counters to identify whether convolution folding happened and, in the case of bfloat16 with conv folding, increasing the multiplier to the max level (10) to pass the accuracy test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145623
Approved by: https://github.com/eellison
2025-01-30 12:46:35 +00:00
ffa628169d [ATen][Native][CUDA][SCALED_MM] limit f8f8bf16 rowwise scaled matmul to sm_90 (#145728)
The CUTLASS-based kernel for f8f8bf16 rowwise scaled matmul is specific to Hopper devices only. It is not re-usable on newer devices without modifications. This PR adds a guard for this matmul to be sm_90 specific. Once the kernel is there, the guard may be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145728
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-01-30 11:19:58 +00:00
6bd19e65b1 add inductor_triton_kernel_mapping_post_grad.json to tlparseadd changes (#145954)
Landing D67612181 here. The original exported PR somehow fails OSS CI, but this one doesn't (though the PR content is the same).

Add debug trace artifact to inductor_triton_kernel_mapping_post_grad.json (debug artifact for provenance tracking) to tlparse.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145954
Approved by: https://github.com/YUNQIUGUO
2025-01-30 06:18:48 +00:00
8a6e9a88e9 Let PYTORCH_NO_CUDA_MEMORY_CACHING has effect only when value is 1 (#145905)
Fixes #145661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145905
Approved by: https://github.com/eqy, https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-01-30 05:11:10 +00:00
58cc6693cb [BE] Type annotate wrapper_benchmark.py and cuda_combined_scheduling.py (#145542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145542
Approved by: https://github.com/eellison
2025-01-30 03:53:52 +00:00
8cc6f17334 [CD] Install OpenMP from homebrew (#145889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145889
Approved by: https://github.com/atalman
ghstack dependencies: #145871, #145870
2025-01-30 03:19:51 +00:00
0d5f0a81c5 [CMake] Find HomeBrew OpenMP on MacOS (#145870)
Either via `OMP_PREFIX` envvar or by searching in `/opt/homebrew/opt/libomp` folder

Modify libomp bundling logic in setup.py to change absolute path to libomp.dylib to a relative one if necessary
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145870
Approved by: https://github.com/Skylion007, https://github.com/atalman
ghstack dependencies: #145871
2025-01-30 03:19:51 +00:00
cyy
116af809eb Use std::string_view (#145906)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145906
Approved by: https://github.com/albanD
2025-01-30 03:14:27 +00:00
933b6d9830 cpp_wrapper: enable in aarch64 and x86 nightly dashboard performance runs (#145791)
Adds `cpp_wrapper` mode to the nightly inductor benchmark runs, as well as optionally for manually triggered runs. This is justified by `aot_inductor` already being in those runs.

Additionally, re-enables `aot_inductor` in the nightly aarch64 runs. It was disabled 5 months ago to deal with a performance instability, which has likely gone away at this point.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145791
Approved by: https://github.com/desertfire
2025-01-30 02:55:45 +00:00
32bb6f83d5 Make sure that benchmark_harness is set before running (#145532)
Running torch compile with these options causes an error, because the benchmark code isn't generated but is still called.
```
options={'profile_bandwidth_output': 'foo', 'benchmark_harness': False}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145532
Approved by: https://github.com/eellison
2025-01-30 01:25:53 +00:00
25ca05eebf [PGNCCL] Correct some ifdef's (#145893)
`create` function supporting `ncclConfig_t` should be wrapped inside `NCCL_HAS_CONFIG` instead of `NCCL_HAS_COMM_NONBLOCKING`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145893
Approved by: https://github.com/c-p-i-o
2025-01-30 01:05:21 +00:00
73dde451b7 [pytorch] Sprinkle in a few template keywords (#145877)
Summary:
These seem to be necessary to get compilation working on Windows with
CUDA 12.8. I'm not sure whether this means that all of the previous compilers
were broken, and the new one is better, or whether this is a regression in NVCC
12.8. Either way, as long as the CI passes for existing versions, this should
unblock us from CUDA 12.8 enablement on Windows.

See D68663662 for more details on the CUDA 12.8 enablement.

Test Plan: CI!

Reviewed By: akrieger

Differential Revision: D68787925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145877
Approved by: https://github.com/Skylion007
2025-01-30 00:57:40 +00:00
72699950b0 Copy model before benchmark warmup runs (#145858)
Fixes https://github.com/pytorch/pytorch/issues/144772

The eager warmup runs cause the model to change state, so that when we later export it, the model is different from the one we would export directly out of the box. For some reason, exporting the model with the changed state causes issues, but exporting the initial model is fine. This is why the accuracy checks pass but the performance check fails when exporting.
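
The general idea of snapshotting a model before warmup can be sketched with `copy.deepcopy`. `ToyModel` is a hypothetical stand-in whose call mutates internal state; the actual benchmark code may copy the model differently:

```python
import copy


class ToyModel:
    """Hypothetical model whose forward pass mutates internal state."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1  # state change caused by running the model
        return x * 2


model = ToyModel()
pristine = copy.deepcopy(model)  # snapshot taken before any warmup run

for _ in range(3):  # eager warmup runs mutate `model` only
    model(1)

print(model.calls)     # 3
print(pristine.calls)  # 0
```

Exporting `pristine` instead of `model` then matches what a direct out-of-the-box export would see.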

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145858
Approved by: https://github.com/desertfire
2025-01-30 00:36:33 +00:00
clr
6b41f310c2 config: Support str env variables (#145980)
Summary:
This allows us to use environment variables to set string values. We've added
tests for the specific functionality implemented here. Note that we already
accidentally started setting up configs to use this, so we're just adding the
feature.

Additionally, we're not fully validating the underlying type when we set the
value (and in general, it's more difficult than we would like to do this). Let
me know if people feel strongly, and we can add a PR to do this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145980
Approved by: https://github.com/yushangdi, https://github.com/oulgen
2025-01-30 00:13:02 +00:00
a9ed7bd78e [utilization] pipeline to create clean db records (#145327)
upload_utilization_script to generate db-ready-insert records to s3
- generate two files: metadata and timeseries in ossci-utilization buckets
- convert log record to db format ones
- add unit test job for tools/stats/

Related Prs:
setup composite action for data pipeline: https://github.com/pytorch/pytorch/pull/145310
add permission for composite action to access S3 bucket: https://github.com/pytorch-labs/pytorch-gha-infra/pull/595
add insert logic in s3 replicator: https://github.com/pytorch/test-infra/pull/6217
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145327
Approved by: https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-01-29 23:48:50 +00:00
18a7a04c4a [c10d] Add NCCL memory allocator (#145675)
This PR implements a small UI improvement over #133603.

It prepares a NCCL memory allocator in torch cpp and then pybinds it out, so that users can use it directly.

UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab
2025-01-29 23:20:22 +00:00
b60120d0df Revert "[ATen][CUDA] Implement 128 bit vectorization v2 (#145746)"
This reverts commit 81685d81eb86595d169f55a564da26eaafb2ddf5.

Reverted https://github.com/pytorch/pytorch/pull/145746 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking in trunk. See functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_multi_head_attention_forward_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/13032483748/job/36358184032) [HUD commit link](81685d81eb) ([comment](https://github.com/pytorch/pytorch/pull/145746#issuecomment-2623108958))
2025-01-29 23:02:23 +00:00
521588519d re-use FloorDiv for RShift (#145898)
I encountered this C++ compilation error.
```
  579 |     int64_t var_6 = (static_cast<int64_t>(std::floor((1.0/2.0)*u0)) | static_cast<int64_t>(std::floor((1.0/4.0)*static_cast<int64_t>(std::floor((1.0/2.0)*u0))))) | std::floor((1.0/16.0)*(static_cast<int64_t>(std::floor((1.0/2.0)*u0)) | static_cast<int64_t>(std::floor((1.0/4.0)*static_cast<int64_t>(std::floor((1.0/2.0)*u0))))));
      |                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                                     |                                                                                                         |
      |                                                                     int64_t {aka long int}                                                                                    double
```

Then, I figured out where this std::floor came from with the help of Bob's guard provenance tool. It comes from RShift which is used in `triton.next_power_of_2`.

---
Before, we used `std::floor`
```
int64_t var_6 = (
   static_cast<int64_t>(std::floor((1.0/2.0)*u0)) |
   static_cast<int64_t>(std::floor((1.0/4.0)*static_cast<int64_t>(std::floor((1.0/2.0)*u0)))))
   | std::floor((1.0/16.0)*(static_cast<int64_t>(std::floor((1.0/2.0)*u0))             # no cast to int here.
   | static_cast<int64_t>(std::floor((1.0/4.0)*static_cast<int64_t>(std::floor((1.0/2.0)*u0))))));
```

Now, we use `c10::div_floor_integer` instead
```
int64_t var_6 = (
   (c10::div_floor_integer(static_cast<int64_t>(u0), static_cast<int64_t>(2L))) |
   (c10::div_floor_integer(static_cast<int64_t>(u0), static_cast<int64_t>(8L)))) |
   (c10::div_floor_integer(static_cast<int64_t>((c10::div_floor_integer(static_cast<int64_t>(u0), static_cast<int64_t>(2L)))
   | (c10::div_floor_integer(static_cast<int64_t>(u0), static_cast<int64_t>(8L)))), static_cast<int64_t>(16L)));
```
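
For intuition only, here is a small Python sketch (not the Inductor codegen) of why a right shift can be expressed as an integer floor division. The helper names are made up for this example. The compile error above came from `std::floor` returning `double` in C++; the sketch also shows a separate point, that the floating-point form can lose precision for large integers.

```python
import math


def rshift_as_floordiv(x: int, k: int) -> int:
    """x >> k expressed as integer floor division by 2**k."""
    return x // (1 << k)


def rshift_as_float_floor(x: int, k: int) -> int:
    """The floating-point formulation: floor((1 / 2**k) * x)."""
    return math.floor((1.0 / (1 << k)) * x)


# Integer floor division agrees with >> for negative and positive values.
for x in (-17, -1, 0, 1, 5, 1024):
    for k in range(5):
        assert rshift_as_floordiv(x, k) == x >> k

# Above 2**53 a double cannot represent every integer exactly.
big = 2**62 + 2
print(rshift_as_floordiv(big, 1) == big >> 1)     # True
print(rshift_as_float_floor(big, 1) == big >> 1)  # False
```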

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145898
Approved by: https://github.com/desertfire, https://github.com/bobrenjc93
ghstack dependencies: #145802
2025-01-29 22:50:22 +00:00
3df961d99b give emulate_precision_casts an envar (#145948)
this was requested internally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145948
Approved by: https://github.com/mlazos
2025-01-29 22:43:32 +00:00
2e5886dcc4 Add fake_impl for unique_consecutive (#145649)
Summary:
It's fairly similar to torch.unique and torch.unique_dim.

Test Plan:
New test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145649
Approved by: https://github.com/ezyang, https://github.com/eellison
2025-01-29 22:33:16 +00:00
1e57154af3 Require that all HOPs be imported at import torch time (#145939)
E.g. torch.ops.higher_order.cond does not exist until it is imported,
which is bad if it shows up in an FX graph or is used in some code
somewhere.

This PR also makes some more HOPs get imported at `import torch` time.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145939
Approved by: https://github.com/ydwu4
ghstack dependencies: #145938
2025-01-29 22:27:52 +00:00
2141c1aebe Better hop_db comment; move test to a non-export test file (#145938)
Goal is for people to better test their HOPs.

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145938
Approved by: https://github.com/ydwu4
2025-01-29 22:27:52 +00:00
e02c038a23 [dynamo][benchmarks] Stop benchmarking compile time of dead code (#145590)
FIXES https://github.com/pytorch/pytorch/issues/144775 frfr

See details on the problem: https://github.com/pytorch/pytorch/issues/144775#issuecomment-2611699385
We fixed some silent incorrectness, but as a result fewer nodes are DCE'd. The benchmark iteration loop had some dead code which could contain side effect ops that aren't safe to DCE. The regression is expected.

This PR removes the compile time benchmarking of the dead code, which should reduce the noise of the benchmark and align it with the benchmarking used by performance tests.

New benchmark results:
```python
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips,compilation_latency
cuda,BartForConditionalGeneration,1,pass,897,1,0,0,0,0,0,39.322364  # after https://github.com/pytorch/pytorch/pull/144319
cuda,BartForConditionalGeneration,1,pass,897,1,0,0,0,0,0,38.972257  # before https://github.com/pytorch/pytorch/pull/144319
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145590
Approved by: https://github.com/jansel
ghstack dependencies: #145447
2025-01-29 22:14:47 +00:00
793dfc27e0 [inductor] Add some typing to triton.py (#145688)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145688
Approved by: https://github.com/Skylion007, https://github.com/eellison
ghstack dependencies: #145671, #145695
2025-01-29 21:56:40 +00:00
5db0ad92e3 [inductor] Remove mask_str from IndexingOptions (#145695)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145695
Approved by: https://github.com/eellison
ghstack dependencies: #145671
2025-01-29 21:56:40 +00:00
23ff899164 [inductor] Fix handling of fixed XBLOCK larger than xnumel=1 (#145671)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145671
Approved by: https://github.com/eellison
2025-01-29 21:56:32 +00:00
bb2fb554a9 [BE]: Update CUTLASS submodule to 3.7.0 (#145172)
* This has a couple of new features, but mostly has a lot of bugfixes for the prior releases
* This is the last Hopper-focused release of CUTLASS before blackwell drops, so let's upgrade to it.
* Most of the remaining diff noise is copyright year updates on the CUTLASS submodule
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145172
Approved by: https://github.com/eqy, https://github.com/henrylhtsang
2025-01-29 21:48:01 +00:00
d0aa1386b8 Disable AOTAutogradCache for triton version < 3.2 (#145937)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145937
Approved by: https://github.com/bdhirsh
2025-01-29 21:32:16 +00:00
1185b81c51 Revert "[dynamo] Use polyfill to implement comparison operators (#144485)"
This reverts commit d1f82de2bf4ce4d4461791a9c9b2e759202db0bb.

Reverted https://github.com/pytorch/pytorch/pull/144485 on behalf of https://github.com/huydhn due to This seems to break dynamo tests in trunk after landing ([comment](https://github.com/pytorch/pytorch/pull/144485#issuecomment-2622893294))
2025-01-29 21:30:42 +00:00
953e80936e [linter] Grep linter batches long command (#145950)
If the command is too long, the linter fails with
```
Failed due to OSError:
[Errno 7] Argument list too long: 'grep'
```
Fix this by batching the command so it is shorter

The limit of 750k was chosen because `getconf ARG_MAX` returns ~1M on my Mac. My guess is that most people won't hit this unless they run --all-files and the directory path is long.
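
A rough sketch of the batching idea, assuming each argument costs its length plus one separator byte. `batch_by_length` is a hypothetical helper, not the linter's actual code, and the limit of 40 is only there to keep the example small.

```python
from typing import Iterable, Iterator, List


def batch_by_length(items: Iterable[str], limit: int) -> Iterator[List[str]]:
    """Yield lists of items whose combined length stays within `limit`."""
    batch: List[str] = []
    size = 0
    for item in items:
        cost = len(item) + 1  # +1 for the separator between arguments
        if batch and size + cost > limit:
            yield batch
            batch, size = [], 0
        batch.append(item)
        size += cost
    if batch:
        yield batch


files = [f"dir/file_{i}.py" for i in range(10)]
batches = list(batch_by_length(files, limit=40))
# Each batch would then be passed to one `grep` invocation.
print(len(batches))
```

A single item longer than the limit still gets its own batch, so nothing is dropped.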

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145950
Approved by: https://github.com/wdvr
2025-01-29 21:23:27 +00:00
a6e3f294f1 Don't use mypy daemon in CI (#145961)
This is an attempt to fix flaky mypy errors in CI that look like:

```
dmypy status --verbose
connection_name         : /var/folders/rf/qrn1jkgj0b9_tcznwp8ck46w0000gn/T/tmpjoqsid7_/dmypy.sock
pid                     :      32233
error                   :  timed out
Daemon is stuck; consider /Users/zainr/pytorch/venv/bin/dmypy kill
```

"Fix" it by not using the daemon at all, since it doesn't actually provide any perf benefits in CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145961
Approved by: https://github.com/malfet
2025-01-29 21:15:29 +00:00
40ccb7a86d cpp_wrapper: Move #includes to per-device header files (#145932)
Summary:
This prepares us for the next PR in the stack, where we introduce pre-compiled per-device header files to save compilation time.

Reland https://github.com/pytorch/pytorch/pull/143909 after merge conflicts.

Co-authored-by: Benjamin Glass <bglass@quansight.com>

Differential Revision: D68656960

Pulled By: benjaminglass1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145932
Approved by: https://github.com/yushangdi, https://github.com/benjaminglass1

Co-authored-by: bglass@quansight.com <bglass@quansight.com>
2025-01-29 21:08:45 +00:00
8bd7bf3269 [Inductor-CPU] Add profiling support for codegened flex attention kernels (#145894)
### Summary

`RECORD_FUNCTION` wasn't present in codegened Inductor-CPU Flex Attention C++ kernels, so flex attention kernels weren't present in the PyTorch profiler profiling data.

Fixes #145825 by adding `RECORD_FUNCTION` calls in the codegened flex-attention kernels.

### Caveat

#### _Before_
No corresponding results in PyTorch profiler profiling data

#### _After_

| Inductor config settings |  What kernel name looks like in profiling data | Comments|
|-------------------|------------------------------------|--------------------|
| Env variable `TORCHINDUCTOR_CPP_WRAPPER=1` OR `inductor.config.cpp_wrapper=1` in python code | `graph_x_cpp_fused_y` | No way to tell from the profiling results if the kernel is a GEMM kernel or an attention kernel |
|  `inductor.config.cpp.descriptive_names = "inductor_node"` but not CPP wrapper | `graph_x_kernel` | No way to tell from the profiling results if the kernel is a GEMM kernel or an attention kernel |
| Both `inductor_config.cpp.descriptive_names = "inductor_node"`  & Inductor CPP Wrapper | `graph_x_cpp_fused_flex_attention_y`| Easy to interpret data |
| Neither of the two configs  | `graph_x_kernel`| No way to tell from the profiling results if the kernel is a GEMM kernel or an attention kernel |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145894
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel
2025-01-29 20:54:46 +00:00
bb4964013f Add deterministic kernel for reflection2d (#136241)
Adds feature for #98925

Tests pass for both existing reflectionpad2d and the new one I inserted.

**Summary of the work:**

A simple conditional check for deterministic mode dispatches to a different kernel. This kernel does not use any atomic operations and produces deterministic results: instead of going from output to input (a 1:1 relationship), it does the opposite and goes from each input to all of its outputs (1 to many). These operations are done in the same order on every execution, as I simply traverse the data set with a grid-stride loop and use simple linearized indexing into the input tensor.

So each thread computes the 4 conditionals, which are then used to check whether the input has an output in each of the 8 regions: top left, top, top right, left, right, bottom left, bottom, and bottom right.
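
To illustrate the input -> outputs direction, here is a 1D pure-Python sketch. It is an analogy under simplifying assumptions, not the CUDA kernel: one dimension has only a left and a right reflected region instead of 8, and all function names are invented for the example. The scatter version accumulates into shared slots (where a GPU kernel would need atomics), while the gather version lets each input element sum its own outputs in a fixed order.

```python
def reflect_index(o, n, pad_l):
    """Map an output index to its source input index (reflection padding)."""
    j = o - pad_l
    if j < 0:
        j = -j
    elif j >= n:
        j = 2 * (n - 1) - j
    return j


def backward_scatter(grad_out, n, pad_l):
    """Output -> input: several outputs add into the same input slot."""
    grad_in = [0.0] * n
    for o, g in enumerate(grad_out):
        grad_in[reflect_index(o, n, pad_l)] += g
    return grad_in


def backward_gather(grad_out, n, pad_l, pad_r):
    """Input -> outputs: each input sums its outputs in a fixed order."""
    grad_in = [0.0] * n
    for i in range(n):
        total = grad_out[i + pad_l]  # the unpadded (center) region
        if 1 <= i <= pad_l:  # input is mirrored into the left pad
            total += grad_out[pad_l - i]
        if n - 1 - pad_r <= i <= n - 2:  # mirrored into the right pad
            total += grad_out[pad_l + 2 * (n - 1) - i]
        grad_in[i] = total
    return grad_in


n, pad_l, pad_r = 5, 2, 3  # reflection padding requires pad < n
grad_out = [float(v) for v in range(1, n + pad_l + pad_r + 1)]
print(backward_gather(grad_out, n, pad_l, pad_r))
```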

I did not focus on performance for this PR, as that would expand the scope heavily. If there are any performance questions, though, I can answer them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136241
Approved by: https://github.com/eqy, https://github.com/albanD
2025-01-29 20:34:03 +00:00
2b8c28099a [OSS] Add no dist as an argument to DCP top level apis (#145754)
Summary: No-dist, for a non-distributed checkpoint, was a top level param in the past, but was removed. This was requested back in https://github.com/pytorch/pytorch/issues/125777 and will be needed for our torchtune changes to use DCP

Test Plan: existing tests pass

Differential Revision: D68714246

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145754
Approved by: https://github.com/daulet-askarov
2025-01-29 20:33:37 +00:00
2d5d022594 Fix a number of flexattention issues (cse, cudagraph, etc.) (#145059)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145059
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-01-29 20:27:39 +00:00
6aed6c042e [CD] Install ninja and setuptools from PyPI (#145871)
They, as well as typing extensions, are available from PyPI, so there is no reason to install them from Anaconda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145871
Approved by: https://github.com/Skylion007
2025-01-29 19:47:16 +00:00
b80482988f Revert "[CMake] Find HomeBrew OpenMP on MacOS (#145870)"
This reverts commit c26bb9ba5bd40d256a25436212279bc7e4b436ae.

Reverted https://github.com/pytorch/pytorch/pull/145870 on behalf of https://github.com/malfet due to Want to refine it a bit ([comment](https://github.com/pytorch/pytorch/pull/145870#issuecomment-2622659614))
2025-01-29 19:34:27 +00:00
b52e8d521e Revert "[CD] Install ninja and setuptools from PyPI (#145871)"
This reverts commit eea7d395e5faa9a4be5b60f6668c0bdf5163e3a0.

Reverted https://github.com/pytorch/pytorch/pull/145871 on behalf of https://github.com/malfet due to Want to refine it a bit ([comment](https://github.com/pytorch/pytorch/pull/145870#issuecomment-2622659614))
2025-01-29 19:34:27 +00:00
082fab0fc7 [64-bit] Int64 casting for UpSampleNearest3D (#144865)
Fixes #144855

Follows approach in https://github.com/pytorch/pytorch/pull/141923 to use int64 types to increase INT_MAX limits
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144865
Approved by: https://github.com/eqy
2025-01-29 19:30:09 +00:00
1c9014a135 [export] Add tlparse to draft-export (#145810)
Dependent on https://github.com/ezyang/tlparse/pull/87/files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145810
Approved by: https://github.com/pianpwk
2025-01-29 19:26:00 +00:00
6371c25b91 Revert "[c10d] Add NCCL memory allocator (#145675)"
This reverts commit 9fd6722fc9068eeaa176754acb315fc7e0f6416c.

Reverted https://github.com/pytorch/pytorch/pull/145675 on behalf of https://github.com/ZainRizvi due to This fails to build internally, can you please take a look at D68831004 for more details? ([comment](https://github.com/pytorch/pytorch/pull/145675#issuecomment-2622515425))
2025-01-29 18:30:30 +00:00
e0525dbca9 Revert "inductor.config.descriptive_names = False is not actually supported (#145523)"
This reverts commit edf266e9bbbf6063f7c4a336ffb50234e11a0a82.

Reverted https://github.com/pytorch/pytorch/pull/145523 on behalf of https://github.com/ZainRizvi due to Hi, this breaks type checks internally. Can you please take a look? See D68801083 for details ([comment](https://github.com/pytorch/pytorch/pull/145523#issuecomment-2622510900))
2025-01-29 18:27:44 +00:00
284f217011 Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211)"
This reverts commit 97b3b73f3e96bb8684064715b93c825ba0395475.

Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. @eqy @ezyang can you please help this get remerged? See D68779772. ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2622504898))
2025-01-29 18:24:29 +00:00
0d6343347f Revert "Record inputs at time of tracing, constrain to them for triton fn (#145448)"
This reverts commit a699034eeca8c096c44a690e405a60efa442d4ed.

Reverted https://github.com/pytorch/pytorch/pull/145448 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D68779678 for details ([comment](https://github.com/pytorch/pytorch/pull/145448#issuecomment-2622470810))
2025-01-29 18:07:12 +00:00
1a613c3342 bump counters for unbacked binding names (#145882)
Instead of bumping symint counters when we process unbacked bindings during deserialization, it's better to bump them at the beginning based on what the symbols in the original shape env before serialization were. This allows symbols in unbacked bindings to have "gaps" that bumping alone would not be able to match.

Why is bumping counters important at all? It is because when the shape env coming out of deserialization is used later for propagating symints, say in run_decompositions, we don't want new names to clash with existing names (bad things happen).
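
A simplified sketch of "bump at the beginning", assuming symbols are named with a prefix followed by an integer (for example `u0`, `u3`). `bump_counter` is a hypothetical helper and does not mirror the real shape env API.

```python
import itertools
import re


def bump_counter(existing_names, prefix="u"):
    """Return a counter that starts past every existing `<prefix><int>` name."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$")
    used = []
    for name in existing_names:
        match = pattern.match(name)
        if match:
            used.append(int(match.group(1)))
    start = max(used) + 1 if used else 0
    return itertools.count(start)


# Names in the original shape env may have gaps (u1 and u2 are missing here).
counter = bump_counter(["u0", "u3", "s1"])
fresh = f"u{next(counter)}"
print(fresh)  # u4, which cannot clash with u0 or u3
```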

Differential Revision: [D68798191](https://our.internmc.facebook.com/intern/diff/D68798191/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145882
Approved by: https://github.com/pianpwk
2025-01-29 17:46:21 +00:00
4abff4b271 Introduce cache clearing APIs for the lazy graph executor (#144489)
This PR introduces two new methods to the LazyGraphExecutor class:

- ClearComputationCache(): Allows clearing the entire computation cache.
- RemoveFromComputationCache(hash): Enables removal of specific cache entries based on their hash.

The main objective is to expose cache management functionality for debugging cache hits and misses across different computations. For instance:
- Reset the cache state in tests, allowing reuse of the same computation client to evaluate cache logic consistently.
- Selectively remove cache entries to analyze the impact on subsequent computations.
- Improve observability into the cache behavior, aiding in the investigation of cache-related issues or optimizations.

On the XLA lazy graph executor, we want to run a series of tests that modify some parts of the HLO module proto of the computation, and we need a means to ensure that the hash is agnostic to some elements (OpMetadata in the XLA proto data). Hence, it would be easy to parameterize the test, clear the cache and validate that the resulting hash is the same between runs. Otherwise, we'd need to hardcode the resulting serialized hash.

**Another motivation** is that users could also clear some computation hashes for added flexibility in their applications, by introducing their own custom strategies for maintaining the cache (without relying on the default LRU).
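
As a rough Python analogy (the real methods are C++ on LazyGraphExecutor), the sketch below shows an LRU cache with a clear-all and a remove-by-hash operation. The class and method names here are hypothetical and chosen only to mirror the two operations described above.

```python
from collections import OrderedDict


class ComputationCache:
    """Hypothetical LRU cache keyed by computation hash."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._entries = OrderedDict()

    def add(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used

    def get(self, key):
        if key not in self._entries:
            return None  # cache miss
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def remove(self, key):
        """Analogous to RemoveFromComputationCache(hash)."""
        if key in self._entries:
            del self._entries[key]
            return True
        return False

    def clear(self):
        """Analogous to ClearComputationCache()."""
        self._entries.clear()

    def __len__(self):
        return len(self._entries)


cache = ComputationCache(capacity=2)
cache.add("hash_a", "graph_a")
cache.add("hash_b", "graph_b")
removed = cache.remove("hash_a")  # selectively drop one entry
cache.clear()                     # reset state between tests
print(removed, len(cache))
```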

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144489
Approved by: https://github.com/wconstab
2025-01-29 17:38:01 +00:00
d1f82de2bf [dynamo] Use polyfill to implement comparison operators (#144485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144485
Approved by: https://github.com/jansel
2025-01-29 17:37:40 +00:00
3e135993bd Update mi300 labels to account for multiple clusters. (#145923)
We now have multiple Kubernetes clusters of mi300x resources, and this commit updates labels accordingly to target both clusters evenly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145923
Approved by: https://github.com/jeffdaily
2025-01-29 16:56:43 +00:00
4499d60d56 [dynamo][builin-skipfiles-cleanup] Remove types (#145909)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145909
Approved by: https://github.com/zou3519
ghstack dependencies: #145856, #145875, #145878, #145892
2025-01-29 16:47:02 +00:00
ed141d7d1a dont assign a size to _assert_scalar in partitioner (#143877)
Fixes https://github.com/pytorch/pytorch/issues/143876

Open to other suggestions - we have an invariant that all nodes in our ATen graphs should have a `meta['val']` field, but I don't think this is actually true in all cases, so I just hardcoded the invariant to ignore `_assert_scalar()` (which is a "special" op used in dynamic shapes for runtime asserts, and doesn't have a meta['val'] field)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143877
Approved by: https://github.com/zou3519
2025-01-29 16:21:37 +00:00
3b3aac0cde Filter out iGPU if dGPU is found on XPU (#144378)
# Motivation
for https://github.com/pytorch/pytorch/issues/143914
On Windows, there are two separate SYCL platforms for iGPU and dGPU. To simplify the logic, we will exclude iGPUs when a dGPU is present. This ensures that all XPU devices enumerated by PyTorch share the same SYCL context.

The logic is now generalized as follows:
1. We find the first L0 platform containing at least one dGPU and enumerate all dGPUs of that platform.
2. If no dGPU is found, we find the first L0 platform containing iGPU and enumerate all iGPUs of that platform.
3. No GPU is found (neither iGPU nor dGPU).
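
The three steps above can be sketched as follows. This is illustrative Python, not the SYCL/C++ implementation; `Device` and `select_devices` are hypothetical names, and each inner list stands for one L0 platform.

```python
from typing import List, NamedTuple


class Device(NamedTuple):
    name: str
    integrated: bool  # True for an iGPU, False for a dGPU


def select_devices(platforms: List[List[Device]]) -> List[Device]:
    """Pick the devices to enumerate, preferring dGPUs over iGPUs."""
    # 1. First platform with at least one dGPU: take all of its dGPUs.
    for devices in platforms:
        dgpus = [d for d in devices if not d.integrated]
        if dgpus:
            return dgpus
    # 2. Otherwise, first platform with an iGPU: take all of its iGPUs.
    for devices in platforms:
        igpus = [d for d in devices if d.integrated]
        if igpus:
            return igpus
    # 3. No GPU found.
    return []


platforms = [
    [Device("igpu0", True)],
    [Device("dgpu0", False), Device("dgpu1", False)],
]
print([d.name for d in select_devices(platforms)])
```

Because every returned device comes from a single platform, they can share one context, which is the property the description relies on.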
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144378
Approved by: https://github.com/EikanWang, https://github.com/gujinghui
2025-01-29 15:53:16 +00:00
5e5da9bd9a [triton] Update pin to tip of 3.2 release (#145867)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145867
Approved by: https://github.com/Skylion007, https://github.com/htyu, https://github.com/exclamaforte
2025-01-29 15:17:58 +00:00
81685d81eb [ATen][CUDA] Implement 128 bit vectorization v2 (#145746)
This is a re-base PR to my previous one #141959.

Description from the original PR:

This PR implements 128-bit vectorization. It improves the performance of contiguous elementwise ops by 4-10% on Hopper H100.

<details>

<summary>The benchmark code used </summary>

```Python

import time
import torch
from torch.profiler import profile, ProfilerActivity

def benchmark(function, dtype=torch.float32, check_numerics=True, print_profile=False):
    device = torch.device("cuda")

    shapes = []
    for p in range(24, 30):
        shape = 1<<p
        shapes.append(shape)

    for shape in shapes:
        for _ in range(6):
            x = torch.randn(shape, device=device, dtype=dtype)
            y = function(x)

        if print_profile:
            x = torch.randn(shape, device=device, dtype=dtype)
            with profile(activities=[ProfilerActivity.CUDA], record_shapes=True) as prof:
                y = function(x)
            print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

        x = torch.randn(shape, device=device, dtype=dtype)
        torch.cuda.synchronize()
        t1 = time.perf_counter()
        for _ in range(6):
            y = function(x)
        torch.cuda.synchronize()
        t2 = time.perf_counter()
        perf_time = (t2 - t1) / 6

        print(f"{function.__name__}, {dtype}, {shape}, {perf_time}")
        if check_numerics:
            x_cpu = x.cpu()
            y_cpu = function(x_cpu).cuda()
            try:
                torch.testing.assert_allclose(y_cpu, y)
            except AssertionError as error:
                print("An exception occurred:", error)

def main():
    ops = [
            torch.relu,
            torch.sigmoid,
            torch.tanh,
            torch.nn.functional.gelu,
            torch.sin,
            torch.exp,
    ]

    dtypes = [
            torch.float16,
            torch.bfloat16,
            torch.float32,
    ]

    for op in ops:
        for dtype in dtypes:
            benchmark(op, dtype=dtype)
            torch.cuda.empty_cache()

if __name__ == "__main__":
    main()
```

</details>

<details>

<summary> Results </summary>

| op | dtype | size | time after (s) | time before (s) | % improvement |
| ---- | ---- | ---- | ---- | ---- | ---- |
| relu | torch.float16 | 33554432 | 4.84E-05 | 5.06E-05 | 4.66296539127052 |
| relu | torch.float16 | 67108864 | 9.22E-05 | 9.64E-05 | 4.56491432752297 |
| relu | torch.float16 | 134217728 | 0.000180343495837102 | 0.000187981834945579 | 4.23543919508829 |
| relu | torch.float16 | 268435456 | 0.000355071155354381 | 0.000370856161074092 | 4.44558942107169 |
| relu | torch.float16 | 536870912 | 0.000704489842367669 | 0.000736006341564159 | 4.47366268483987 |
| relu | torch.bfloat16 | 16777216 | 3.03E-05 | 3.04E-05 | 0.166504085842689 |
| relu | torch.bfloat16 | 33554432 | 4.89E-05 | 5.06E-05 | 3.45848238875716 |
| relu | torch.bfloat16 | 67108864 | 9.32E-05 | 9.65E-05 | 3.56122651631445 |
| relu | torch.bfloat16 | 134217728 | 0.000180805509444326 | 0.000187998676362137 | 3.97840029317567 |
| relu | torch.bfloat16 | 268435456 | 0.000356242332297067 | 0.000371279485989362 | 4.22104627356745 |
| relu | torch.bfloat16 | 536870912 | 0.000708114336399982 | 0.000736773828975856 | 4.04729732229083 |
| relu | torch.float32 | 16777216 | 5.61E-05 | 5.61E-05 | 0.0442587268354941 |
| relu | torch.float32 | 33554432 | 9.33E-05 | 9.30E-05 | -0.259070913799022 |
| relu | torch.float32 | 67108864 | 0.000181321326332788 | 0.000181289506144822 | -0.0175490597877115 |
| relu | torch.float32 | 134217728 | 0.000356896334172537 | 0.000356570177245885 | -0.0913870206618981 |
| relu | torch.float32 | 268435456 | 0.000709421835684528 | 0.000707465515006334 | -0.275762681635911 |
| relu | torch.float32 | 536870912 | 0.00141372415237129 | 0.00141036518228551 | -0.237597276678471 |
| sigmoid | torch.float16 | 16777216 | 3.10E-05 | 3.16E-05 | 2.10012593866895 |
| sigmoid | torch.float16 | 33554432 | 4.91E-05 | 5.23E-05 | 6.37710600666122 |
| sigmoid | torch.float16 | 67108864 | 9.30E-05 | 0.000100057009452333 | 7.61866144555331 |
| sigmoid | torch.float16 | 134217728 | 0.000180928347011407 | 0.000194982004662355 | 7.76752669390248 |
| sigmoid | torch.float16 | 268435456 | 0.000355658994521946 | 0.00038468533117945 | 8.16128288742412 |
| sigmoid | torch.float16 | 536870912 | 0.000705982849467546 | 0.000764021339515845 | 8.22094900634937 |
| sigmoid | torch.bfloat16 | 16777216 | 3.08E-05 | 3.17E-05 | 2.90965915673149 |
| sigmoid | torch.bfloat16 | 33554432 | 4.87E-05 | 5.24E-05 | 7.63503884668234 |
| sigmoid | torch.bfloat16 | 67108864 | 9.33E-05 | 0.000100019678939134 | 7.21238137428013 |
| sigmoid | torch.bfloat16 | 134217728 | 0.000180786165098349 | 0.000194868014659733 | 7.78922964250206 |
| sigmoid | torch.bfloat16 | 268435456 | 0.000355564659306159 | 0.000384909333661199 | 8.25297835063321 |
| sigmoid | torch.bfloat16 | 536870912 | 0.000705831005082776 | 0.000764102345177283 | 8.2557070566308 |
| sigmoid | torch.float32 | 16777216 | 4.93E-05 | 5.65E-05 | 14.5314136197766 |
| sigmoid | torch.float32 | 33554432 | 9.32E-05 | 9.31E-05 | -0.120169865610833 |
| sigmoid | torch.float32 | 67108864 | 0.000181328505277634 | 0.000180455681402236 | -0.481349512069855 |
| sigmoid | torch.float32 | 134217728 | 0.000357362829769651 | 0.000356093340087682 | -0.35523831137877 |
| sigmoid | torch.float32 | 268435456 | 0.000708921831877281 | 0.000707052337626616 | -0.263709504574663 |
| sigmoid | torch.float32 | 536870912 | 0.00141358317341656 | 0.0014090768333214 | -0.318788464654745 |
| tanh | torch.float16 | 16777216 | 3.03E-05 | 3.03E-05 | -0.0912564658661808 |
| tanh | torch.float16 | 33554432 | 4.90E-05 | 5.07E-05 | 3.46644442974484 |
| tanh | torch.float16 | 67108864 | 9.30E-05 | 9.68E-05 | 3.99871369815531 |
| tanh | torch.float16 | 134217728 | 0.00018052199933057 | 0.000188717152923346 | 4.53969799978138 |
| tanh | torch.float16 | 268435456 | 0.000355684508879979 | 0.000373026006855071 | 4.8755280430115 |
| tanh | torch.float16 | 536870912 | 0.000706660988119741 | 0.000740105014604827 | 4.73268328765002 |
| tanh | torch.bfloat16 | 16777216 | 2.99E-05 | 3.03E-05 | 1.21049563135981 |
| tanh | torch.bfloat16 | 33554432 | 4.89E-05 | 5.06E-05 | 3.48836101041744 |
| tanh | torch.bfloat16 | 67108864 | 9.28E-05 | 9.69E-05 | 4.39944918036626 |
| tanh | torch.bfloat16 | 134217728 | 0.000180710999605556 | 0.000189167990659674 | 4.67984299382829 |
| tanh | torch.bfloat16 | 268435456 | 0.000356062994493792 | 0.000372666652159144 | 4.66312363882606 |
| tanh | torch.bfloat16 | 536870912 | 0.000707100164921333 | 0.000740134331863374 | 4.67178040408393 |
| tanh | torch.float32 | 16777216 | 5.61E-05 | 5.64E-05 | 0.439595755746353 |
| tanh | torch.float32 | 33554432 | 9.31E-05 | 9.31E-05 | 0.00287633090228212 |
| tanh | torch.float32 | 67108864 | 0.000181465332085888 | 0.000180895323865116 | -0.31411411437098 |
| tanh | torch.float32 | 134217728 | 0.000356963835656643 | 0.000356073161431899 | -0.249513854283251 |
| tanh | torch.float32 | 268435456 | 0.000709201170442005 | 0.00070707315656667 | -0.300057862849997 |
| tanh | torch.float32 | 536870912 | 0.00141367283261692 | 0.00141030051357423 | -0.238550176877922 |
| gelu | torch.float16 | 16777216 | 2.73E-05 | 3.17E-05 | 15.921079070745 |
| gelu | torch.float16 | 33554432 | 5.06E-05 | 5.55E-05 | 9.76345374333098 |
| gelu | torch.float16 | 67108864 | 9.65E-05 | 0.000106600326641152 | 10.4308039074712 |
| gelu | torch.float16 | 134217728 | 0.000187776672343413 | 0.000208565829476962 | 11.0712139447915 |
| gelu | torch.float16 | 268435456 | 0.000370216167842348 | 0.000412251994324227 | 11.3544005187205 |
| gelu | torch.float16 | 536870912 | 0.000737301345604161 | 0.000819394170927505 | 11.1342296895002 |
| gelu | torch.bfloat16 | 16777216 | 3.02E-05 | 3.08E-05 | 1.78405479367653 |
| gelu | torch.bfloat16 | 33554432 | 5.13E-05 | 5.69E-05 | 10.9929393318302 |
| gelu | torch.bfloat16 | 67108864 | 9.76E-05 | 0.00010968199543034 | 12.3420807512356 |
| gelu | torch.bfloat16 | 134217728 | 0.000189661824454864 | 0.000214487663470209 | 13.0895287371091 |
| gelu | torch.bfloat16 | 268435456 | 0.000374197009174774 | 0.000423670164309442 | 13.2211519391275 |
| gelu | torch.bfloat16 | 536870912 | 0.000743675006863972 | 0.000842577001700799 | 13.299088166737 |
| gelu | torch.float32 | 16777216 | 5.06E-05 | 5.04E-05 | -0.413385894716413 |
| gelu | torch.float32 | 33554432 | 9.31E-05 | 9.32E-05 | 0.134157041722546 |
| gelu | torch.float32 | 67108864 | 0.000181480175039421 | 0.000180836669945469 | -0.354586992112075 |
| gelu | torch.float32 | 134217728 | 0.000356874331676712 | 0.000356305002545317 | -0.159532104402047 |
| gelu | torch.float32 | 268435456 | 0.000708909006789327 | 0.000706991491218408 | -0.270488250615287 |
| gelu | torch.float32 | 536870912 | 0.00141321367118508 | 0.00140937082081412 | -0.271922813181618 |
| sin | torch.float16 | 16777216 | 3.04E-05 | 3.11E-05 | 2.21834939018859 |
| sin | torch.float16 | 33554432 | 4.85E-05 | 5.23E-05 | 7.72165512511596 |
| sin | torch.float16 | 67108864 | 9.31E-05 | 9.98E-05 | 7.24947099480072 |
| sin | torch.float16 | 134217728 | 0.000180371008658161 | 0.000194791161144773 | 7.99471744039613 |
| sin | torch.float16 | 268435456 | 0.000355454161763191 | 0.000384903668115536 | 8.28503630574026 |
| sin | torch.float16 | 536870912 | 0.000705183832906187 | 0.000764360166310022 | 8.39161799270973 |
| sin | torch.bfloat16 | 16777216 | 3.11E-05 | 3.10E-05 | -0.257677954940036 |
| sin | torch.bfloat16 | 33554432 | 4.89E-05 | 5.24E-05 | 7.34808420323539 |
| sin | torch.bfloat16 | 67108864 | 9.26E-05 | 0.000100248667877167 | 8.22347488801205 |
| sin | torch.bfloat16 | 134217728 | 0.000180674154156198 | 0.00019567032965521 | 8.30012215584937 |
| sin | torch.bfloat16 | 268435456 | 0.000355360486234228 | 0.000386023331278314 | 8.62865913118873 |
| sin | torch.bfloat16 | 536870912 | 0.00070483615854755 | 0.000766805159704139 | 8.79197248964745 |
| sin | torch.float32 | 16777216 | 5.67E-05 | 5.64E-05 | -0.441348534920039 |
| sin | torch.float32 | 33554432 | 9.34E-05 | 9.30E-05 | -0.496458540364117 |
| sin | torch.float32 | 67108864 | 0.000181706990891447 | 0.000180556671693921 | -0.633062708199702 |
| sin | torch.float32 | 134217728 | 0.000356894995396336 | 0.000356046327700218 | -0.237791985616354 |
| sin | torch.float32 | 268435456 | 0.000708777321657787 | 0.000707602652255446 | -0.165731798471427 |
| sin | torch.float32 | 536870912 | 0.00141263716310884 | 0.00140912582476934 | -0.248566187496451 |
| exp | torch.float16 | 16777216 | 3.00E-05 | 3.04E-05 | 1.40099098901014 |
| exp | torch.float16 | 33554432 | 4.86E-05 | 5.03E-05 | 3.44611943643906 |
| exp | torch.float16 | 67108864 | 9.37E-05 | 9.55E-05 | 1.96412400380129 |
| exp | torch.float16 | 134217728 | 0.000180913504057874 | 0.000187193179347863 | 3.47109262113439 |
| exp | torch.float16 | 268435456 | 0.00035607748820136 | 0.000369079003576189 | 3.65131630210701 |
| exp | torch.float16 | 536870912 | 0.000707551507124056 | 0.000732363162872692 | 3.50669251620789 |
| exp | torch.bfloat16 | 16777216 | 2.98E-05 | 3.04E-05 | 1.74345594341654 |
| exp | torch.bfloat16 | 33554432 | 4.88E-05 | 5.04E-05 | 3.40217856534821 |
| exp | torch.bfloat16 | 67108864 | 9.32E-05 | 9.62E-05 | 3.29219958210226 |
| exp | torch.bfloat16 | 134217728 | 0.000180999826019009 | 0.000187239318620414 | 3.44723679499521 |
| exp | torch.bfloat16 | 268435456 | 0.000355944503098726 | 0.000369370992605885 | 3.77207384585864 |
| exp | torch.bfloat16 | 536870912 | 0.000707135167128096 | 0.000733066000975668 | 3.66702648277075 |
| exp | torch.float32 | 16777216 | 4.89E-05 | 5.63E-05 | 15.1245314346532 |
| exp | torch.float32 | 33554432 | 9.34E-05 | 9.31E-05 | -0.259945454477446 |
| exp | torch.float32 | 67108864 | 0.000181152504713585 | 0.000180474346658836 | -0.374357536939058 |
| exp | torch.float32 | 134217728 | 0.000356771342922002 | 0.000355627329554409 | -0.3206573034212 |
| exp | torch.float32 | 268435456 | 0.000708404501589636 | 0.00070713268360123 | -0.179532736671163 |
| exp | torch.float32 | 536870912 | 0.00141283582585553 | 0.00140944866385932 | -0.23974208002295 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145746
Approved by: https://github.com/eqy, https://github.com/ngimel
2025-01-29 13:32:59 +00:00
354fe48db9 Add magma cuda build 12.8 (#145765)
https://github.com/pytorch/pytorch/issues/145570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145765
Approved by: https://github.com/malfet
2025-01-29 08:43:38 +00:00
501c5972f0 [pytorch] raise exception when calling dim order on sparse tensor (#145888)
This diff introduces a change to the PyTorch library that raises an exception when calling the `dim_order` method on a sparse tensor.

Differential Revision: [D68797044](https://our.internmc.facebook.com/intern/diff/D68797044/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145888
Approved by: https://github.com/Jack-Khuu
2025-01-29 06:15:44 +00:00
2e8c080ab1 [inductor][4/N] triton support post-#5512, fix constexpr signatures (#145583)
Prior to this PR, constexprs were appearing in signatures as `{.. "XBLOCK : tl.constexpr": "constexpr"}` when they really should appear as `{.. "XBLOCK": "constexpr"}`.

This PR represents the argument names as ArgName objects, which can optionally be marked as constexpr.
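
A minimal sketch of the idea, not the inductor code itself: the `ArgName` dataclass, its `is_constexpr` field, and the `build_signature` helper below are hypothetical names used only for illustration, and the type strings are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArgName:
    """Hypothetical stand-in for an argument name that may be a constexpr."""

    name: str
    is_constexpr: bool = False

    def full_name(self) -> str:
        # Text used when rendering the kernel's parameter list.
        return f"{self.name} : tl.constexpr" if self.is_constexpr else self.name


def build_signature(args, types):
    # Key the signature by the bare name so the annotation never leaks into it.
    return {
        arg.name: ("constexpr" if arg.is_constexpr else ty)
        for arg, ty in zip(args, types)
    }


args = [ArgName("in_ptr0"), ArgName("xnumel"), ArgName("XBLOCK", is_constexpr=True)]
signature = build_signature(args, ["*fp32", "i32", None])
print(signature)
```

Keeping the constexpr marker as a flag rather than as part of the string means the parameter list and the signature dict can each render the name the way they need.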

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145583
Approved by: https://github.com/jansel
2025-01-29 05:46:05 +00:00
3f77002b96 [dynamo][builtin-skipfiles-cleanup] remove abc, enum, importlib (#145892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145892
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi
ghstack dependencies: #145856, #145875, #145878
2025-01-29 05:30:06 +00:00
236793684d [dynamo][builtin-skipfiles-cleanup] Remove threading, _collections_abc, _weakrefset, threading (#145878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145878
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi
ghstack dependencies: #145856, #145875
2025-01-29 05:30:06 +00:00
a479656cd2 [dynamo][builtin-skipfiles-removal] Remove logging (#145875)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145875
Approved by: https://github.com/williamwen42
ghstack dependencies: #145856
2025-01-29 05:29:58 +00:00
64ee57847b [dynamo][builtin-skipfiles-cleanup] Remove some builtins (#145856)
[dynamo][builtin-skipfiles-cleanup] Remove more builtins

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145856
Approved by: https://github.com/zou3519
2025-01-29 05:29:47 +00:00
7178b827d7 PEP585: Missed conversions (#145342)
Differential Revision: [D68785969](https://our.internmc.facebook.com/intern/diff/D68785969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145342
Approved by: https://github.com/bobrenjc93
2025-01-29 05:24:36 +00:00
8696e59ae2 add test for capture_dynamic_output_shape_ops=True changing expected output between eager and compiled versions (#145821)
Followup from https://github.com/pytorch/pytorch/issues/130290

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145821
Approved by: https://github.com/eellison, https://github.com/ezyang
2025-01-29 04:36:32 +00:00
776bdb962c [ONNX] Support subgraphs with 1+ outputs (#145860)
Fixed a bug in _handle_output_node where additional output values were not added as graph outputs

Fixes #145734
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145860
Approved by: https://github.com/titaiwangms
2025-01-29 04:13:23 +00:00
cyy
fd515e4f59 Fix C++20 Wambiguous-reversed-operator warnings (#144126)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144126
Approved by: https://github.com/albanD
2025-01-29 03:13:57 +00:00
90a6db4a9c [be][pytorch] Fix backend in autocast (#145859)
Summary: fixing backend typo (BAKCNEDS -> BACKENDS)

Test Plan: ci

Differential Revision: D68573324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145859
Approved by: https://github.com/jvandebon
2025-01-29 03:13:08 +00:00
9be2e88d41 Fix lowering to inductor IR for triton CPU (#144389)
Example failing test:
 `pytest -s test_torchinductor_opinfo.py  -k test_comprehensive_special_polygamma_special_polygamma_n_0_cpu_float32` when using triton CPU.

Failure:
```shell
triton.compiler.errors.CompilationError: at 10:11:
def triton_poi_fused_polygamma_0(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 25
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = 1.0
    tl.static_assert(tmp1.dtype == tl.float32)
    tmp2 = ops.polygamma(tmp1, tmp0)
           ^
NameError('ops is not defined')
```
This occurs because the registered triton fallbacks are not used during the lowering to inductor IR.

Marked the problematic code in the excerpt below from 6bc17b0725/torch/_inductor/lowering.py (L572)

```python
def make_pointwise(
    fn,
    override_return_dtype=None,
    override_device=None,
    override_fn_when_input_bool=None,
    override_fn_when_gpu_float64=None,
    allow_alpha=False,
    triton_fallback=None,
):
    def inner(*inputs: TensorBox, alpha=None):
        if triton_fallback is not None and any(
            isinstance(inp, IRNode) and is_triton(inp) for inp in inputs  # <--- is_triton should return True when using triton CPU
        ):
            assert not allow_alpha  # not implemented
            return triton_fallback(*inputs)

        inputs = promote_constants(inputs, override_return_dtype)
        if allow_alpha:
            if alpha is not None and alpha != 1:
                inputs = list(inputs)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144389
Approved by: https://github.com/jansel
2025-01-29 03:10:53 +00:00
50f834f134 [export] allow bit shift builtin ops (#145802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145802
Approved by: https://github.com/pianpwk
2025-01-29 03:05:48 +00:00
f4ca98950e Add CUDA 12.8 libtorch image (#145789)
https://github.com/pytorch/pytorch/issues/145570

Builds 12.8 libtorch docker/deprecate 12.1 meanwhile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145789
Approved by: https://github.com/nWEIdia, https://github.com/atalman
2025-01-29 02:59:37 +00:00
9330b6d098 Added swizzle searching, disabled fp16 accum, and enabled ping-pong for cutlass (#144829)
Summary:

Test Plan:

Differential Revision: [D68751149](https://our.internmc.facebook.com/intern/diff/D68751149)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144829
Approved by: https://github.com/Chillee
2025-01-29 02:52:55 +00:00
9fd6722fc9 [c10d] Add NCCL memory allocator (#145675)
This PR implements a small UI improvement over #133603.

It prepares an NCCL memory allocator in torch C++ and then exposes it via pybind, so that users can use it directly.

UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab
2025-01-29 02:48:56 +00:00
29521256e1 [Customized Optimus][Inductor] Add split cat pattern in aten level (#145721)
Summary:
Thanks Microve for discovering that recGPT has some repeated similar kernels that might be optimized through optimus. After investigation, I designed a pattern in the aten level to remove such excessive kernels.

trace: https://fburl.com/perfdoctor/82fauil7
tlparse: https://fburl.com/98q6tadx

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_cat_post_grad
```

Buck UI: https://www.internalfb.com/buck2/e8458d63-b8ca-498b-a731-77a83fb4d1cb
Test UI: https://www.internalfb.com/intern/testinfra/testrun/16325548715106567
Network: Up: 341KiB  Down: 359KiB  (reSessionID-7d3de666-7fc1-4988-8d11-d75ba958016d)
Executing actions. Remaining     0/3
Command: test.     Finished 2 local
Time elapsed: 3:04.8s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# local run

```
buck2 run @//mode/opt aps_models/ads/recgpt_exp:recgpt_launcher -- mode=local_recgpt_ranking_30x_v0_unified_seq_1115
```

https://www.internalfb.com/mlhub/pipeline/1630903954173593

# E2E

```
buck2 run @//mode/opt aps_models/ads/recgpt_exp:recgpt_launcher -- mode=mast_recgpt_ranking_30x_v0_unified_seq_1115 launcher.oncall=ads_model_platform launcher.data_project=ai_large_scale launcher.fbl_entitlement=ads_global_tc_training_efficiency launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization] launcher.hardware=SMC_T20 launcher.job_name=recgpt_ranking_1115_pt2_with_optimus data_loader.dataset.table_ds=[2024-12-13,2024-12-14,2024-12-15,2024-12-16,2024-12-17,2024-12-18]
```

### how to add the config
Add the following patterns to the dynamo config

```
        post_grad_fusion_options: {
          "normalization_aten_pass": {},
          "split_cat_aten_pass": {},
        }
```

{F1974700331}

baseline:
aps-recgpt_ranking_1115_pt2_5-8cb4905c7d

{F1974700216}

proposal:

Differential Revision: D68695717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145721
Approved by: https://github.com/Yuzhen11
2025-01-29 01:59:06 +00:00
331f49057d Removes threadfence from topk kernel to improve AMD performance (#145536)
Also marginally improves cuda perf

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145536
Approved by: https://github.com/eqy
2025-01-29 01:29:15 +00:00
6f5c8fb128 [DTensor] Add pointwise ops strategy for aten.minimum (#145816)
Need it for Shampoo optimizer.
9c5700ad5e/matrix_functions.py (L240-L242)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145816
Approved by: https://github.com/XilunWu
2025-01-29 01:19:01 +00:00
15e37e4253 [export] don't always print GM in serdes logging (#145857)
Summary: Didn't realize print_readable() would also print and not just return string

Test Plan: .

Differential Revision: D68781525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145857
Approved by: https://github.com/angelayi, https://github.com/yiming0416
2025-01-29 01:03:02 +00:00
a24b25942a Fix RMSNorm epsilon value type for BF16 or FP16 (#142848)
Fixes #140092

Here's what this PR does:

Case 1: no `eps`  is passed to python frontend:
Use the `eps` associated with `opmath_t` instead of the `eps` associated with `scalar_t` for intermediate computation

Case 2: `eps` is passed to python frontend
Avoid downcasting `eps` to `scalar_t` and then upcasting it again implicitly in the `rqrst_input` computation
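
A rough pure-Python illustration of why the choice of `eps` matters (a sketch under stated assumptions, not the ATen kernel): the `EPS` table holds the machine epsilons of the three dtypes, and `rms_norm` is a hypothetical helper that computes `x / sqrt(mean(x^2) + eps)` in Python floats.

```python
import math

# Machine epsilon per dtype: 2 ** -(number of fraction bits).
EPS = {"float16": 2.0**-10, "bfloat16": 2.0**-7, "float32": 2.0**-23}


def rms_norm(values, eps):
    # y_i = x_i / sqrt(mean(x^2) + eps)
    mean_sq = sum(v * v for v in values) / len(values)
    return [v / math.sqrt(mean_sq + eps) for v in values]


x = [0.01, -0.02, 0.03]
with_input_eps = rms_norm(x, EPS["bfloat16"])  # eps of the low-precision input dtype
with_opmath_eps = rms_norm(x, EPS["float32"])  # eps of the wider compute dtype
print(with_input_eps[0], with_opmath_eps[0])
```

For small inputs the bfloat16 epsilon dominates `mean(x^2)`, so the output is scaled down noticeably, while the float32 epsilon leaves the normalized values close to unit RMS.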

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142848
Approved by: https://github.com/albanD
2025-01-29 01:01:44 +00:00
ae0f305bf9 [inductor] Make triton kernel autotune config defaults backward-compatible (#145494)
If a model was torch.packaged using triton<=3.1, any user-defined
autotuned kernels will have reps/warmups burned in with the old defaults
(100/25).  If this model is loaded with triton>=3.2, inductor's checks for
unsupported non-default autotune args will fail, because triton.Autotuner's
defaults for these parameters have changed to `None`.  Let's explicitly support
those values for backward compatibility with these older models.
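
A small sketch of the kind of check this implies, assuming the parameters are named `warmup` and `rep`; the `OLD_DEFAULTS` table and both helper functions are hypothetical and are not inductor's actual code.

```python
# Old defaults (triton<=3.1) that may be burned into packaged kernels,
# taken from the description above; newer triton uses None for both.
OLD_DEFAULTS = {"warmup": 25, "rep": 100}


def is_default_autotune_arg(name, value):
    """Return True if `value` should be treated as left at its default."""
    return value is None or value == OLD_DEFAULTS.get(name)


def unsupported_autotune_args(kwargs):
    # Collect only the arguments whose values are genuinely non-default.
    return sorted(k for k, v in kwargs.items() if not is_default_autotune_arg(k, v))


print(unsupported_autotune_args({"warmup": 25, "rep": 100}))   # old defaults
print(unsupported_autotune_args({"warmup": None, "rep": None}))  # new defaults
print(unsupported_autotune_args({"warmup": 10, "rep": 100}))   # user override
```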

Differential Revision: [D68561014](https://our.internmc.facebook.com/intern/diff/D68561014/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145494
Approved by: https://github.com/aorenste
2025-01-29 00:31:39 +00:00
9036a22c83 [Inductor][Triton] Change propagated dtype for fp16/bf16 unwrapped 0d tensors (#145613)
Fixes TestInductorOpInfoCPU.test_comprehensive_max_binary_cpu_float16 and related tests for Triton CPU. TestInductorOpInfoCPU is currently not run in the CI. See https://github.com/pytorch/pytorch/pull/144389#issuecomment-2608050755 for some additional context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145613
Approved by: https://github.com/davidberard98, https://github.com/eellison, https://github.com/jansel
2025-01-29 00:23:44 +00:00
2f24f2eb46 Make sure to evaluate annotation strings in the context of where the prototype was created (#145667)
This was incorrectly evaluating the annotation in the context of infer_schema - make sure to evaluate annotation strings in the context of where the prototype was created instead.
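
To illustrate the difference in plain Python (a sketch, not the `infer_schema` implementation): the hypothetical `resolve_annotations` helper below evaluates string annotations against `fn.__globals__`, i.e. the namespace where the function was defined, rather than the namespace of the code doing the resolving.

```python
def resolve_annotations(fn):
    # fn.__globals__ is the namespace of the module where fn was defined,
    # not the namespace of whoever calls resolve_annotations.
    return {
        name: eval(ann, fn.__globals__) if isinstance(ann, str) else ann
        for name, ann in fn.__annotations__.items()
    }


# Simulate a separate module: `Payload` exists only in that namespace.
defining_module_source = '''
class Payload:
    pass

def prototype(x: "Payload") -> "int":
    ...
'''
namespace = {}
exec(defining_module_source, namespace)

hints = resolve_annotations(namespace["prototype"])
print(hints)
```

Evaluating `"Payload"` in the resolver's own globals would raise `NameError` here, which is the kind of failure that comes from using the wrong context.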

Fixes #145481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145667
Approved by: https://github.com/zou3519
2025-01-29 00:14:45 +00:00
82859f6185 [associative_scan] scan dim handling in user-facing associative_scan() (#139864)
This PR implements the user-facing dim change, i.e., that the scan dim provided by the user is always moved to dim 0 and then the associative_scan operation always operates on dim 0.
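
The dim handling can be sketched on 2-D nested lists in plain Python. This is only an illustration of "move the scan dim to 0, scan, move it back"; `transpose`, `scan_dim0`, and `associative_scan_2d` are hypothetical helpers, not the torch implementation.

```python
import operator
from itertools import accumulate


def transpose(rows):
    # Swap dim 0 and dim 1 of a rectangular 2-D nested list.
    return [list(col) for col in zip(*rows)]


def scan_dim0(combine_fn, rows):
    # Scan along dim 0: each output row combines all rows up to and including it.
    return list(
        accumulate(rows, lambda acc, row: [combine_fn(a, b) for a, b in zip(acc, row)])
    )


def associative_scan_2d(combine_fn, rows, dim):
    if dim == 0:
        return scan_dim0(combine_fn, rows)
    # Move the scan dim to 0, scan there, then move it back.
    return transpose(scan_dim0(combine_fn, transpose(rows)))


data = [[1, 2, 3], [4, 5, 6]]
print(associative_scan_2d(operator.add, data, dim=0))
print(associative_scan_2d(operator.add, data, dim=1))
```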

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139864
Approved by: https://github.com/ydwu4
2025-01-28 23:58:10 +00:00
7ca156f0ee partitioner: avoid inserting duplicates into heap (#145082)
Fixes https://github.com/pytorch/pytorch/issues/145081

This looks like it was a source of quadratic compile times in the torchtitan CP graphs. There's some code in the partitioner that iteratively adds users of a node to a heap, and pops the earliest user. If you have long parallel chains of fusible ops that all eventually feed into some shared ops, then this can result in:
(1) a node getting added to the heap many times
(2) each time we pop that node, we add (duplicates of) each of that node's users to the heap
(3) repeat with each user
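
The fix can be sketched with `heapq` and a set of already-queued nodes. This is a simplified model, not the partitioner's code: `visit_in_order`, the `users` map, and the `order` map are all hypothetical.

```python
import heapq


def visit_in_order(start, users, order):
    """Pop nodes in `order`, pushing each node's users at most once."""
    heap = [(order[start], start)]
    queued = {start}  # every node that has ever been pushed
    visited = []
    pushes = 1
    while heap:
        _, node = heapq.heappop(heap)
        visited.append(node)
        for user in users.get(node, ()):
            if user in queued:
                continue  # skip the duplicate push
            queued.add(user)
            heapq.heappush(heap, (order[user], user))
            pushes += 1
    return visited, pushes


# Two parallel chains (b and c) that both feed the shared op d.
users = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
order = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}
visited, pushes = visit_in_order("a", users, order)
print(visited, pushes)
```

With the `queued` set, the number of pushes is bounded by the number of nodes instead of growing with the number of paths that reach a shared node.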

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145082
Approved by: https://github.com/xmfan
2025-01-28 23:44:45 +00:00
02dd7a7803 Extend abi-stable nitpick message to all the c stable files (#145862)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145862
Approved by: https://github.com/ezyang
2025-01-28 23:22:23 +00:00
049f042e52 Update build_wheel.sh 2025-01-28 15:14:41 -08:00
eea7d395e5 [CD] Install ninja and setuptools from PyPI (#145871)
Rather than Conda
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145871
Approved by: https://github.com/Skylion007
ghstack dependencies: #145870
2025-01-28 23:09:38 +00:00
c26bb9ba5b [CMake] Find HomeBrew OpenMP on MacOS (#145870)
Either via `OMP_PREFIX` envvar or just searching in that folder
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145870
Approved by: https://github.com/Skylion007
2025-01-28 23:09:37 +00:00
f388ba5986 Update CUDNN frontend submodule to 1.10.0 (#145780)
Update to CUDNN 1.10. Most of this release is about supporting some new APIs needed for Blackwell integration and new features in the corresponding CUDNN version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145780
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/malfet
2025-01-28 22:54:24 +00:00
af43b445a5 [ONNX] Set USE_EXPERIMENTAL_LOGIC to True (#137296)
This sets dynamo_export to use the new export logic. The legacy dynamo export logic will be removed as a follow up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137296
Approved by: https://github.com/titaiwangms
2025-01-28 22:35:11 +00:00
5aa5a5763e [inductor triton] Disable incorrect TF32 usage on CUDA capability < 8 (#145684)
Triton 2.2 and greater have a bug where allowing TF32 generation for a GPU that does not support TF32 will cause code generation errors. Patch around this problem by:

1. Adding a function to `torch.cuda` that determines whether CUDA hardware is capable of using the TF32 format.
2. Using that function to explicitly disable TF32 generation when calling Triton, where needed.
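
The two steps can be sketched without any CUDA dependency. Both function names below are hypothetical (this PR's actual `torch.cuda` function is not named in the description); the only fact relied on is that TF32 requires compute capability 8.0 or newer.

```python
def supports_tf32(compute_capability):
    """Hypothetical helper: TF32 needs CUDA compute capability 8.0 or newer."""
    major, _minor = compute_capability
    return major >= 8


def allow_tf32_for_kernel(user_allows_tf32, compute_capability):
    # Only request TF32 code generation when the hardware can honor it.
    return user_allows_tf32 and supports_tf32(compute_capability)


print(allow_tf32_for_kernel(True, (7, 5)))  # pre-Ampere: TF32 is forced off
print(allow_tf32_for_kernel(True, (8, 0)))  # Ampere or newer: user setting wins
```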

To demonstrate that this fix works, try running `test/inductor/test_max_autotune.py` on a GPU with CUDA compute capability < 8 (e.g. a Turing-generation or older NVIDIA GPU) without this fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145684
Approved by: https://github.com/eqy
2025-01-28 22:01:08 +00:00
1ffed44b42 [aotinductor] update unbacked symint runtime assertion msg (#145569)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145569
Approved by: https://github.com/chenyang78
2025-01-28 21:42:58 +00:00
a06a18b1bb [ATen] Implement exception handling for hipsolver APIs (#145839)
Summary: TSA

Test Plan: CI

Differential Revision: D68741194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145839
Approved by: https://github.com/Mellonta
2025-01-28 21:37:23 +00:00
9003d81144 change the test wheel to release wheel when release wheel available (#145252)
change the test wheel to release wheel when release wheel available

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145252
Approved by: https://github.com/seemethere, https://github.com/atalman

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-28 21:23:53 +00:00
4f949f282d [c10d][ez] Remove goto in PGNCCL and make linter happy for PGNCCL and NCCLUtils (#145855)
While working on PGNCCL I found that the code triggers some lint warnings, so this PR addresses them or adds lint suppressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145855
Approved by: https://github.com/c-p-i-o, https://github.com/kwen2501
2025-01-28 21:19:49 +00:00
6bcb545d9c [CI][CUDA][cuSPARSELt] cusparselt 0.6.3 and cu121 related cleanups (#145793)
Make the CI cusparselt installation consistent with the nightly binary.
Remove cu121-related docker build jobs and inductor runs.
Update test failures relating to cu121.

Retry of https://github.com/pytorch/pytorch/pull/145696
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145793
Approved by: https://github.com/eqy, https://github.com/tinglvv
2025-01-28 21:01:58 +00:00
ccc2878c97 Fix fractional_max_pool lowering in inductor (#144395)
Fixes https://github.com/pytorch/pytorch/issues/141538
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144395
Approved by: https://github.com/amjames, https://github.com/eellison
2025-01-28 21:00:18 +00:00
ef28df5c9e [Reland][Environment Variable][4/N] Use thread-safe getenv functions (#140593)
Reland of #137843 , after checking the code again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140593
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-01-28 20:51:49 +00:00
3481c2aec4 Revert "[dynamo] save/restore system random state more carefully (#145750)"
This reverts commit e3d3f2b22e4b75c64eaa2f940a2dd80c1e43435c.

Reverted https://github.com/pytorch/pytorch/pull/145750 on behalf of https://github.com/eellison due to bisected perf regression ([comment](https://github.com/pytorch/pytorch/pull/145750#issuecomment-2620028414))
2025-01-28 20:51:07 +00:00
28982ceb3b [aarch64] Rebuild everything with ArmPL (#145742)
Summary: Rebuild everything that used OpenBLAS with ArmPL

Test Plan: CI, prod test

Reviewed By: Nicoshev

Differential Revision: D68219559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145742
Approved by: https://github.com/malfet
2025-01-28 20:48:42 +00:00
edf266e9bb inductor.config.descriptive_names = False is not actually supported (#145523)
Summary:
This config is not supported (it throws an error when set), and doesn't really make sense imo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145523
Approved by: https://github.com/eellison
2025-01-28 20:22:23 +00:00
515e55e692 Set -DPy_LIMITED_API flag for py_limited_api=True extensions (#145764)
This could be BC breaking, because there was a period of time when we used py_limited_api=True but didn't enforce the flag; now that we will start enforcing the flag, people's custom extensions may fail to build.

This is still strictly better behavior, as it is sketchy to claim CPython agnosticism without the flag, but I'm calling this out because people may complain. Ways to mitigate this risk + reasons this may not be too big a deal:
- People haven't known much about py_limited_api for extensions due to the lack of docs from Python, so usage is low right now
- My current tutorial is set up to make new users of py_limited_api pass this flag, so it'd be a no-op for them.

Test plan:
* Locally I'm confident, as I tried rebuilding ao with this change and it reliably failed (because importing torch/extension.h is a no-no)
* Unit test wise, the normal python_agnostic one I added should work

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145764
Approved by: https://github.com/ezyang, https://github.com/zou3519, https://github.com/albanD
2025-01-28 20:11:05 +00:00
8d91bfd965 [BE] Include CheckFunctionExists in FindBLAS.cmake (#145849)
It's used in the script, so it must be included
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145849
Approved by: https://github.com/Skylion007
2025-01-28 19:47:05 +00:00
eaff13275e [dynamo] Properly branch on an unspecialized NN module (#145786)
User defined NN module might have their own `__len__` or `__bool__`
methods which Dynamo needs to trace through, so that side effects and/or
reads to buffered writes are properly handled.

This patch removes the special `UnspecializedNNModuleVariable` branch in
Dynamo's branch handling, and lets these cases fall into the
`UserDefinedObjectVariable` branch, which handles the aforementioned
cases correctly.
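
A plain-Python reminder of the semantics Dynamo has to reproduce here (no torch involved; `Container` and `AlwaysFalse` are hypothetical stand-ins for user-defined modules): branching on an object calls `__bool__` if it is defined, otherwise `__len__`, and either method may have side effects.

```python
class Container:
    """Plain-Python stand-in for a user-defined module with `__len__`."""

    def __init__(self, items):
        self.items = list(items)
        self.len_calls = 0

    def __len__(self):
        # A side effect that a tracer has to observe to stay correct.
        self.len_calls += 1
        return len(self.items)


class AlwaysFalse(Container):
    def __bool__(self):
        # `__bool__` takes precedence over `__len__` for truthiness.
        return False


def branch(obj):
    return "taken" if obj else "skipped"


empty = Container([])
full = Container([1, 2])
results = (branch(empty), branch(full), branch(AlwaysFalse([1])))
print(results, empty.len_calls, full.len_calls)
```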

Fixes #145284.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145786
Approved by: https://github.com/williamwen42
2025-01-28 19:45:17 +00:00
d9ffa5da65 Log info for AOTAutogradCache bypasses instead of warning (#145768)
Fixes #145767

FxGraphCache also logs at info level instead of warning, so let's do that

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145768
Approved by: https://github.com/eellison, https://github.com/bdhirsh
2025-01-28 19:25:36 +00:00
6c09954a9e Windows builds with VS2022 (#145319)
Fixes https://github.com/pytorch/pytorch/issues/128835
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145319
Approved by: https://github.com/huydhn
2025-01-28 19:07:24 +00:00
cbc4094298 [draft_export] add LOC for data-dep error logging (#145443)
Summary:
maybe this is too much info, but it's difficult to go through old draft export reports where the stack trace is out of sync with the current codebase. Data-dependent errors now look like:
```
2. Data dependent error.
    When exporting, we were unable to evaluate the value of `u306`.
    This occurred at the following stacktrace:
    File /data/users/pianpwk/fbsource/buck-out/v2/gen/fbcode/78204cab86e8a0fb/sigmoid/inference/ts_migration/__pt2i_readiness_main__/pt2i_readiness_main#link-tree/caffe2/torch/fb/training_toolkit/common/proxy_module_thrift/embedding_bag_proxy.py, lineno 109, in _forward_impl:
         `if offsets[-1] > len(input):`
    As a result, it was specialized to evaluate to `261`, and asserts were inserted into the graph.
    Please add `torch._check(...)` to the original code to assert this data-dependent assumption.
    Please refer to https://docs.google.com/document/d/1kZ_BbB3JnoLbUZleDT6635dHs88ZVYId8jT-yTFgf3A/edit#heading=h.boi2xurpqa0o for more details.
```

This would be even more helpful for reports on torch-packaged models, but that requires some more work on PT2I-specific stack trace processing

Test Plan: .

Differential Revision: D68534017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145443
Approved by: https://github.com/angelayi
2025-01-28 18:55:16 +00:00
c32bafeb0b [ROCm] Bump AOTriton to 0.8.2b (#145508)
We received reports that AOTriton kernels mishandle the bias pointer, which causes NaN when fine-tuning the llama3.2-11b vision model. This PR fixes the problem.

Note: this AOTriton 0.8.1b adds head dimension 512 support and thus the binary size increases,  but it is considered experimental and will not be enabled right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145508
Approved by: https://github.com/jeffdaily
2025-01-28 18:34:25 +00:00
621604ce46 Maintain multiple configs (#145103)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Previously, we would finalize the config of a triton template after its first fusion. This PR maintains multiple configs, in case we epilogue fuse, then prologue fuse, and prologue fusion has a new, better config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145103
Approved by: https://github.com/jansel, https://github.com/shunting314
ghstack dependencies: #143408
2025-01-28 18:32:14 +00:00
eaec97ab1f [dynamo] Properly prune dead input cell object (#145781)
This patch models input cell object as "newly created" rather than
"pre-existing" python object (see added documentation for why this
actually captures the semantics more accurately).

This enables the `SideEffects.prune_dead_object_new` algorithm to prune
away writes to input cell objects which are no longer relevant; this
didn't happen prior to this patch because we modelled them as
pre-existing objects, which forces us to codegen their attribute
mutations.

Fixes #145564.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145781
Approved by: https://github.com/williamwen42, https://github.com/jansel
2025-01-28 18:28:13 +00:00
8e258e2ecd Parallelize epilogue/prologue benchmarking (#143408)
When we attempt prologue or epilogue fusion with a TritonTemplate, we benchmark it at compile time in order to determine profitability. This avoids slowdowns/register spilling, and allows us to pick fusion when a base triton template is slower than cublas but faster when considering an epilogue. However, that fused benchmarking does not do the same async compilation as we do for the base TritonTemplate. The Base TritonTemplate is async compiled during lowering, then later waited on and benchmarked.

This PR extends a similar process to benchmarking fused TritonTemplates in the scheduler. We keep a list of pending fusions which have async compilations. And we resolve any pending fusions a node is in prior to attempting to fuse it with any other node.

Initially, I saw some slowdowns with this because we kick off async compilations of identical fusions in parallel. To address this I added source code caching at the `async_compile` level (we also already cache benchmark runs, but that would not happen in parallel).
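
The source-level caching idea can be sketched with a thread pool that keeps one future per distinct source string. This is a toy model under stated assumptions: `CachingAsyncCompiler` and `fake_compile` are hypothetical and do not mirror the real `async_compile` API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CachingAsyncCompiler:
    """Hypothetical compile pool that deduplicates work by source text."""

    def __init__(self, compile_fn, max_workers=4):
        self._compile_fn = compile_fn
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}
        self._lock = threading.Lock()

    def submit(self, source_code):
        # Identical source shares one future, so it is compiled only once
        # even when several pending fusions request it in parallel.
        with self._lock:
            future = self._futures.get(source_code)
            if future is None:
                future = self._pool.submit(self._compile_fn, source_code)
                self._futures[source_code] = future
            return future

    def close(self):
        self._pool.shutdown(wait=True)


compile_log = []


def fake_compile(source_code):
    compile_log.append(source_code)
    return f"binary({source_code})"


compiler = CachingAsyncCompiler(fake_compile)
pending = [compiler.submit(src) for src in ["kernel_a", "kernel_b", "kernel_a"]]
results = [future.result() for future in pending]
compiler.close()
print(results, sorted(compile_log))
```

Caching the future rather than the finished result is what lets a second, identical request attach to a compilation that is still in flight.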

Compilation speedups:

<img width="717" alt="image" src="https://github.com/user-attachments/assets/8e8f7d6c-7824-4210-83f9-a2a0f6db5ac9" />

This also should let us be a bit more aggressive with either configs, or benchmarking other fusions which are hard to determine profitability of.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143408
Approved by: https://github.com/jansel, https://github.com/shunting314
2025-01-28 18:18:24 +00:00
3fd4691908 [MPS] Add op_math_t (#145808)
Similar to `at::opmath_t` to be used for reduction (and int mms)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145808
Approved by: https://github.com/dcci
2025-01-28 18:03:52 +00:00
5382ab57d7 Move trunk windows builds to CUDA-12.4 (#145844)
Same as : https://github.com/pytorch/pytorch/pull/130446

That should catch build regressions that were previously only detectable during the nightly builds for 12.4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145844
Approved by: https://github.com/janeyx99, https://github.com/malfet
2025-01-28 18:00:51 +00:00
56915b093a Fix environment deployment spam (#145823)
With https://github.com/pytorch-labs/pytorch-gha-infra/pull/598 in place, the environment can now be removed.

Fixes https://github.com/pytorch/pytorch/issues/145704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145823
Approved by: https://github.com/clee2000
2025-01-28 17:46:31 +00:00
cfbb27462e Revert "[inductor][BE] Enable test_cpu_cpp_wrapper in fbcode (#145373)"
This reverts commit b8087747f5ca7be0d37b1ac85dc0894f6a33e3a3.

Reverted https://github.com/pytorch/pytorch/pull/145373 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145373#issuecomment-2619674197))
2025-01-28 17:46:11 +00:00
dbef2a9bc9 Revert "Remove lexicographical sorting of storage keys in torch.save (#143879)"
This reverts commit 7db0afabaaff17dd37cf846cd786610ebf6aedd3.

Reverted https://github.com/pytorch/pytorch/pull/143879 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D68746524 for details ([comment](https://github.com/pytorch/pytorch/pull/143879#issuecomment-2619661492))
2025-01-28 17:40:16 +00:00
097ccd9c39 Move ROCm MI300 jobs to unstable to make CI green (#145790)
This is a temporary change to reduce intermittent tests failures. Jobs can be moved back once those machines get better runner isolation.

This also sneaks in a small fix to all the rocm job's build step to be run on Linux Foundation runners (the get-label-type dependency).  The inductor-rocm-mi300 workflow already had it, but it was missing in the rocm-mi300 workflow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145790
Approved by: https://github.com/yangw-dev
2025-01-28 17:25:15 +00:00
7eb51e5464 Ensure GPU isolation for kubernetes pod MI300 runners. (#145829)
Fixes the reason behind moving the tests to unstable initially. (https://github.com/pytorch/pytorch/pull/145790)
We ensure gpu isolation for each pod within kubernetes by propagating the drivers selected for the pod from the Kubernetes layer up to the docker run in pytorch here.
Now we stick with the GPUs assigned to the pod in the first place and there is no overlap between the test runners.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145829
Approved by: https://github.com/jeffdaily
2025-01-28 17:20:46 +00:00
cyy
c751541e79 Fix cppcoreguidelines-init-variables ignorance (#141795)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141795
Approved by: https://github.com/albanD
2025-01-28 17:11:37 +00:00
ac87388e61 [AOTInductor] Refactor CPU and GPU to remove ifdef macros (#145639)
Summary: Remove #ifdef USE_CUDA macros through some refactoring

Test Plan: Refactor code, existing tests.

Differential Revision: D68636743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145639
Approved by: https://github.com/desertfire
2025-01-28 16:46:00 +00:00
6967ef1b07 [ROCm] fix test_cublas_workspace_explicit_allocation for gfx12 (#145227)
gfx12 passes the condition `torch.cuda.get_device_capability() >= (9, 4)` and uses `default_workspace_size=128MB`, but that size is required only for MI300.
Fix condition to use `("gfx94" in gcn_arch)` instead of `torch.cuda.get_device_properties()` to detect MI300.
Now `default_workspace_size=32MB` is used for gfx12 and the test passes
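
The pitfall can be sketched with plain tuple comparison. This is a minimal illustration only: the capability and architecture values below are assumptions (a gfx12 device is assumed to report `(12, 0)` and an architecture string starting with `gfx12`), and both helper names are hypothetical rather than taken from the test.

```python
# Hypothetical (capability, gcn_arch) pairs; the concrete values are assumptions.
MI300 = ((9, 4), "gfx942")
GFX12 = ((12, 0), "gfx1200")


def is_mi300_by_capability(capability):
    # Tuples compare lexicographically, so any newer major version also passes.
    return capability >= (9, 4)


def is_mi300_by_arch(gcn_arch):
    # Matching on the architecture name only accepts the gfx94x family.
    return "gfx94" in gcn_arch


for capability, arch in (MI300, GFX12):
    print(arch, is_mi300_by_capability(capability), is_mi300_by_arch(arch))
```

Under these assumptions the capability check accepts both devices, while the architecture-name check accepts only the MI300 entry.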

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145227
Approved by: https://github.com/jeffdaily, https://github.com/eqy
2025-01-28 16:19:27 +00:00
80a0412b76 [dynamo][builtin-skipfiles-cleanup] Remove posixpath (#145828)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145828
Approved by: https://github.com/zou3519
ghstack dependencies: #145744, #145753, #145826
2025-01-28 16:14:34 +00:00
6824a4a75d [dynamo][builtin-skipfiles-cleanup] Remove re (#145826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145826
Approved by: https://github.com/zou3519
ghstack dependencies: #145744, #145753
2025-01-28 16:14:34 +00:00
4307e6c008 [dynamo][builtin-skipfile-cleanup] Remove signal (#145753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145753
Approved by: https://github.com/zou3519
ghstack dependencies: #145744
2025-01-28 16:14:23 +00:00
3a56089217 fix unbacked + view incorrectness (#145548)
fix for https://github.com/pytorch/pytorch/issues/143498

We were incorrectly using contiguous strides for a non-contiguous tensor. There are two separate causes:

1. https://github.com/pytorch/pytorch/pull/110520 made it so we turn Views contiguous with unbacked symints because
`dynamic_reshape_indexer below will fail due to the size_hint's inability to process unbacked SymInts`. Seems like we should fix that. Regardless, it will make the input contiguous if the input is unbacked to work around this.

2. We weren't actually making it contiguous! I filed an issue for this here: https://github.com/pytorch/pytorch/issues/145561.

This is still worth landing as a fix, even though we should fix those issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145548
Approved by: https://github.com/desertfire
2025-01-28 16:03:45 +00:00
97b3b73f3e [Environment Variable][7/N] Use thread-safe getenv functions (#140211)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211
Approved by: https://github.com/ezyang, https://github.com/eqy
2025-01-28 15:21:12 +00:00
a08f7f3266 OpenReg: fix issue of pin_memory (#145046)
Fix issue of `pin_memory` when rewrapping a storage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145046
Approved by: https://github.com/albanD
2025-01-28 09:41:04 +00:00
bdf6dfa17d [chore][ez] change alloc buffer size from 4000 to 4096 (#145759)
Summary:
Allocations typically happen as a power of 2 anyway.
Change the default alloc size to 4096 to eke out a bit more perf.
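
As a quick illustration of the size choice (a sketch, not the store's code; both helpers are hypothetical names), 4000 is not a power of two and the next one up is 4096:

```python
def is_power_of_two(n: int) -> bool:
    # A positive power of two has exactly one bit set.
    return n > 0 and (n & (n - 1)) == 0


def next_power_of_two(n: int) -> int:
    # Smallest power of two that is >= n (for n >= 1).
    return 1 << (n - 1).bit_length()


print(is_power_of_two(4000), is_power_of_two(4096))
print(next_power_of_two(4000))
```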

Test:
unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145759
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
ghstack dependencies: #145756, #145757
2025-01-28 09:14:07 +00:00
5c5306e8bc [dynamo][builtin-skiplist-cleanup] Remove weakref (#145744)
WeakKeyDictionary already works very nicely with the UserDefinedObject Variable Tracker.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145744
Approved by: https://github.com/jansel
2025-01-28 07:55:12 +00:00
45f64e770a relax assertion to warning for unbacked binding names (#145777)
Summary:
Quick fix following up on https://github.com/pytorch/pytorch/pull/144894 to unblock internal tests.

Will keep investigating a more principled fix.

Test Plan: Failures in T213563826 now pass

Differential Revision: D68731710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145777
Approved by: https://github.com/angelayi
2025-01-28 07:52:40 +00:00
0a8a0ef767 [inductor] Fix crash running wrapper_benchmark with no device (#145644)
Fixes #145434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145644
Approved by: https://github.com/shunting314
2025-01-28 07:31:36 +00:00
a699034eec Record inputs at time of tracing, constrain to them for triton fn (#145448)
Record input fake tensors at time of tracing and store them in the node meta. Inductor passes have the possibility of changing strides, so it is safer to record the strides of the inputs at tracing. See https://github.com/pytorch/pytorch/issues/137979 for more context.

We can also extend this to custom ops, and user-visible outputs. If this ends up being compilation time sensitive we can just record strides (and maybe storage offset, per @zou3519) instead of the complete fake tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145448
Approved by: https://github.com/zou3519
2025-01-28 07:07:14 +00:00
0f5a68344a [BE][Inductor] Simplify custom_op tests (#145814)
Not sure what the motivation was behind repeating the same function over and over again for different backends.
Change `test_custom_op_[123]` from accepting separate (but identical) implementations for CPU, CUDA and XPU to taking just `fn` and `fn_meta` args.

Test that it is also extendable to MPS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145814
Approved by: https://github.com/jansel
2025-01-28 05:58:51 +00:00
23eb0a3201 Improve typing in torch/types.py (#145237)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145237
Approved by: https://github.com/XuehaiPan, https://github.com/albanD

Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
2025-01-28 05:29:12 +00:00
8e46d0f595 [BE]: Update typing of OrderedSet ancestor (#145783)
Now that we are on python 3.9 minimum version we can properly use Generics in the superclass
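
A minimal sketch of what subscripting the ABC in the superclass looks like on Python 3.9+. The class below is a toy written for illustration under that assumption; it is not the `OrderedSet` implementation touched by this PR.

```python
from collections.abc import Iterable, Iterator, MutableSet
from typing import Optional, TypeVar

T = TypeVar("T")


class OrderedSet(MutableSet[T]):
    """Toy insertion-ordered set backed by a dict (illustrative only)."""

    def __init__(self, items: Optional[Iterable[T]] = None) -> None:
        # dict keys preserve insertion order and drop duplicates.
        self._dict = dict.fromkeys(items or ())

    def __contains__(self, item: object) -> bool:
        return item in self._dict

    def __iter__(self) -> Iterator[T]:
        return iter(self._dict)

    def __len__(self) -> int:
        return len(self._dict)

    def add(self, value: T) -> None:
        self._dict[value] = None

    def discard(self, value: T) -> None:
        self._dict.pop(value, None)


s = OrderedSet(["b", "a", "b"])
s.add("c")
print(list(s))
```
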
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145783
Approved by: https://github.com/eellison
2025-01-28 04:43:49 +00:00
cyy
67fcc7cf02 [3/N] Remove unnecessary once flag usage (#145672)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145672
Approved by: https://github.com/albanD
2025-01-28 04:28:18 +00:00
01a4d86b31 add pt2 callbacks for backward pass and prevent duplicate callbacks (#145732)
Summary: This change adds callbacks for lazy backwards compilation while preventing duplicate callbacks from being fired.

Differential Revision: D68577593

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145732
Approved by: https://github.com/mlazos
2025-01-28 03:50:02 +00:00
1a26cdd5cb [cond] remove warning for unsupported tuple returns (#145766)
I guess this is supported now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145766
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2025-01-28 03:13:36 +00:00
9010649292 Revert "Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880)"
This reverts commit db3685a35cdce32622ab89f6c92e09d52210ff53.

Reverted https://github.com/pytorch/pytorch/pull/143880 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but either this PR or the base PR breaks distributed tests ([comment](https://github.com/pytorch/pytorch/pull/143880#issuecomment-2617743403))
2025-01-28 03:07:17 +00:00
78f02bf07c [bug] handle case when remote peer closes connection (#145757)
Summary:
In the case where the remote peer closes the connection, nread returns 0. In
this case, we still want to free up the allocated buffer.
Also, reorder the if so that the likely success case (nread > 0) is at
the top of the function with an early return.

Test Plan:
unit tests

Differential Revision: [D68733192](https://our.internmc.facebook.com/intern/diff/D68733192)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145757
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
ghstack dependencies: #145756
2025-01-28 03:06:38 +00:00
4be831ba2d [draft_export] fix dense-in-memory check for inferring fakes (#145653)
Test Plan: fixes check for dense tensors with size-1 dimensions

Differential Revision: D68644028

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145653
Approved by: https://github.com/zou3519
2025-01-28 02:52:14 +00:00
7c1fc0a047 Log cache state for AOTAutograd in title of file (#145715)
Differential Revision: [D68692755](https://our.internmc.facebook.com/intern/diff/D68692755/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145715
Approved by: https://github.com/bobrenjc93
2025-01-28 02:14:18 +00:00
78a94c9114 [inductor] Remove type ignores from scheduler.py (#145712)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145712
Approved by: https://github.com/yanboliang, https://github.com/Skylion007
ghstack dependencies: #145692
2025-01-28 01:44:32 +00:00
2df2f9d895 [inductor] Change type of get_backend_features to OrderedSet (#145692)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145692
Approved by: https://github.com/yanboliang
2025-01-28 01:44:32 +00:00
db33d23aa8 [SymmetricMemory] fix an issue where rendezvous is performed with wrong device context when torch.cuda.set_device() is not called (#144886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144886
Approved by: https://github.com/awgu
2025-01-28 01:43:37 +00:00
e3d3f2b22e [dynamo] save/restore system random state more carefully (#145750)
Reattempt of https://github.com/pytorch/pytorch/pull/145435 since the state of the linked internal diff appears to be messed up.

Note: I have verified that the previously failing internal tests now pass internally.

Differential Revision: [D68723334](https://our.internmc.facebook.com/intern/diff/D68723334)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145750
Approved by: https://github.com/StrongerXi
2025-01-28 01:34:13 +00:00
f16ce3c7e9 Refactor fuzzer and add support for Dynamo (#145565)
## Summary:
Dynamo now works with config fuzzer.

For BE week, we also found and fixed 5 different bugs (in inductor):
- https://github.com/pytorch/pytorch/pull/145426
- https://github.com/pytorch/pytorch/pull/145523
- https://github.com/pytorch/pytorch/pull/145527
- https://github.com/pytorch/pytorch/pull/145532
- https://github.com/pytorch/pytorch/pull/145538

## Test Plan:
New Dynamo Unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145565
Approved by: https://github.com/masnesral
2025-01-28 00:44:27 +00:00
6eb74fbec6 Updates NCCL user buffer registration test for NCCL 2.24.3 (#145285)
NCCL 2.24.3 changed the content of the debug output for NVLS registration. We use this debug output in our test suite to check if NVLS was successfully registered or not. Hence we need to specialize for the NCCL version in the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145285
Approved by: https://github.com/kwen2501
2025-01-28 00:24:53 +00:00
5a4d959cdb [dynamo] Properly model torch profiler context objects (#145537)
Prior to this patch, Dynamo conveniently modelled torch profiler context
objects (e.g., `torch.profiler.profile`) as `NullContextVariable`
because `torch.compile` ignores the effect of these profiler contexts.

However, the semantics of these profiler contexts diverge from
`contextlib.nullcontext` in the `__enter__` function, where the former
returns `self` and the latter returns `None`. This causes a subtle error,
as observed in #125021.

This patch adds back a `ProfilerContextVariable`, which addresses the
aforementioned semantic discrepancy.
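
The `__enter__` difference can be seen without torch. In this minimal sketch, `FakeProfiler` is a hypothetical stand-in for a profiler context (not a torch API) that returns `self` from `__enter__`:

```python
import contextlib


class FakeProfiler:
    """Hypothetical stand-in for a profiler context; not a torch API."""

    def __enter__(self):
        return self  # profiler-style: `as` binds the context object itself

    def __exit__(self, exc_type, exc, tb):
        return False  # do not swallow exceptions


with contextlib.nullcontext() as a:
    pass
with FakeProfiler() as b:
    pass

print(a is None, isinstance(b, FakeProfiler))
```

Code that calls a method on the object bound by `as` works with the second form but fails on `None` with the first.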

Fixes #125021.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145537
Approved by: https://github.com/zou3519, https://github.com/williamwen42
2025-01-28 00:03:36 +00:00
db3685a35c Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880)
## Background

This PR adds `torch.utils.serialization.config.load.calculate_storage_offsets`. This option relies on the previous PR in this stack, where storage order was changed to non-lexicographical. A `.format_version` entry was added to the zipfile, and `calculate_storage_offsets` will only work on checkpoints with `.format_version`.

When this is turned on, for `torch.load(mmap=True)`, the offset of each storage record (other than the 0th storage) will be calculated instead of relying on `miniz` APIs to determine it.

The existing APIs will issue multiple random reads (reading the end of central directory record, then reading the zipfile header for the record) to determine the storage offset where the record starts. This can greatly degrade `torch.load(mmap=True)` performance for non-filesystem cases.

6aaae9d78f/caffe2/serialize/inline_container.cc (L589-L605)

## Testing strategy

The agreed upon testing strategy was as follows:
- Add debug code gated by an environment flag `TORCH_SERIALIZATION_DEBUG` that will run this offset calculation logic and verify it against getRecordOffset for each storage (when mmap=False)
- This flag is set throughout CI, which means that every time `torch.load` is called, the offset calculation logic is implicitly being tested.

Differential Revision: [D67673026](https://our.internmc.facebook.com/intern/diff/D67673026)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143880
Approved by: https://github.com/albanD
ghstack dependencies: #143879
2025-01-27 23:57:30 +00:00
7db0afabaa Remove lexicographical sorting of storage keys in torch.save (#143879)
Currently the order is lexicographical (i.e. 0, 10, 11, ..., 19, 2, ...) instead of 0, 1, 2, 3, 4, 5 (the order that storage metadata is actually pickled in). Since PyTorch will never be used with Python < 3.7, we can be assured that the keys will be read in the order of insertion (numerically sorted).

This makes the order in which storages are written the same as the pickling/unpickling order, so we can calculate their offsets with fewer random reads.
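
The two orders can be compared with a small sketch. It assumes storage keys are numeric strings inserted in increasing order; the variable names are illustrative.

```python
# Storage keys are assumed to be numeric strings, inserted in increasing order.
keys = [str(i) for i in range(12)]

lexicographic = sorted(keys)           # string comparison: "10" sorts before "2"
insertion = list(dict.fromkeys(keys))  # dicts keep insertion order (Python 3.7+)

print(lexicographic)
print(insertion)
```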

Differential Revision: [D67673025](https://our.internmc.facebook.com/intern/diff/D67673025)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143879
Approved by: https://github.com/albanD
2025-01-27 23:57:30 +00:00
c1161957a4 inductor_config_logging: Don't drop keys (#144700)
This bit me while I was trying to debug some trace issues.
In general this config is already quite large when dumping, so adding
more fields doesn't make it significantly worse.

Also a number of the items we are type checking for (except the test
configs), don't even show up. Primarily this will help us when debugging
rocm, halide, and trace configs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144700
Approved by: https://github.com/ezyang
2025-01-27 23:47:25 +00:00
7d01f6e6f2 Add ignorable commits on run_test.py to git blame ignore (#145787)
Chanced upon it while searching through cpp_extension related code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145787
Approved by: https://github.com/malfet
2025-01-27 23:24:48 +00:00
3ce68dc61e [c10d] Flush file in file recorder (#145458)
Summary:
Flushing file to hopefully prevent file corruptions as reported in
https://github.com/pytorch/pytorch/pull/145125

Test Plan:
Couldn't get file corruption to occur in my tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145458
Approved by: https://github.com/kwen2501
2025-01-27 23:15:52 +00:00
5534c270db [chore] fix new linter (#145756)
Summary:
Fix new linter that's complaining when I made changes to this file:
class 'LibUVStoreDaemon' defines a non-default destructor but does not
define a copy constructor, a copy assignment operator, a move
constructor or a move assignment operator

Test Plan:
make lint passes

Differential Revision: [D68733191](https://our.internmc.facebook.com/intern/diff/D68733191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145756
Approved by: https://github.com/XilunWu, https://github.com/Skylion007, https://github.com/fduwjj
2025-01-27 22:48:12 +00:00
2de53b3b65 Revert "pickler for GraphModule (#141659)"
This reverts commit c6ad08357bf8e766b5220bfb5cbbfdb2a4ec0ca5.

Reverted https://github.com/pytorch/pytorch/pull/141659 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, please take a look at D68694181 for more details. ([comment](https://github.com/pytorch/pytorch/pull/141659#issuecomment-2617045120))
2025-01-27 22:39:30 +00:00
006397fac3 Remove FBGEMM sccache hack (#145664)
Testing https://github.com/pytorch/pytorch/actions/runs/12959358756, sccache is working correctly now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145664
Approved by: https://github.com/wdvr
2025-01-27 22:00:06 +00:00
69e82d02d3 [inductor][3/N] triton support post-#5512, tt.divisibility format (#145575)
1. Fix the tt.divisibility format in hints.py. Previously, it was `{((0,), (1,)): [["tt.divisibility", 16]]}`. Now it is `{(0,): [["tt.divisibility", 16]], (1,): [["tt.divisibility", 16]]}`. This was an oversight in the first PR I added. I've verified that we now get `{ tt.divisibility = 16 }` in the generated TTGIR.
2. Update the test_codegen_triton.py test to work with multiple triton versions (and test this divisibility format in the new triton version)
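
The format change in item 1 can be sketched as a small conversion. `split_divisibility_hints` is a hypothetical helper written for illustration; it is not the code in hints.py.

```python
def split_divisibility_hints(old_hints):
    """Expand grouped argument keys into one entry per argument index."""
    new_hints = {}
    for arg_group, attrs in old_hints.items():
        for arg_index in arg_group:
            # Copy the attribute lists so entries do not alias each other.
            new_hints[arg_index] = [list(attr) for attr in attrs]
    return new_hints


old = {((0,), (1,)): [["tt.divisibility", 16]]}
print(split_divisibility_hints(old))
```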

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145575
Approved by: https://github.com/SamGinzburg
2025-01-27 21:48:58 +00:00
993b229665 [dynamo][dicts] Fix dict.__new__ bug (#145723)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145723
Approved by: https://github.com/jansel, https://github.com/StrongerXi
ghstack dependencies: #145519, #145547, #145558
2025-01-27 21:42:43 +00:00
7e1c7253e9 [dynamo][builtin-skipfile-cleanup] Support tuple.__new__ (#145558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145558
Approved by: https://github.com/jansel, https://github.com/StrongerXi
ghstack dependencies: #145519, #145547
2025-01-27 21:42:43 +00:00
1ba1b7b597 Support remaining *_like factory functions for NJT (#144889)
Fixes #144761

This PR adds NJT impls for those *_like functions that were previously missing:
* `full_like()`
* `rand_like()`
* `randint_like()`

It also fixes a bug in existing *_like functions when a new device is specified. Fix is to also transfer `offsets` / `lengths` to the new device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144889
Approved by: https://github.com/soulitzer
2025-01-27 21:33:51 +00:00
3a23d75b37 [MPS] Fix c0::metal::log_gamma correctness on M4 (#145740)
Works around a bug where the `abs` method call seems to be ignored before calling log. The bug can be reproduced by running the following code (submitted as FB16415011):
```swift
import Metal

func run_shader<T: BinaryFloatingPoint> (library: MTLLibrary, kernel_name: String, type: T.Type, nelem: Int = 16) {
  guard let mfunc = library.makeFunction(name: kernel_name) else { fatalError("Can't find function") }
  let device = library.device
  guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") }
  guard let cmdBuffer = queue.makeCommandBuffer() else { fatalError("Can't make command buffer") }
  guard let computeEncoder = cmdBuffer.makeComputeCommandEncoder() else { fatalError("Can't make compute encoder") }
  guard let ibuf = device.makeBuffer(length:nelem * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let ibuf_data = ibuf.contents().assumingMemoryBound(to: T.self)
  for i in 0..<nelem {
    ibuf_data[i] = T(sin(Float(2 + i)))
  }
  guard let obuf = device.makeBuffer(length:nelem * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let obuf_data = obuf.contents().assumingMemoryBound(to: T.self)

  computeEncoder.setComputePipelineState(try! device.makeComputePipelineState(function: mfunc))
  computeEncoder.setBuffer(obuf, offset:0, index: 0)
  computeEncoder.setBuffer(ibuf, offset:0, index: 1)
  computeEncoder.dispatchThreads(MTLSizeMake(nelem, 1, 1), threadsPerThreadgroup:MTLSizeMake(nelem, 1, 1))
  computeEncoder.endEncoding()
  cmdBuffer.commit()
  cmdBuffer.waitUntilCompleted()

  print("Results for \(String(describing: T.self)):", terminator: " ")
  for i in 0..<nelem {
    print(obuf_data[i], terminator: " ")
  }
  print()
}

let shader_source = """
#include <metal_stdlib>

template<typename T>
float foo(T x) {
  const auto abs_x = ::metal::abs(static_cast<float>(x));
  auto rc = ::metal::log(abs_x);

  return rc - ::metal::log(::metal::abs(abs_x * ::metal::sinpi(abs_x)));
}

kernel void half_kernel(
    device half* out_ptr0,
    constant half* in_ptr0,
    uint xindex [[thread_position_in_grid]]
) {
  auto inp = in_ptr0[xindex];
  auto out = foo(inp);
  out_ptr0[xindex] = static_cast<half>(out);
}

kernel void float_kernel(
    device float* out_ptr0,
    constant float* in_ptr0,
    uint xindex [[thread_position_in_grid]]
) {
  auto inp = in_ptr0[xindex];
  auto out = foo(inp);
  out_ptr0[xindex] = static_cast<float>(out);
}
"""
let options = MTLCompileOptions()
options.mathMode = .safe
options.mathFloatingPointFunctions = .precise

guard let device = MTLCopyAllDevices().first else { fatalError("No Metal device found") }
let library = try! device.makeLibrary(source:shader_source, options:options)
run_shader(library:library, kernel_name:"half_kernel", type: Float16.self)
run_shader(library:library, kernel_name:"float_kernel", type: Float.self)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145740
Approved by: https://github.com/dcci
2025-01-27 21:24:22 +00:00
60f98262f1 PEP585: .github (#145707)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145707
Approved by: https://github.com/huydhn
2025-01-27 21:21:01 +00:00
bfaf76bfc6 [dynamo] clear out traced frames at the start of test_log_traced_frames (#145640)
The test was being flaky in CI, and this patch fixes it.

Fixes #137461.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145640
Approved by: https://github.com/williamwen42
2025-01-27 20:49:59 +00:00
93dd6bc4d8 Add CUDA 12.8 installation and manylinux-cuda12.8 (#145567)
Breaking https://github.com/pytorch/pytorch/pull/145557 into two parts.
Need to have manylinux-cuda12.8 in order to build magma.

Issue: https://github.com/pytorch/pytorch/issues/145570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145567
Approved by: https://github.com/nWEIdia, https://github.com/atalman
2025-01-27 20:49:07 +00:00
64cd81712d torch.distributions: replace numbers.Number with torch.types.Number. (#145086)
Fixes #144788 (partial)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145086
Approved by: https://github.com/malfet
2025-01-27 20:24:55 +00:00
2f8ad8f4b9 Run inductor perf benchmark on ROCm (#145763)
This requires https://github.com/pytorch/pytorch/pull/144594.  The test run on PT2 dashboard is at https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2020%20Jan%202025%2019%3A46%3A14%20GMT&stopTime=Mon%2C%2027%20Jan%202025%2019%3A46%3A14%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=rocm&lBranch=144594&lCommit=9f5cb037965aa2990b2e4593610bca92526ebb3b&rBranch=144594&rCommit=9f5cb037965aa2990b2e4593610bca92526ebb3b

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145763
Approved by: https://github.com/jeffdaily
2025-01-27 20:19:03 +00:00
66631bc84b [dynamo] Fix read/write conflicts in a cuda test (#145658)
Prior to this patch, the `test_cuda_event_created_outside_of_graph`
is flaky in CI, and that's because we have read and write to the same
`foo` tensor buffer from 2 different streams. This patch eliminates that
by adding a synchronization to wait till read finishes before starting
the write.

Fixes #133837, #133828.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145658
Approved by: https://github.com/yifuwang
2025-01-27 19:55:57 +00:00
c986eba560 Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit abf28982a8cb43342e7669d859de9543fd804cc9.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. @Chillee can you please help change get remerged? See  D68720562 ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2616726406))
2025-01-27 19:38:26 +00:00
9728e900dc [Inductor][CPP] fix torch logit decomposition (#145576)
**Summary**

Fix issue https://github.com/pytorch/pytorch/issues/145379. The current decomposition uses `self = torch.clamp(self, lo, hi)`, which gives a wrong result compared to the eager implementation when `lo` is larger than `hi`: cd68d54911/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp (L165)
Align their behavior in this PR.
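
A scalar sketch of the mismatch, in plain Python. `clamp_like` mirrors the documented `torch.clamp` behavior when `min > max` (every element becomes `max`); `eager_like` is an assumption about the shape of the eager kernel (lower bound tested first), not a copy of it.

```python
def clamp_like(x, lo, hi):
    # min(max(x, lo), hi): when lo > hi every input collapses to hi.
    return min(max(x, lo), hi)


def eager_like(x, lo, hi):
    # Assumed shape of the eager CPU kernel: test the lower bound first.
    return lo if x < lo else (hi if x > hi else x)


eps = 0.8  # eps > 0.5 makes lo > hi
lo, hi = eps, 1 - eps
for x in (0.1, 0.5, 0.9):
    print(x, clamp_like(x, lo, hi), eager_like(x, lo, hi))
```

With `eps = 0.8`, inputs below `lo` map to `hi` under the clamp-style version but to `lo` under the eager-style version.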

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_torch_logit
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145576
Approved by: https://github.com/jgong5, https://github.com/eellison
2025-01-27 19:37:51 +00:00
635b98fa08 Add nitpick warning that aoti_torch/c/shim.h is ABI stable (#145745)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145745
Approved by: https://github.com/albanD
2025-01-27 19:25:37 +00:00
bc377c503e [Custom Ops] Fix f-strings in custom ops error message (#145673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145673
Approved by: https://github.com/zou3519
ghstack dependencies: #145588
2025-01-27 19:22:43 +00:00
ec91b7720f [Custom Ops] Add a new API to allow users to register an autocast for the custom op (#145588)
Fixes #137033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145588
Approved by: https://github.com/zou3519
2025-01-27 19:22:43 +00:00
f951d216e0 [autocast][pytorch] Support autocast for MTIA (policy) (#145666)
Summary: Add autocast support for MTIA (policy)

Reviewed By: egienvalue

Differential Revision: D68604796

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145666
Approved by: https://github.com/chaos5958
2025-01-27 18:26:04 +00:00
1835e1eb98 [BE] Remove test_ops from FIXME_inductor_dont_reset_dynamo (#145307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145307
Approved by: https://github.com/zou3519, https://github.com/FindHao
2025-01-27 18:12:39 +00:00
835e770bad Use typing.IO[bytes] instead of io.BytesIO in annotations (#144994)
Fixes #144976

Using approach ① `IO[bytes]`, but could also try with a protocol.

## Notes:

- moved `torch.serialization.FILE_LIKE` to `torch.types.FileLike`
- Use `FileLike` annotation where it makes sense
- made sure those functions also support `os.PathLike`
- Replaced `isinstance(x, io.BytesIO)` with `isinstance(x, (io.IOBase, IO))` where appropriate.
- Replaced `BinaryIO` with `IO[bytes]` (the two ABCs are almost identical, the only difference is that `BinaryIO` allows `bytearray` input to `write`, whereas `IO[bytes]` only `bytes`)
- needed to make `torch.serialization._opener` generic to avoid LSP violations.
- skipped `torch/onnx/verification` for now (functions use `BytesIO.getvalue`, which is not part of the `IO[bytes]` ABC, but it kind of seems that this is redundant, as e.g. `onnx.load` supports `str | PathLike[str] | IO[bytes]` directly...)
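
The annotation style in the notes above can be sketched as follows. The `FileLike` alias here is a hypothetical stand-in written for illustration; the exact definition of `torch.types.FileLike` may differ, and `read_header` is an invented example function.

```python
import io
import os
import tempfile
from typing import IO, Union

# Hypothetical alias; the real torch.types.FileLike may be defined differently.
FileLike = Union[str, os.PathLike, IO[bytes]]


def read_header(f: FileLike, n: int = 4) -> bytes:
    """Return the first n bytes from a path or an open binary stream."""
    if isinstance(f, (str, os.PathLike)):
        with open(f, "rb") as fh:
            return fh.read(n)
    return f.read(n)


# An in-memory stream is accepted without requiring io.BytesIO specifically.
print(read_header(io.BytesIO(b"PK\x03\x04rest")))

# A filesystem path is accepted too.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "blob.bin")
    with open(path, "wb") as fh:
        fh.write(b"abcdef")
    print(read_header(path))
```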

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144994
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2025-01-27 18:08:07 +00:00
abf28982a8 [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee
2025-01-27 18:05:23 +00:00
30dea8429d [MPS][BE] Use convenience methods to set args (#145736)
It's better to call `mtl_setArgs` rather than set arguments one by one with the risk of making a typo

Also, all interactions with MTLCommandBuffer must be serialized, which is commonly done using dispatch queues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145736
Approved by: https://github.com/Skylion007
2025-01-27 17:42:01 +00:00
7db20ffd68 Remove public_allowlist from TestPublicBindings.test_correct_module_names and ensure private_allowlist-ed things are actually private (#145620)
This passes locally, also sanity checked importing these modules on [colab](https://colab.research.google.com/drive/1edynWX1mlQNZIBxtb3g81_ZeTpAqWi19?usp=sharing)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145620
Approved by: https://github.com/albanD
2025-01-27 17:30:02 +00:00
5d01a2874f Increase the number of perf benchmark shards (#145534)
Per the discussion on https://github.com/pytorch/pytorch/issues/140332#issuecomment-2610805551, this adds 2 more shards for HF, 2 more for TorchBench, and 1 more for TIMM.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145534
Approved by: https://github.com/jeanschmidt
2025-01-27 16:20:42 +00:00
639dd54ef7 [BE] Use copy_method to import all tests (#145718)
Fewer chances for typos when doing the imports

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145718
Approved by: https://github.com/dcci
2025-01-27 16:01:12 +00:00
2e80093306 setitem node shouldn't be deadcode eliminated (#145714)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/145697. The `operator.setitem` has been eliminated as dead code, causing a correctness issue. Mark it as impure in this PR to avoid this side effect.
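
Why an impure marker matters can be shown with a toy dead-code-elimination pass. This is a simplified sketch with hypothetical `Node` and `eliminate_dead_code` names; it is not the torch.fx implementation.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    inputs: tuple = ()
    impure: bool = False  # True for nodes with side effects, e.g. a setitem


def eliminate_dead_code(nodes, outputs):
    """Drop nodes that are neither used downstream nor marked impure."""
    live = set(outputs)
    kept = []
    for node in reversed(nodes):  # visit users before producers
        if node.name in live or node.impure:
            kept.append(node)
            live.update(node.inputs)
    return [n.name for n in reversed(kept)]


def graph(setitem_impure):
    # "setitem" mutates "x" in place, but no node consumes its result.
    return [
        Node("x"),
        Node("val"),
        Node("setitem", inputs=("x", "val"), impure=setitem_impure),
        Node("out", inputs=("x",)),
    ]


print(eliminate_dead_code(graph(False), outputs=["out"]))
print(eliminate_dead_code(graph(True), outputs=["out"]))
```

In the toy graph nothing consumes the result of `setitem`, so a pass that looks only at users drops the mutation unless the node is flagged as impure.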

**Test Plan**
```
python -u -m pytest -s -v test/fx/test_dce_pass.py -k test_keep_setitem
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145714
Approved by: https://github.com/ezyang
2025-01-27 15:08:21 +00:00
0674ab7e33 solve apl dependency issue (#145215)
According to the [APL documentation](https://developer.arm.com/documentation/101004/2404/General-information/Arm-Performance-Libraries-example-programs), libraries ending with _mp are OpenMP multi-threaded libraries.

When a project is compiled with MSVC and the -openmp flag, the vcomp library (Visual C++ implementation of OpenMP) is used for runtime calls.

However, the current APL implementation uses the libomp.dll (LLVM) variant.

As a result, there are unexpected behaviors at runtime.

---

For Example:

```python
import torch

# Create a sparse tensor
# Input (Sparse Tensor):
# [[0, 1],
#  [1, 0]]
indices = torch.tensor([[0, 1], [1, 0]])
values = torch.tensor([1, 1], dtype=torch.float32)
size = torch.Size([2, 2])

sparse_tensor = torch.sparse_coo_tensor(indices, values, size)

# Convert sparse tensor to dense tensor
dense_tensor = sparse_tensor.to_dense()

# Expected Output (Dense Tensor):
# [[0, 1],
#  [1, 0]]
print("\nDense Tensor:")
print(dense_tensor)
```

However, it prints unexpected outputs such as:

```python
# [[0, 11],
#  [10, 0]]
```

The issue arises because the following code does not function as expected at runtime:

https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/ParallelOpenMP.h#L30

```c++
// Returns 1; however, since OpenMP is enabled, it should return the total number of threads.
int64_t num_threads = omp_get_num_threads();
```

---

At runtime, loading multiple OpenMP libraries (in this case `libomp` and `vcomp`) causes unexpected behavior.

So we changed the libraries from the `_mp` to the non-`_mp` versions and now use `vcomp` for OpenMP calls.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145215
Approved by: https://github.com/ozanMSFT, https://github.com/malfet

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
2025-01-27 13:02:16 +00:00
7b6029dcc2 Update slow tests (#145206)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145206
Approved by: https://github.com/pytorchbot
2025-01-27 11:40:39 +00:00
e6c1e6e20e simplify torch.utils.cpp_extension.include_paths; use it in cpp_builder (#145480)
While working on conda-forge integration, I needed to look at the way the include paths are calculated, and noticed an avoidable duplication between `torch/utils/cpp_extension.py` and `torch/_inductor/cpp_builder.py`. The latter already imports the former anyway, so simply reuse the same function.

Furthermore, remove long-obsolete include-paths. AFAICT, the `/TH` headers have not existed since pytorch 1.11.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145480
Approved by: https://github.com/ezyang
2025-01-27 07:19:42 +00:00
e90cf4abcf [inductor] Add some typing to common.py (#145691)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145691
Approved by: https://github.com/malfet
ghstack dependencies: #145690
2025-01-27 06:27:13 +00:00
ddae87f792 [inductor] Add some typing to simd.py (#145690)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145690
Approved by: https://github.com/malfet
2025-01-27 06:27:13 +00:00
71caac2b30 [MPSInductor] Add rand support (#145705)
Using Philox4 as PRNG

Test plan (other than CI)
Run
```python
import torch
from torch._inductor.utils import run_and_get_code
from contextlib import nullcontext

def foo(x):
   return x * torch.randn_like(x)

foo_c = torch.compile(foo)

x = torch.ones(100, 100, device="mps")

y = foo_c(x)

print(y.mean().item(), y.std().item())
for i in range(25):
  print(y[i].mean(), y[i].std())
```
And observe that printed values are close to 0 and 1

TODO: Better `randint` algorithm for large ranges

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145705
Approved by: https://github.com/dcci, https://github.com/jansel
2025-01-27 06:07:36 +00:00
ea141d8134 functional compiled autograd (#144707)
This PR squashes together the following commits:

https://github.com/pytorch/pytorch/pull/144115
https://github.com/pytorch/pytorch/pull/143417
https://github.com/pytorch/pytorch/pull/143405
https://github.com/pytorch/pytorch/pull/143387
https://github.com/pytorch/pytorch/pull/143304
https://github.com/pytorch/pytorch/pull/143296

This is a refactor of compiled autograd to use "functional autograd". The end goal is that it gets compiled autograd's initial capture to stop specializing on Tensor metadata, therefore allowing compiled autograd to better handle Tensor subclasses.

For more information, please read the commit messages for each PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144707
Approved by: https://github.com/bdhirsh, https://github.com/xmfan, https://github.com/jansel
2025-01-27 05:20:56 +00:00
87fdadde1d Remove FFT from stride incorrect ops (#145080)
I gotta say, the FFT implementation is completely insane, there's gotta be a better way to do this than repeatedly inplace restriding the output tensor. Anyway, this is a faithful translation of both the MKL and cuFFT paths to Python.

Fixes https://github.com/pytorch/pytorch/issues/135087

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145080
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #145530
2025-01-27 04:26:04 +00:00
b75afa2e2e [MPS] cholesky implementation (#145701)
Requested in #77764

Closed #144193  due to a lot of conflicts when rebasing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145701
Approved by: https://github.com/malfet
2025-01-27 01:53:03 +00:00
c6ad08357b pickler for GraphModule (#141659)
Pickling a GraphModule needs special handling to wrap things that normally can't be pickled. Async compile needs to pass them across a wire, so we need to be able to serialize them; this adds some helpers to enable that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141659
Approved by: https://github.com/jamesjwu
2025-01-26 19:29:13 +00:00
f3ddc08ddc Additional operators in operator benchmark (#145625)
The list of added operators:
add_, addcmul, arange, baddbmm…, bmm, clamp, div, div_, gelu, index_add, logical_and, mul_, sub_, topk, where

This pull request is the same as a previous one: https://github.com/pytorch/pytorch/pull/145121 which inadvertently got deleted while merging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145625
Approved by: https://github.com/jeffdaily
2025-01-26 19:20:02 +00:00
6a4fb4b615 Revert "Align CPU behavior with CUDA for ConvTranspose when out_channels=0 (#142859)"
This reverts commit cb814c0b961369a7ab154c58856c730cafaa2307.

Reverted https://github.com/pytorch/pytorch/pull/142859 on behalf of https://github.com/malfet due to It broke ROCM tests again, see 5cd2b34e82/1 ([comment](https://github.com/pytorch/pytorch/pull/142859#issuecomment-2614523822))
2025-01-26 17:49:05 +00:00
5cd2b34e82 [inductor] Adjust test_log_fp64 to only run when float64 is supported. (#145686)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145686
Approved by: https://github.com/malfet, https://github.com/jansel

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-26 15:58:19 +00:00
ed015143ef Set RUNPATH on CUDA and XPU tests (#144305)
#136627 fixed most cases where the test binaries' RUNPATH was not set correctly, with a few cases left.

This PR fixes the rest.

The binaries are found by running `auditwheel repair` on a wheel built with `BUILD_TEST=1`.

@malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144305
Approved by: https://github.com/malfet
2025-01-26 08:40:22 +00:00
c4523999a1 Fix incorrect type comparison (#145449)
Summary: This change was incorrectly made as part of #145166

Differential Revision: D68536221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145449
Approved by: https://github.com/bobrenjc93
2025-01-26 04:40:26 +00:00
09ae69a364 Revert "Fix type annotation of Linear.bias (#142326)"
This reverts commit 81e370fc6b90f9cb98c88f3173e738aba0dc650a.

Reverted https://github.com/pytorch/pytorch/pull/142326 on behalf of https://github.com/malfet due to This introduced a graph break and regressed inductor tests, see 73622fc5fa/1 ([comment](https://github.com/pytorch/pytorch/pull/142326#issuecomment-2614196349))
2025-01-26 03:41:00 +00:00
73622fc5fa Fix Throughputbenchmark issue (#144669)
Fixes [144461](https://github.com/pytorch/pytorch/issues/144461)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144669
Approved by: https://github.com/leslie-fang-intel, https://github.com/williamwen42, https://github.com/jansel
2025-01-26 03:37:20 +00:00
cb814c0b96 Align CPU behavior with CUDA for ConvTranspose when out_channels=0 (#142859)
Fixes https://github.com/pytorch/pytorch/issues/142466.
Remove the `weight.numel() != 0` check to align the behavior with CUDA for `ConvTranspose` when `out_channels=0`. After removing this check, the existing code is already able to give an empty output in such case.

Test plan:
```
python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cpu_float32
python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cuda_float32
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142859
Approved by: https://github.com/mingfeima, https://github.com/malfet
2025-01-26 01:56:40 +00:00
90448f0128 Output of nonzero is transposed, fix fake tensor (#144695)
Needs this companion executorch PR: https://github.com/pytorch/executorch/pull/7657

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144695
Approved by: https://github.com/bobrenjc93, https://github.com/albanD
2025-01-26 01:07:22 +00:00
76bec878da Remove unnecessary HPUHooksInterface method (#145272)
getDefaultHPUGenerator is no longer necessary
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145272
Approved by: https://github.com/ezyang
2025-01-26 01:06:34 +00:00
3cf7874ebe [MPS][BE] Implement bilineard2d as shader (#145581)
That significantly improves performance and addresses the correctness problem (to the extent permitted by reducing the precision of the scale factor computation to float32). The uint8 scaling algorithm mimics the CPU/Pillow implementation
569b785371/src/libImaging/Resample.c (L306-L309)
I.e. using fixed precision integral arithmetic and rounding results of horizontal interpolation back to integers before performing vertical one, which results in technically less accurate results.
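The effect of rounding between the two passes can be seen in a toy case. The sketch below is an assumption-laden illustration, not Pillow's code or the Metal shader: it uses a plain 2x2 box average instead of real bilinear weights, and all function names are hypothetical.

```python
def round_half_up(x):
    """Round a non-negative value to the nearest integer, ties going up."""
    return int(x + 0.5)


def downscale_float(img):
    """Average a 2x2 block in floating point, rounding once at the end."""
    (a, b), (c, d) = img
    return round_half_up((a + b + c + d) / 4)


def downscale_two_pass(img):
    """Average horizontally, round to integers, then average vertically."""
    rows = [round_half_up((left + right) / 2) for left, right in img]
    return round_half_up((rows[0] + rows[1]) / 2)


img = [[0, 1], [0, 0]]
single = downscale_float(img)
two_pass = downscale_two_pass(img)
print(single, two_pass)
```

The true mean is 0.25, which rounds to 0, while the two-pass variant rounds the intermediate 0.5 up to 1 and ends at 1. Matching a reference that rounds this way therefore means reproducing the intermediate rounding as well.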

But even with those changes, `atol`, `rtol` must be tweaked to `1, 0` when scale factor is `1/3` or `2/3` because of the difference of representation  of those values as floats and doubles.
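A quick way to see the representation difference, using only the standard library (the helper name `to_float32` is made up for this sketch):

```python
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest float32 and back."""
    return struct.unpack("f", struct.pack("f", x))[0]


exact = {}
for scale in (1 / 3, 2 / 3, 0.5):
    as_f32 = to_float32(scale)
    exact[scale] = (as_f32 == scale)
    print(scale, as_f32, exact[scale])
```

`0.5` survives the round trip unchanged, while `1/3` and `2/3` do not, so source coordinates computed from them can differ slightly between a float32 and a float64 implementation.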

Changes in the performance could be measured using the following script
```python
import torch
import time
import subprocess

def benchmark(device, dtype):
    # Create example inputs
    x = torch.testing.make_tensor(1, 1, 2048, 2048, device=device, dtype=dtype)
    sf = .5

    # Check output
    y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="bilinear")
    z = torch.nn.functional.interpolate(x.cpu(), scale_factor=sf, mode="bilinear")
    outputs_match = torch.allclose(y.cpu(), z)
    if not outputs_match:
       atol = (y.cpu() - z).abs().max()
       rtol = ((y.cpu() - z)[z!=0]/z[z!=0]).abs().max()
       print(f"atol={atol} rtol={rtol}")

    # Measure time manually
    start_time = time.time() * 1000
    for _ in range(1000):
        y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="bilinear")
    torch.mps.synchronize()
    end_time = time.time() * 1000
    manual_delta = (end_time - start_time)
    average_time = f"{manual_delta:6.1f}"

    return "True " if outputs_match else "False", average_time

outputs_match_list = []
average_time_list = []
for device in ["mps", "cpu"]:
    for dtype in [torch.float32, torch.float16, torch.bfloat16, torch.uint8]:
        outputs_match, average_time = benchmark(device, dtype)
        outputs_match_list.append(str(outputs_match))
        average_time_list.append(average_time)

brand_string = subprocess.check_output(['sysctl', '-n', 'machdep.cpu.brand_string']).decode("utf-8").strip()
print(f"\nBenchmarking Results (collected on {brand_string}):")
print("-"*40)
print("Device            :                MPS                 |               CPU")
print("Dtype             :   FP32  |  FP16  |  BF16  |   U8   |  FP32  |  FP16  |  BF16  |  U8")
print(f"Outputs Match     :  ", " |  ".join(outputs_match_list))
print(f"Average Time (us) :", "  |".join(average_time_list))
```

Benchmark results before
```
Benchmarking Results (collected on Apple M4 Pro):
----------------------------------------
Device            :                MPS                 |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |   U8   |  FP32  |  FP16  |  BF16  |  U8
Outputs Match     :   True  |  True  |  True  |  False |  True  |  True  |  True  |  True
Average Time (us) :  277.3  | 197.2  | 188.0  | 163.5  | 302.8  | 248.1  | 308.7  | 650.9
```
After (almost **100x** perf gain):
```
Benchmarking Results (collected on Apple M4 Pro):
----------------------------------------
Device            :                MPS                 |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |   U8   |  FP32  |  FP16  |  BF16  |  U8
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :    1.7  |   1.5  |   1.7  |   1.5  | 296.5  | 236.0  | 310.8  | 642.6
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145581
Approved by: https://github.com/Skylion007
ghstack dependencies: #145578
2025-01-25 21:09:46 +00:00
0afdee4c39 [dynamo] raise IndexError when inserting into a full deque (#139379)
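For reference, this is the plain CPython behavior that the title refers to, shown with the standard library only (how dynamo models it internally is not shown here):

```python
from collections import deque

d = deque([1, 2, 3], maxlen=3)

try:
    d.insert(0, 0)  # the deque is already at maxlen
    error = None
except IndexError as exc:
    error = exc

print(type(error).__name__, list(d))

# By contrast, append() on a full deque does not raise;
# it discards an item from the opposite end.
d.append(4)
print(list(d))
```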
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139379
Approved by: https://github.com/jansel
2025-01-25 18:04:49 +00:00
513f889a36 [Rocm][Inductor][CK] silence ck package not installed warning when CK backend is not used to autotune bmm (#145626)
As titled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145626
Approved by: https://github.com/coconutruben
2025-01-25 08:44:35 +00:00
c5216d2b6c [ca] add test_reset for 2.6 release validation (#145549)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145549
Approved by: https://github.com/atalman
2025-01-25 06:28:58 +00:00
bbe7f53218 Save integral tensor data for ET (#144508)
Summary:
et_replay uses random data to run operators; however, operators that use an index tensor to access memory won't work with random data. This usually ran into two exceptions: 1. illegal memory access because the index is out of range, which has been fixed with the environment variable ENABLE_PYTORCH_EXECUTION_TRACE_SAVE_INTEGRAL_TENSOR_RANGE to record the min/max value of index tensors. 2. unaligned memory access, since FBGEMM ops have special requirements for the memory layout.

To fix the second exception, ENABLE_PYTORCH_EXECUTION_TRACE_SAVE_INTEGRAL_TENSOR is added to allow the user to specify node names, separated by commas, so ET will save the integral tensor data for these nodes. The saved data will be used in et_replay.

Be careful to turn on this option since it will use more space to save the extra data.

Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA.test_execution_trace_record_integral_tensor_data_cuda

Differential Revision: D67989856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144508
Approved by: https://github.com/briancoutinho
2025-01-25 05:38:10 +00:00
3d506491b9 [inductor] Fix duplicate detection in _dynamic_scale_rblock (#145577)
Before this the code was doing nothing because Config doesn't define `__hash__` or `__eq__` (so it was based on object id).
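The pitfall is easy to reproduce in plain Python. In the sketch below, `PlainConfig` and `KeyedConfig` are hypothetical classes for illustration; they are not the actual `Config` type, and the frozen dataclass is only one possible way to get value semantics, not necessarily what this PR does.

```python
from dataclasses import dataclass


class PlainConfig:
    """Hypothetical config without __eq__/__hash__: compared by object identity."""

    def __init__(self, rblock, num_warps):
        self.rblock = rblock
        self.num_warps = num_warps


@dataclass(frozen=True)
class KeyedConfig:
    """Same fields, with value-based __eq__ and __hash__ from the frozen dataclass."""
    rblock: int
    num_warps: int


plain = {PlainConfig(64, 4), PlainConfig(64, 4)}
keyed = {KeyedConfig(64, 4), KeyedConfig(64, 4)}
print(len(plain), len(keyed))
```

The set of `PlainConfig` objects keeps both entries because they are distinct objects, so a set-based duplicate check silently never fires.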

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145577
Approved by: https://github.com/shunting314
ghstack dependencies: #142026
2025-01-25 04:58:54 +00:00
9007eb5f8e [inductor] Kernel memory analysis for use in heuristics (#142026)
This computes statistics about each kernel's memory usage that should allow us to write more precise heuristics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142026
Approved by: https://github.com/eellison
2025-01-25 04:58:54 +00:00
cc1ecead07 [Dynamo] Allow format() to handle int (#144956)
Fixes #144830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144956
Approved by: https://github.com/jansel
2025-01-25 04:12:45 +00:00
b2a0feac85 Update OSS nested tensor docs to focus on NJT (#145402)
Updated nested tensor docs to be NJT-centric (instead of NST-centric). They now include:
* High-level description of NST vs. NJT + a recommendation to use NJT
* General NJT construction / usage
* torch.compile() integration w/ dynamic shapes
* Common errors and how to fix them
* Contribution guide
* Data layout / shape information (with diagram)
* Links to more extensive tutorials involving Transformers / SDPA / FlexAttention

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145402
Approved by: https://github.com/soulitzer
2025-01-25 04:08:19 +00:00
392dc177a9 OpenReg: Refactor impl_registry (#145465)
Refactor impl_registry to use `driver.exec` as fallback.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145465
Approved by: https://github.com/albanD
2025-01-25 03:31:49 +00:00
6939a56e13 [autocast][pytorch] Support autocast for MTIA (#145627)
Summary: Add autocast support to MTIA

Reviewed By: egienvalue

Differential Revision: D68572548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145627
Approved by: https://github.com/egienvalue
2025-01-25 03:24:59 +00:00
ef60de07a0 [dynamo] Log guard latency (#145132)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145132
Approved by: https://github.com/ezyang
ghstack dependencies: #145509
2025-01-25 03:01:18 +00:00
42b8e233d9 serde unbacked bindings (#144894)
Adds unbacked bindings during deserialization. These are carried by a node's metadata, and map pending fresh unbacked symbols to paths to such symbols inside the corresponding example value carried by the node's metadata.

Since it is awkward to serialize paths, we only serialize the names of these symbols and reconstruct the paths on deserialization, using a shape env util. We also need to bump counters for unbacked symbols here, because the shape env util we use to create these symbols (when deserializing example values) doesn't do so, and not doing so makes later passes (like `run_decompositions`) crash because new unbacked symbols don't get new names.

This is enough for non-strict. For strict, the unbacked bindings and example values in node metadata can get out of sync, because of running AOTAutograd as an additional step after Dynamo. So we have to sync those back.

Differential Revision: [D68232274](https://our.internmc.facebook.com/intern/diff/D68232274/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144894
Approved by: https://github.com/pianpwk
2025-01-25 02:34:27 +00:00
5725462cd8 Update NJT linear_backward to return non-aliased tensor bias grad (#145399)
Fixes https://github.com/pytorch/pytorch/issues/141292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145399
Approved by: https://github.com/jbschlosser
ghstack dependencies: #145520, #145531, #145533
2025-01-25 00:58:04 +00:00
3a3e2cf90a Remove det_singular OpInfo (#145533)
Fixes https://github.com/pytorch/pytorch/issues/93045 https://github.com/pytorch/pytorch/issues/93044

From previous discussion https://github.com/pytorch/pytorch/issues/93045#issuecomment-1477674083 the resolution is that we're okay with removing this.

Some older attempts:
- https://github.com/pytorch/pytorch/pull/102581
- https://github.com/pytorch/pytorch/pull/109249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145533
Approved by: https://github.com/lezcano, https://github.com/malfet
ghstack dependencies: #145520, #145531
2025-01-25 00:58:03 +00:00
c7ca1df37e Disable slow gradcheck for nn.Transformer ModuleInfo (#145531)
Fixes https://github.com/pytorch/pytorch/issues/117140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145531
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #145520
2025-01-25 00:58:03 +00:00
9e0ee152e5 Fix allow_mutation_on_saved_tensors for inplace foreach (#145520)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145520
Approved by: https://github.com/albanD
2025-01-25 00:58:03 +00:00
clr
b4fe3c159d inductor: Explicitly test that torch.compile(option=...) does something (#145321)
This would have prevented https://github.com/pytorch/pytorch/pull/139833 from dropping the triggers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145321
Approved by: https://github.com/jansel
2025-01-25 00:48:26 +00:00
efebec5ef5 [dcp] Add ZStandard transformer (#143360)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143360
Approved by: https://github.com/saumishr, https://github.com/albanD
ghstack dependencies: #145528
2025-01-25 00:14:07 +00:00
f2ad2cdf1c [utils] add try_import method for importing optional modules (#145528)
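The general pattern behind such a helper can be sketched as follows. This is an assumption about the shape of the idea only: the actual name, signature, and return convention of the helper added by this PR are not shown in this log.

```python
import importlib


def try_import(name):
    """Hypothetical helper: return the imported module, or None if unavailable."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


json_mod = try_import("json")
missing = try_import("surely_not_an_installed_module_xyz")
print(json_mod is not None, missing is None)
```

Callers can then branch on `None` instead of wrapping every optional dependency in its own `try`/`except` block.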
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145528
Approved by: https://github.com/albanD
2025-01-25 00:14:07 +00:00
f3304571fc [BE][Ez]: FURB148 - remove useless enumerate calls (#145619)
Remove useless enumerate calls

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145619
Approved by: https://github.com/drisspg
2025-01-24 23:37:15 +00:00
0741963e01 [CI][CUDA][Blackwell] sm_\d\d no longer matches sm_100. (#145641)
Therefore making it sm_\d+

Fixes this unit test failure: python test/test_cpp_extensions_jit.py -k TestCppExtensionJIT.test_jit_cuda_archflags
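The difference between the two patterns is shown below. The use of `fullmatch` is an assumption about how the pattern is applied; the test itself may match it differently.

```python
import re

two_digits = re.compile(r"sm_\d\d")
any_digits = re.compile(r"sm_\d+")

archs = ["sm_90", "sm_100", "sm_120"]
old_matches = [a for a in archs if two_digits.fullmatch(a)]
new_matches = [a for a in archs if any_digits.fullmatch(a)]
print(old_matches, new_matches)
```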

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145641
Approved by: https://github.com/eqy, https://github.com/malfet
2025-01-24 23:20:22 +00:00
4cc5e880f9 Add accuracy issue support in AOTI Minifier (#145539)
Summary:

Add three more repro levels for AOTI minifier (level 2 already exists). They are the same as the existing dynamo minifier repro levels.

Now AOTI minifier can minify and repro programs that have numerical accuracy issues as well.

1: Dumps the original graph out to repro.py if compilation fails
2: Dumps a minifier_launcher.py if aoti fails.
3: Always dumps a minifier_launcher.py. Good for segfaults.
4: Dumps a minifier_launcher.py if the accuracy fails.

Refactor AOTI minifier unit tests to be cleaner and better re-use the existing minifier testing code. We do not need to manually patch {"aot_inductor.dump_aoti_minifier": True} to each test now, this config is generated in the test code.

Differential Revision: D68294638

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145539
Approved by: https://github.com/desertfire
2025-01-24 23:07:19 +00:00
5b988ac4fa [Easy] Replace paper description with link to make a concise description. (#145031)
The descriptions on the [Transformer](https://pytorch.org/docs/main/generated/torch.nn.Transformer.html), [TransformerEncoderLayer](https://pytorch.org/docs/main/generated/torch.nn.TransformerEncoderLayer.html), and [TransformerDecoderLayer](https://pytorch.org/docs/main/generated/torch.nn.TransformerDecoderLayer.html) pages contain author and paper details, which seem redundant for users who just want to know how to use the modules. Replace them with a link to the paper so users can read the details there if they want to learn more.

**Test Result**

**Before**
![image](https://github.com/user-attachments/assets/678402b1-e759-402c-b56b-e24f63dc8490)
![image](https://github.com/user-attachments/assets/ca191734-f2ce-493f-bf34-2d7046a9868f)
![image](https://github.com/user-attachments/assets/10f55083-6eb6-4b1c-9a77-579f0c4c56ed)

**After**
![image](https://github.com/user-attachments/assets/020f81ca-d89b-47d1-a7a9-cae1893df968)
![image](https://github.com/user-attachments/assets/5b9b34df-b892-4d71-8cdb-df18380b2744)
![image](https://github.com/user-attachments/assets/b3348da2-842a-4037-bad3-f23687503cf8)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145031
Approved by: https://github.com/mikaylagawarecki
2025-01-24 23:01:02 +00:00
57591edca1 [mps/inductor] Add support for erfinv. (#145643)
After several rounds of refactoring, this seems to be done now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145643
Approved by: https://github.com/malfet, https://github.com/jansel
2025-01-24 22:55:44 +00:00
46e06e1d09 Avoid data-dependent errors in NJT tests via capture_scalar_outputs=True (#144588)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

There are several xfails related to data-dependent errors in torch.compile. This PR sets `torch._dynamo.config.capture_scalar_outputs=True` to avoid these, which tends to exercise unbacked SymInt logic and will require `torch._check()`-related fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144588
Approved by: https://github.com/soulitzer
ghstack dependencies: #144586, #144587
2025-01-24 22:45:01 +00:00
81e370fc6b Fix type annotation of Linear.bias (#142326)
Currently the `bias` attribute of `torch.nn.Linear` (and `Bilinear`) is typed incorrectly, because it relies on the implicit `Module.__getattr__` which types it as `Tensor | Module`. This has two issues:

- It hides the fact that `bias` is optional, and can be `None`, which in turn can hide actual bugs on user side.
- It blurs the type due to having `Module` in the union, which can require an unnecessary `isinstance(linear.bias, Tensor)` on the user side.

This PR types the `bias` attribute explicitly to fix these issues.

CC @ezyang @Skylion007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142326
Approved by: https://github.com/ezyang
2025-01-24 22:43:52 +00:00
70577d335e [ATen][CUDA][Transformers] Add Blackwell support to SDPA (#145602)
This PR adds sm_100 and sm_120 archs to support SDPA (Flash Attention and Memory Efficient Attention) on Blackwell machines.

Special thanks to @Fuzzkatt for co-authoring these changes!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145602
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/eqy, https://github.com/malfet

Co-authored-by: Patrick Wang <22803332+Fuzzkatt@users.noreply.github.com>
2025-01-24 22:27:39 +00:00
5bf5ce0e15 Modify enable logic of COLLECTIVE_COMM profiler activity type (#145478)
Summary:
Since `KINETO_NCCL_PROFILER` flag is not used anymore (we are moving from linking the profiler during compile time to loading it dynamically), we change the logic for enabling the profiler to use `TORCH_PROFILER_ENABLE_COLLECTIVE_PROFILING` environment variable for NCCL Collective Communication Profiler.

For HCCL, we still keep the same logic

Test Plan: See  https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/gpu_traces/tree/traces/clientAPI/0/1737579474/devvm29927.cln0/nccl_activities_2387985.json.gz for sample trace on nccl-profiler

Differential Revision: D68515945

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145478
Approved by: https://github.com/sraikund16
2025-01-24 22:21:09 +00:00
d4171b724e Let tensor_a.new_tensor() be on tensor_a.device by default (#144958)
Fixes #144957
Closes #73838 cc @albanD @ezyang

Currently, `tensor_a.new_tensor()` returns a CPU tensor no matter which device `tensor_a` is on. This differs from the documentation and is a side effect of https://github.com/pytorch/pytorch/pull/41984.

See #144957 for how the current logic breaks dynamo.

This PR restores the documented behavior and adds tests for `new_tensor`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144958
Approved by: https://github.com/ezyang
2025-01-24 22:12:31 +00:00
2a70de7e92 [CUDA] Change slim-wheel libraries load order (#145638)
There is no libnvjitlink in CUDA-11.x, so attempting to load it first aborts execution and prevents the script from preloading nvrtc

Fixes issues reported in https://github.com/pytorch/pytorch/pull/145614#issuecomment-2613107072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145638
Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-24 22:00:56 +00:00
FEI
615bdd9c81 Improve the caching allocator test for raw alloc (#145269)
1. Prevent blocks allocated by torch._C._cuda_cudaCachingAllocator_raw_alloc from affecting torch.cuda.empty_cache() in other unit tests
2. Additionally, test the changes to raw_delete in https://github.com/pytorch/pytorch/pull/131114

@jeffdaily @albanD @houseroad @eqy @aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145269
Approved by: https://github.com/albanD, https://github.com/eqy, https://github.com/jeffdaily
2025-01-24 21:07:17 +00:00
d79c6f4946 Improve torchrun documentation (#144354)
Fixes #142042:
- #142042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144354
Approved by: https://github.com/c-p-i-o, https://github.com/H-Huang
2025-01-24 20:40:05 +00:00
caf60395f4 [torchbench] Increase tolerance for amp only poolformer_m36 (#145375)
https://github.com/pytorch/pytorch/issues/144893

```
python benchmarks/dynamo/timm_models.py --only poolformer_m36 --accuracy --no-translation-validation --training --amp --device cuda --backend inductor
```

`--float32`, `--bfloat16` - passes the accuracy
`--disable-cudagraph` does not change the result

The accuracy check fails only for `--amp`, giving a res_error of `0.048` on a 1-element result Tensor.

This fails with `0.01` tolerance.

If the tolerance is increased to 0.04, it passes. I have not reproduced "eager_two_runs_differ" on H100.
I think this is the true distribution of results with `--amp`, so increasing the tolerance to 0.04 for the amp case only makes it pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145375
Approved by: https://github.com/desertfire
2025-01-24 19:56:21 +00:00
457facf7e2 [caffe2] Use the manifold cache backend as the default (#144773)
Test Plan: CI

D68155591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144773
Approved by: https://github.com/izaitsevfb
2025-01-24 19:48:34 +00:00
c16866a582 [BE] mv test/inductor_skips/* to test/inductor_expected_failures/ (#145572)
Summary: I think skipping these tests is suboptimal. If we categorize as expected failures, then we'll see test failures when they start passing, which means they're more likely to be removed. As a skip, they quietly continue to skip.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145572
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2025-01-24 19:41:38 +00:00
cf063d41f8 Spruce up docs for emulate_precision_casts (#145579)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145579
Approved by: https://github.com/gchanan
2025-01-24 19:28:37 +00:00
96149a201a [Inductor] be able to disable cache for test (#141195)
Make unit tests respect TORCHINDUCTOR_FX_GRAPH_CACHE=0. This is helpful when I want compilation to actually happen during testing. Setting INDUCTOR_TEST_DISABLE_FRESH_CACHE to 1 is not the same, since that causes the generated wrapper file to be deleted, but we may want to check those files after running a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141195
Approved by: https://github.com/masnesral, https://github.com/desertfire
2025-01-24 19:15:55 +00:00
2fd2a950e6 [torchbench] Add meta function for _cudnn_rnn_flatten_weight (#145488)
https://github.com/pytorch/pytorch/issues/144989

This fixes tts_angular model on torchbench for `--export-aot-inductor`

I put meta function in cpp, as shape calculation requires cudnn API calls.
I've extracted shape calculation to be used in implementation as this logic has some non-trivial actions and comments.

```
└─ $ python benchmarks/dynamo/torchbench.py --only tts_angular --accuracy --no-translation-validation --inference --bfloat16 --export-aot-inductor --disable-cudagraphs --device cuda
loading model: 0it [00:00, ?it/s]WARNING:common:Model tts_angular does not support bfloat16, running with amp instead
loading model: 0it [00:01, ?it/s]
WARNING:common:Model tts_angular does not support bfloat16, running with amp instead
cuda eval  tts_angular
WARNING:common:Model tts_angular does not support bfloat16, running with amp instead
pass
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145488
Approved by: https://github.com/eqy, https://github.com/zou3519
2025-01-24 19:08:14 +00:00
ad36f4f42c Revert "Add generator parameter to rand*_like functions (#136780)"
This reverts commit c7b2f7dd142fc97c8ce4ad7ad591687cf295fcda.

Reverted https://github.com/pytorch/pytorch/pull/136780 on behalf of https://github.com/izaitsevfb due to internal regression ([comment](https://github.com/pytorch/pytorch/pull/136780#issuecomment-2613191933))
2025-01-24 19:00:21 +00:00
a989a0b13a [NFC] Fix some minor typos. (#145599)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145599
Approved by: https://github.com/Skylion007
2025-01-24 18:58:59 +00:00
6cda572c98 [mps] Hoist erfinv logic out of the kernel in preparation for moving. (#145568)
Will be used in inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145568
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-01-24 18:51:09 +00:00
8eea554332 [Dynamo] Fix names collisions with foreach decomps (#145479)
Fixes https://github.com/pytorch/pytorch/issues/138698

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145479
Approved by: https://github.com/yanboliang
2025-01-24 18:46:58 +00:00
e57cdb8402 [ROCm] trunk.yml only runs pre-merge via ciflow/trunk label (#145629)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145629
Approved by: https://github.com/jeffdaily
2025-01-24 18:31:33 +00:00
b8087747f5 [inductor][BE] Enable test_cpu_cpp_wrapper in fbcode (#145373)
Differential Revision: D68278174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145373
Approved by: https://github.com/Skylion007
2025-01-24 17:59:13 +00:00
74cfb4f364 [dynamo][refactor] Move collections.namedtuple out of SkipFunctionVariable (#145547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145547
Approved by: https://github.com/zou3519
ghstack dependencies: #145519
2025-01-24 17:39:33 +00:00
97c0b7cb0a Add unique identifier to bmm thread_mm functions (#145303)
Summary:
The bmm template generates code like this

```
template<bool accum>
void cpp_fused_bmm_66_micro_gemm(...) {
    ...
}

void single_thread_mm() {
    ...
    cpp_fused_bmm_66_micro_gemm(...)
    ...
}

void threaded_mm() {
    ...
    cpp_fused_bmm_66_micro_gemm(...)
    ...
}

void cpp_fused_bmm_66(...)
{
    ...
    single_thread_mm(...);
    ...
    threaded_mm(...);
    ...
}
```

The generated `fused_bmm` and `fused_bmm_microgemm` functions both have unique identifiers added to their names, but the `single_thread_mm` and `threaded_mm` functions do not.

This diff adds unique identifiers to those generated functions as well. The identifier is based on the kernel name, so for the example above we would generate a name like `cpp_fused_bmm_66_single_thread_mm()`.
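
The naming rule described above can be sketched in a few lines of Python. This is only an illustration of prefixing helper names with the kernel name; `qualify_helper_names` is a hypothetical helper, not the function used by the bmm template.

```python
def qualify_helper_names(kernel_name: str, helpers: list[str]) -> dict[str, str]:
    """Map each generic helper name to a kernel-specific one (hypothetical helper)."""
    return {helper: f"{kernel_name}_{helper}" for helper in helpers}


names = qualify_helper_names("cpp_fused_bmm_66", ["single_thread_mm", "threaded_mm"])
print(names["single_thread_mm"])  # cpp_fused_bmm_66_single_thread_mm
print(names["threaded_mm"])       # cpp_fused_bmm_66_threaded_mm
```

Because every helper inherits the kernel's own identifier, two bmm kernels emitted into the same file no longer define helpers with the same name.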

Differential Revision: D68364772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145303
Approved by: https://github.com/leslie-fang-intel, https://github.com/frost-intel, https://github.com/hl475
2025-01-24 17:35:50 +00:00
547c18ee9f Add Torchao docs link to Pytorch libraries (#145412)
Add Torchao docs link to the libraries section in torch docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145412
Approved by: https://github.com/svekars
2025-01-24 17:11:20 +00:00
ce371ab4c6 [ROCm] Create inductor-rocm-mi300 (#145621)
- Adds an mi300 inductor workflow to main.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145621
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-24 17:04:17 +00:00
c0861d092c [PGNCCL] Add an API to get the status/error code at the PG level (#144498)
Summary:
This PR is basically a replacement of
https://github.com/pytorch/pytorch/pull/140087, which caused a perf
drop due to frequent TCPStore checks in the watchdog thread. The fix is to
move the TCPStore check to the monitoring thread.

If the PG is unhealthy, the user should be able to get the type of error,
e.g., timeout, NCCL error or remote error.

This API applies at the PG level, in contrast to the
work.get_future_result() API, which applies at the Work level.
Error detection at the PG level makes it much more convenient for users to
handle a PG failure as a whole, e.g., by restarting the PG.

Error handling at the work level is still useful for users to attach
work-specific context and debug the root cause of the specific failing
work/collective.

Note that it is critical for all ranks in the PG to be notified about an
error as soon as it occurs, so we introduce an errorType of
REMOTE_ERROR, which is 'broadcast' from a src rank (which detects a
local error) to all other ranks in the PG. The broadcast is currently done
through TCPStore.
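
As a rough illustration of the propagation scheme described above, here is a toy Python model. Everything in it is an assumption made for the sketch: `ErrorType`, `FakeStore`, `RankMonitor` and the store key are hypothetical stand-ins, and a plain dict plays the role of TCPStore. It is not the ProcessGroupNCCL implementation.

```python
from enum import Enum


class ErrorType(Enum):
    # Hypothetical error categories, named after the ones mentioned above.
    SUCCESS = 0
    TIMEOUT = 1
    NCCL_ERROR = 2
    REMOTE_ERROR = 3


class FakeStore(dict):
    """In-memory stand-in for a shared key-value store such as TCPStore."""


class RankMonitor:
    def __init__(self, rank, store):
        self.rank = rank
        self.store = store
        self.status = ErrorType.SUCCESS

    def report_local_error(self, error):
        # The detecting rank records its own error and publishes a marker.
        self.status = error
        self.store.setdefault("pg_error_src_rank", self.rank)

    def poll(self):
        # Meant to be called periodically from a monitoring thread.
        src = self.store.get("pg_error_src_rank")
        if src is not None and src != self.rank and self.status is ErrorType.SUCCESS:
            self.status = ErrorType.REMOTE_ERROR
        return self.status


store = FakeStore()
monitors = [RankMonitor(rank, store) for rank in range(3)]
monitors[1].report_local_error(ErrorType.TIMEOUT)
statuses = [m.poll().name for m in monitors]
print(statuses)  # ['REMOTE_ERROR', 'TIMEOUT', 'REMOTE_ERROR']
```

The rank that detected the failure keeps its specific error type, while every other rank only learns that some remote rank failed, which is enough to decide to restart the whole PG.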

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144498
Approved by: https://github.com/kwen2501
2025-01-24 16:47:32 +00:00
9132f4b7ce [dynamo][guards] Log guard latency to tlparse (#145509)
Example
![image](https://github.com/user-attachments/assets/1503ee59-ff35-46d9-9b61-16352a4a30e2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145509
Approved by: https://github.com/ezyang
2025-01-24 16:33:29 +00:00
1335882b2a If mypy fails it should report the error back to lintrunner (#145550)
This happened to me because I had a bad LD_LIBRARY_PATH and mypy was failing to run (.so load error) - but lintrunner was silent about the underlying problem.

Differential Revision: [D68593081](https://our.internmc.facebook.com/intern/diff/D68593081)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145550
Approved by: https://github.com/bobrenjc93, https://github.com/Skylion007
2025-01-24 15:40:30 +00:00
7c314bfed4 [Intel GPU] Add TORCH_API macro to export symbol NestedTensor_to_mask for libtorch_xpu (#145467)
Part of https://github.com/intel/torch-xpu-ops/issues/1141.

The `TORCH_API` macro is added to export the symbol `NestedTensor_to_mask`, which is needed by libtorch_xpu for `NestedTensor_softmax_dropout_xpu`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145467
Approved by: https://github.com/guangyey, https://github.com/ezyang
2025-01-24 15:38:46 +00:00
5d24a9a274 Advance docker release latest version to cuda 12.4 (#145566)
Fix the latest tag in ghcr.io to point to the CUDA 12.4 docker image. TODO: add this step to https://github.com/pytorch/builder/blob/main/CUDA_UPGRADE_GUIDE.MD

We will need to check whether this can be automated, e.g. by introducing a cuda_stable variable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145566
Approved by: https://github.com/nWEIdia, https://github.com/kit1980, https://github.com/malfet
2025-01-24 15:27:25 +00:00
5c64aaea40 [triton] Update triton pin to include warp specialization support (#145120)
The warp specialization work has been landed to the triton rc/3.2.x branch as b2684bf3b0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145120
Approved by: https://github.com/bertmaher
2025-01-24 14:45:12 +00:00
bc62930765 Work around buggy use_const_ref_for_mutable_tensors (#145530)
See https://github.com/pytorch/pytorch/issues/145522 for context

This doesn't fix the problem with use_const_ref_for_mutable_tensors and the boxed wrapper, instead it just gets all of our out kernels off of this flag so that the mutable matching pattern works correctly. I also add a check in torchgen to prevent people from making this mistake in the future.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145530
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2025-01-24 14:38:49 +00:00
9d6927715f Revert "Fix triton masked loading for non-block tl.loads (#144782)"
This reverts commit 31c2f36989e35ccf023a8e35c4bc21aca077d344.

Reverted https://github.com/pytorch/pytorch/pull/144782 on behalf of https://github.com/ezyang due to This regresses compile time for one of our internal models by 20%, internal xref https://fb.workplace.com/groups/1075192433118967/posts/1591490218155850 ([comment](https://github.com/pytorch/pytorch/pull/144782#issuecomment-2612660287))
2025-01-24 14:28:48 +00:00
cyy
6a35d9aaa4 Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143806
Approved by: https://github.com/kwen2501
2025-01-24 12:22:13 +00:00
f08b9bc7e4 [WIP] Move XNNPACKQuantizer from PyTorch to ExecuTorch (#144940)
Summary:
This replicates XNNPACKQuantizer from PyTorch to ExecuTorch.

Rationale:
The main motivation is to avoid a PyTorch pin update in OSS after each XNNPACKQuantizer update, which can be rather frequent.
Other impact and considerations:
The PT2E flow (which lives in PyTorch) relies heavily on XNNPACKQuantizer as an "example" quantizer implementation and, more importantly, for tests. For now, we will keep torch.ao.quantization.xnnpack_quantizer as is but mark it as not BC and deprecated, to discourage new dependencies on it.
Other OSS repositories using XNNPACKQuantizer from PyTorch now have to take an additional dependency on ExecuTorch.

Differential Revision: D68191752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144940
Approved by: https://github.com/jerryzh168, https://github.com/mcr229
2025-01-24 10:06:07 +00:00
d3989ca636 Add multi env variable support to configs (#145288)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145288
Approved by: https://github.com/c00w
2025-01-24 10:04:24 +00:00
10bdd0a1cc [BE][export] Fix hop tests with flaky memory leak (#145391)
Summary:
As title. Added `torch._dynamo.reset()` for each test

This should fix several flaky tests in `test_hop.py` such as https://github.com/pytorch/pytorch/issues/139073

Test Plan:
```
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 python test/export/test_hop.py TestHOPCUDA.test_serialize_export_scan_simple_cuda_float32
```

Differential Revision: D68506280

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145391
Approved by: https://github.com/ydwu4
2025-01-24 09:53:21 +00:00
72da0a8a42 [Submodule] Add flash as third-party submodule [Prep for later PRs] (#145502)
# Context

Prototyped here: https://github.com/pytorch/pytorch/pull/144120, we are going to make flash-attention a 3rd party submodule. We will then use the c++ sources and include into our build of libtorch.so

Making this work requires both external and internal changes. Because of the internal changes we need to co-dev, and in the co-dev environment I haven't found a way to sync submodule changes together with internal-only changes.

This is unused for now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145502
Approved by: https://github.com/Skylion007
2025-01-24 09:21:41 +00:00
d62e900d8c [CI][CUDA][MultiGPU][Regression] Skip a failure due to https://github.com/pytorch/pytorch/issues/139520 (#145318)
Related: https://github.com/pytorch/pytorch/issues/139520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145318
Approved by: https://github.com/eqy
2025-01-24 06:58:05 +00:00
0e98b26b28 [CI][CUDA][Dynamic Shape] xfail: DynamicShapesCodegenGPUTests.test_linspace4_dynamic_shapes_cuda (#145204)
python test/inductor/test_torchinductor_codegen_dynamic_shapes.py DynamicShapesCodegenGPUTests.test_linspace4_dynamic_shapes_cuda

failed to generate triton kernels, causing assert failures on 2x H100 systems (and 2x Grace H100 systems).

Failures like below:

Finline_call []
stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('fxgraph_cache_miss', 1)]
aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]

FAIL: test_linspace4_dynamic_shapes_cuda (__main__.DynamicShapesCodegenGPUTests.test_linspace4_dynamic_shapes_cuda)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/torch/testing/_internal/common_utils.py", line 3114, in wrapper
    method(*args, **kwargs)
  File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 12212, in new_test
    return value(self)
           ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/testing.py", line 420, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 2603, in test_linspace4
    self.common(fn, (torch.Tensor([]),))
  File "/opt/pytorch/pytorch/test/inductor/test_torchinductor_codegen_dynamic_shapes.py", line 424, in common
    return check_codegen(
           ^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/test/inductor/test_torchinductor_codegen_dynamic_shapes.py", line 82, in check_codegen
    self.assertTrue("def triton" in code, f"Failed to find triton kernel\n{code}")
AssertionError: False is not true : Failed to find triton kernel

# AOT ID: ['0_inference']
from ctypes import c_void_p, c_long, c_int
import torch
import math
import random
import os
import tempfile
from math import inf, nan
from torch._inductor.hooks import run_intermediate_hooks
from torch._inductor.utils import maybe_profile
from torch._inductor.codegen.memory_planning import _align as align
from torch import device, empty_strided
from torch._inductor.async_compile import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels
from torch._inductor.codegen.multi_kernel import MultiKernelCall

aten = torch.ops.aten
inductor_ops = torch.ops.inductor
_quantized = torch.ops._quantized
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu
empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda
empty_strided_xpu = torch._C._dynamo.guards._empty_strided_xpu
reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor
alloc_from_pool = torch.ops.inductor._alloc_from_pool
async_compile = AsyncCompile()
empty_strided_p2p = torch._C._distributed_c10d._SymmetricMemory.empty_strided_p2p

async_compile.wait(globals())
del async_compile

def call(args):
    with torch.cuda._DeviceGuard(1):
        torch.cuda.set_device(1)
        buf0 = empty_strided_cuda((0, ), (1, ), torch.float32)
    return (buf0, )

def benchmark_compiled_module(times=10, repeat=10):
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    fn = lambda: call([])
    return print_performance(fn, times=times, repeat=repeat)

if __name__ == "__main__":
    from torch._inductor.wrapper_benchmark import compiled_module_main
    compiled_module_main('None', benchmark_compiled_module)

To execute this test, run the following from the base repo dir:
    python test/inductor/test_torchinductor_codegen_dynamic_shapes.py DynamicShapesCodegenGPUTests.test_linspace4_dynamic_shapes_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145204
Approved by: https://github.com/eellison
2025-01-24 06:57:35 +00:00
817fd14714 [BE] Type annotation for _inductor/dependencies.py (#145311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145311
Approved by: https://github.com/eellison
2025-01-24 06:32:48 +00:00
2ce70da96c [cp] override compute_log_sumexp to True for aten._scaled_dot_product_efficient_attention.default if False (#145421)
## Description
Our current CP doesn't support efficient attention when `compute_log_sumexp=False`. `compute_log_sumexp` is `False` only when `requires_grad=False`, and since PP's [shape inference](d95a6babcc/torch/distributed/pipelining/stage.py (L1387)) happens under a `torch.no_grad()` context, we need to override `compute_log_sumexp` to `True` in our CP attention implementation.

## Test
- Test PP+FSDP+CP w/ `mixed_precision = "float32"` in torchtitan

- `pytest test/distributed/tensor/test_attention.py -s -k test_ring_attention_sdpa`

Before:
<img width="1880" alt="image" src="https://github.com/user-attachments/assets/872ff583-295e-4751-a280-cf7f2d41c61a" />

After:
<img width="2988" alt="image" src="https://github.com/user-attachments/assets/4bdcc2e5-22a5-427a-91a5-82206d5bd78f" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145421
Approved by: https://github.com/H-Huang, https://github.com/tianyu-l
2025-01-24 06:17:54 +00:00
53fc921ce2 [dynamo][trace-rules-cleanup] Remove functools from the Builtins skiplist (#145519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145519
Approved by: https://github.com/yanboliang, https://github.com/zou3519
2025-01-24 06:02:03 +00:00
9752c7c1c8 [CD] Fix slim-wheel cuda_nvrtc import problem (#145582)
Similar fix as: https://github.com/pytorch/pytorch/pull/144816

Fixes: https://github.com/pytorch/pytorch/issues/145580

Found during testing of https://github.com/pytorch/pytorch/issues/138340

Note that both nvrtc and nvjitlink exist for CUDA 11.8, 12.4 and 12.6, so we can safely remove the if statement. Preloading can apply to all supported CUDA versions.

CUDA 11.8 path:
```
(.venv) root@b4ffe5c8ac8c:/pytorch/.ci/pytorch/smoke_test# ls /.venv/lib/python3.12/site-packages/torch/lib/../../nvidia/cuda_nvrtc/lib
__init__.py  __pycache__  libnvrtc-builtins.so.11.8  libnvrtc-builtins.so.12.4  libnvrtc.so.11.2  libnvrtc.so.12
(.venv) root@b4ffe5c8ac8c:/pytorch/.ci/pytorch/smoke_test# ls /.venv/lib/python3.12/site-packages/torch/lib/../../nvidia/nvjitlink/lib
__init__.py  __pycache__  libnvJitLink.so.12
```

Test with rc 2.6 and CUDA 11.8:
```
python cudnn_test.py
2.6.0+cu118
---------------------------------------------SDPA-Flash---------------------------------------------
ALL GOOD
---------------------------------------------SDPA-CuDNN---------------------------------------------
ALL GOOD
```

Thank you @nWEIdia for discovering this issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145582
Approved by: https://github.com/nWEIdia, https://github.com/eqy, https://github.com/kit1980, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-24 04:47:57 +00:00
732c4998f3 [NVIDIA] Full Family Blackwell Support codegen (#145436)
More references:
https://github.com/NVIDIA/nccl

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145436
Approved by: https://github.com/ezyang, https://github.com/drisspg
2025-01-24 04:36:00 +00:00
c184055743 [BE] Use value_or in layer_norm.cpp (#145417)
Now that we have proper optional, no need to do `if (has_value) value else default_value;`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145417
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2025-01-24 04:02:23 +00:00
4799ebf326 [MPS][BE] Turn bicubic2d into generic metal template (#145578)
In preparation for more metal shaders to come
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145578
Approved by: https://github.com/Skylion007
2025-01-24 04:01:23 +00:00
68a1505985 serde and_ operator (#145506)
Differential Revision: [D68565887](https://our.internmc.facebook.com/intern/diff/D68565887/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145506
Approved by: https://github.com/zhxchen17, https://github.com/Skylion007
2025-01-24 03:48:03 +00:00
29ddf9a63e Document dispatch trace build flag (#145517)
Ok, the build flag seems to have been broken for a while since the function it calls doesn't exist anymore.
Repurposed it to enable dispatcher printing (which requires a full (and slow) debug build otherwise).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145517
Approved by: https://github.com/bdhirsh
2025-01-24 03:19:39 +00:00
a40ead1fd6 Don't fail if fresh_inductor_cache fails to clean up its tmp dir. (#145513)
Summary: I see we have a test failure due to an error removing the tmp dir: https://github.com/pytorch/pytorch/issues/141761. Seems like we should not raise an exception for this case in general. Also, let's clean up the exception handling related to windows. The comment makes it sound like we want to specifically ignore failures cleaning up, but the current impl is swallowing all exceptions.
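
The intended behavior can be sketched with the standard library alone. This is a minimal illustration of "tolerate cleanup failures, but nothing else", assuming a hypothetical context manager named `fresh_tmp_dir`; it is not the `fresh_inductor_cache` implementation.

```python
import contextlib
import logging
import os
import shutil
import tempfile

log = logging.getLogger(__name__)


@contextlib.contextmanager
def fresh_tmp_dir():
    """Yield a temporary directory; log, rather than raise, if removal fails."""
    path = tempfile.mkdtemp()
    try:
        yield path
    finally:
        try:
            shutil.rmtree(path)
        except OSError as exc:
            # Only cleanup failures are tolerated; errors raised inside the
            # `with` body still propagate to the caller.
            log.warning("failed to remove %s: %s", path, exc)


with fresh_tmp_dir() as d:
    # Simulate the directory disappearing before cleanup runs.
    shutil.rmtree(d)
print(os.path.exists(d))  # False
```

Catching `OSError` only around the removal keeps the handler narrow, in contrast to an implementation that swallows every exception.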

Fixes #141761

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145513
Approved by: https://github.com/eellison
2025-01-24 03:17:03 +00:00
36fcf98db6 [cutlass backend tests] Manually clear cache, test more tests in fbcode and limit configs in some tests (#145545)
Summary:
Manually clear cache:
You want to clear the cache in most tests. Otherwise the link command won't work: multiple .o files are left behind and you get something like `ld.lld: error: duplicate symbol: cuda_fused_0`.

test more tests in fbcode:
A few tests have been skipped in fbcode. Unskip them.

limit configs in some tests:
to reduce time spent on each test

Differential Revision: D68584071

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145545
Approved by: https://github.com/coconutruben, https://github.com/ColinPeppler
2025-01-24 03:06:59 +00:00
386650353b [ARM] Fix bf32 and tf32 precision for tensordot unit test (#141136)
Fixes unit test failure on aarch64 ( neoverse-v1 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141136
Approved by: https://github.com/malfet
2025-01-24 02:59:45 +00:00
d6bea398ac Only include RMSNorm.h in layer_norm.cpp for MPS (#145524)
Test Plan: CI

Differential Revision: D68578213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145524
Approved by: https://github.com/malfet
2025-01-24 02:08:49 +00:00
d5629889f1 cpp_wrapper: Properly handle scalars when input to tensor arguments (#144910)
Additionally, reduce code duplication in `cpp_wrapper_cpu_array_ref.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144910
Approved by: https://github.com/desertfire
2025-01-24 02:06:35 +00:00
47e65077b1 OpenReg: Remove REGISTER_GENERATOR_PRIVATEUSE1 (#144841)
Replace REGISTER_GENERATOR_PRIVATEUSE1 with new API in AcceleratorHooksInterface.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144841
Approved by: https://github.com/albanD
2025-01-24 01:52:10 +00:00
cd68d54911 Inductor cache: Revamp how we handle frozen params (#143808)
Summary: In https://github.com/pytorch/pytorch/pull/143563 we have a report of a problem with the treatment of frozen params in the inductor cache implementation. There seems to be a path where new constants are added in the `GraphLowering`. On a cache hit when we try to find those constant names in the `torch.fx.GraphModule`, they do not exist. The current approach treats all constants differently if the GM has any frozen params. This PR changes the approach to only treat the _frozen_ params specially, but store all other constants in the cache entry (as we do without freezing):
1) When creating a cache entry, store the names of any frozen params, but the values of any other constants.
2) On a cache hit, restore the values of the frozen params by looking up in the current GM.
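
The two steps above can be sketched with plain dictionaries. This is a simplified model under stated assumptions: `make_cache_entry`, `restore_constants` and the entry layout are hypothetical names for illustration, and the "current GM" is represented by a dict of attributes rather than a `torch.fx.GraphModule`.

```python
def make_cache_entry(constants: dict, frozen_names: set) -> dict:
    """Store names only for frozen params, and values for everything else."""
    return {
        "frozen_param_names": sorted(n for n in constants if n in frozen_names),
        "constant_values": {
            n: v for n, v in constants.items() if n not in frozen_names
        },
    }


def restore_constants(entry: dict, current_module_attrs: dict) -> dict:
    """On a cache hit, rebuild the full constant table."""
    restored = dict(entry["constant_values"])
    for name in entry["frozen_param_names"]:
        # Frozen params are looked up by name in the module being compiled now.
        restored[name] = current_module_attrs[name]
    return restored


constants = {"_frozen_param0": [1.0, 2.0], "new_const0": 3}
entry = make_cache_entry(constants, frozen_names={"_frozen_param0"})
restored = restore_constants(entry, {"_frozen_param0": [9.0, 9.0]})
print(restored)  # {'new_const0': 3, '_frozen_param0': [9.0, 9.0]}
```

Constants created during lowering (here `new_const0`) travel with the cache entry, so a hit never has to find them in the new module.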

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143808
Approved by: https://github.com/leslie-fang-intel, https://github.com/eellison
2025-01-24 01:20:07 +00:00
54e2f4b201 Fix lerp weight type promotion (#141117)
Fixes #140601

Enable `promote_inputs_to_common_dtype` when the tensors passed to `lerp` do not share a dtype.

For `lerp_Tensor`
- Check whether the tensors have the same `dtype`; enable promotion if not
- Remove the type check assert

For `lerp_Scalar`
- `promote_inputs_to_common_dtype` already seems to be enabled by default, so just remove the type check. This keeps promotion behavior consistent with `lerp_Tensor`

`lerp_Scalar` gets its TensorIteratorConfig from here
c37185c76a/aten/src/ATen/TensorIterator.cpp (L979-L985)

**Test Result**
Test case in issue passed

```python
>>> import torch
>>>
>>> x = torch.ones(2, 2, dtype=torch.float64)
>>> w = torch.ones(2, 2, dtype=torch.float64)
>>> s = torch.tensor(2.2)
>>> x.lerp_(w, s)
tensor([[1., 1.],
        [1., 1.]], dtype=torch.float64)

>>> x = torch.ones(2, 2, dtype=torch.float16)
>>> w = torch.ones(2, 2, dtype=torch.float16)
>>> s = torch.tensor(2.2)
>>> x.lerp_(w, s)
tensor([[1., 1.],
        [1., 1.]], dtype=torch.float16)

```

```bash
$ pytest test/test_binary_ufuncs.py -k 'test_lerp_tensor_type_promotion or test_lerp_scalar_type_promotion'
```
![image](https://github.com/user-attachments/assets/288a5294-a9ee-47f3-bbf7-d4ff986f3ba8)

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/d469836f-5c49-4d89-a2fd-379cad4db3af)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141117
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-01-24 01:18:20 +00:00
b2c89bc115 [inductor][2/N] triton support post-#5512, user-defined triton kernels (#145348)
Triton commit 5220 adds tuple support in Triton (changing the indexing format in AttrsDescriptor) and commit 5512 replaces AttrsDescriptor with raw tuples. This PR fixes user-defined triton kernel handling (in most cases) for these new triton commits.

What this PR fixes:
* in triton_kernel_wrap.py, AST->TTIR parsing was to be updated for the new triton API
* ir.py - don't remove None args when using newer triton versions
* wrapper.py - update signature & constant handling

What this doesn't fix:
* correct None handling - I want to take a closer look at constant handling (including None, equal_to_1, and other constants).
* cpp wrapper (which needs to be fixed for both user-defined triton kernels and inductor-generated kernels)

test/inductor/test_triton_kernels.py passed on triton commit 74de6b46, with the exception of three tests (those shown here: 1374074098)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145348
Approved by: https://github.com/jansel
ghstack dependencies: #145051
2025-01-24 00:34:01 +00:00
b963ab5325 [inductor][1/N] triton support post-#5512, main components (#145051)
Triton commit 5220 adds tuple support in Triton (changing the indexing format in AttrsDescriptor) and commit 5512 replaces AttrsDescriptor with raw tuples. This is an initial PR to add support for Triton versions after commit 5512 landed.

The main changes in 5220 and 5512 that need to be supported:
* AttrsDescriptor() gets replaced with a raw dict. The raw dict has the format `{(TUPLES): [["tt.divisibility", 16]]}`, where `(TUPLES)` is a tuple of indices, e.g. `((0,), (1,), (3,))` to indicate that args 0, 1, and 3 are divisible by 16. These indices are, themselves, represented as tuples to support nested inputs (e.g. an argument that's a tuple), but support for tuples is not implemented right now.
* "signature" changes: the signature now contains _all_ args, including constexpr and constant args.
* ASTSource now takes "constexprs" instead of "constants" - for example, equal-to-1 args are constants but not constexprs so we don't need to pass these args as "constants".
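
To make the raw dict format from the first bullet concrete, here is a small Python sketch that builds it for top-level arguments. `divisibility_attrs` is a hypothetical helper written for this illustration; it only mirrors the format described above and is not a Triton or inductor API.

```python
def divisibility_attrs(divisible_arg_indices, alignment=16):
    """Build the raw attribute dict for top-level (non-nested) arguments."""
    # Each index is wrapped in its own 1-tuple so that nested arguments
    # could, in principle, be addressed by longer index paths.
    key = tuple((i,) for i in divisible_arg_indices)
    return {key: [["tt.divisibility", alignment]]}


attrs = divisibility_attrs([0, 1, 3])
print(attrs)  # {((0,), (1,), (3,)): [['tt.divisibility', 16]]}
```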

What this PR supports:
* Triton versions before Dec 9, 2024, and (partial support for) Triton versions after Jan 1, 2025
* (triton jan 1+) typical inductor-generated triton: updated AttrsDescriptor, signatures, constexpr/constant handling.

What this PR doesn't support (TODO in follow-up PRs):
* Triton versions between Dec 9, 2024 and Jan 1, 2025
* (triton jan 1+) user-defined triton kernel support (this is implemented already in @anmyachev's patch)
* (triton jan 1+) triton_helper support (failing in triton codegen - needs investigation)
* (triton jan 1+) AOTI / cpp wrapper

thanks to @anmyachev for patches in https://github.com/intel/intel-xpu-backend-for-triton/blob/main/scripts/pytorch.patch, which contains most of these changes already

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145051
Approved by: https://github.com/jansel
2025-01-24 00:34:01 +00:00
714f64329b Revert "Add multi env variable support to configs (#145288)"
This reverts commit a8b7cb6a2ddbba4924b6b2531f1ecd2f5ed6d512.

Reverted https://github.com/pytorch/pytorch/pull/145288 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lint from a landrace with some recent PEP585 changes ([comment](https://github.com/pytorch/pytorch/pull/145288#issuecomment-2611278428))
2025-01-24 00:20:00 +00:00
6a2b4db0a1 Revert "Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806)"
This reverts commit 42f4fda2ebb27693411f7acca1665778d539bf79.

Reverted https://github.com/pytorch/pytorch/pull/143806 on behalf of https://github.com/huydhn due to Lots of builds fail after this land, so maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/143806#issuecomment-2611275836))
2025-01-24 00:17:34 +00:00
6f60c65a3a Revert "[dynamo] Log guard latency (#145132)"
This reverts commit 0a310d738819ae000f49b32298305724117634c2.

Reverted https://github.com/pytorch/pytorch/pull/145132 on behalf of https://github.com/anijain2305 due to CI failures observed after PR was merged ([comment](https://github.com/pytorch/pytorch/pull/145132#issuecomment-2611268421))
2025-01-24 00:11:50 +00:00
f0e9f87a9b [BE/mps] Mark input args as constant to prevent incorrect usage. (#145535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145535
Approved by: https://github.com/malfet, https://github.com/jansel
2025-01-24 00:11:44 +00:00
6aaae9d78f Make torchelastic etcd rendezvous publicly importable (#145396)
Make the torchelastic etcd rendezvous publicly importable by raising the etcd import error lazily, on first use rather than at import time, [BE task, row 7](https://docs.google.com/spreadsheets/d/1TtATnLJf1rVXaBQd3X3yYqm9xNN9BIWG7QqRgrFiRRI/edit?gid=1748512924#gid=1748512924)
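
The general pattern of deferring an optional import can be sketched as follows. This is a generic illustration, assuming a hypothetical `LazyModule` wrapper; it is not the code used in torchelastic, and `json` stands in for the optional dependency so the example runs anywhere.

```python
import importlib


class LazyModule:
    """Defer importing an optional dependency until first attribute access."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # __getattr__ only runs for attributes not found normally, so
        # _name and _module are read without recursion.
        if self._module is None:
            try:
                self._module = importlib.import_module(self._name)
            except ImportError as exc:
                raise ImportError(
                    f"{self._name} is required for this feature but is not installed"
                ) from exc
        return getattr(self._module, attr)


json_mod = LazyModule("json")
print(json_mod.dumps([1]))  # [1]

# Creating the wrapper never fails; the error appears only on first use.
missing = LazyModule("definitely_not_installed_pkg_xyz")
```

With this shape, importing the enclosing module always succeeds, and users without the dependency see an error only when they reach for the feature that needs it.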

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145396
Approved by: https://github.com/albanD
ghstack dependencies: #145387
2025-01-23 23:56:45 +00:00
f8a4f16634 [c10d] fix memory leak on shutdown (#145507)
Summary:
Fix memory leak on shutdown when socket is closed.
We still need to free the buffer to make valgrind happy.

Test Plan:
Use `mtiavm`.
Repro steps provided by cristianlume.

on window 1:
```
vm ssh --vm=0 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2
```
on window 2:
```
vm ssh --vm=1 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2 --rank=1 --store_host=172.16.1.1
```

without the fix:
```
==8766==ERROR: LeakSanitizer: detected memory leaks
```
With fix, no leak

Differential Revision: D68566104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145507
Approved by: https://github.com/XilunWu, https://github.com/d4l3k
2025-01-23 23:36:15 +00:00
6dd8283381 Revert "[compiled autograd] Proxy opaque nodes for built-in autograd nodes (#143296)"
This reverts commit 5531fafffefc45cd894040b2b07b0d5227430082.

Reverted https://github.com/pytorch/pytorch/pull/143296 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:13 +00:00
c3fadacf84 Revert "[compiled autograd] Proxy a node for CopyBackwards into the graph (#143304)"
This reverts commit 8c7c5f7bfcbc55638a0e4aed6eaa27f6194dbebe.

Reverted https://github.com/pytorch/pytorch/pull/143304 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:13 +00:00
9553301ade Revert "[compiled autograd] Proxy nodes for user-defined C++ torch::autograd::Function (#143387)"
This reverts commit 784bb2127ca9729c646f1650ecc2cf946a583da8.

Reverted https://github.com/pytorch/pytorch/pull/143387 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:13 +00:00
16c4f8c395 Revert "[compiled autograd] Always proxy autograd.Function nodes; handle AOT backwards (#143405)"
This reverts commit ec820fe57c2d6a2847569a107856e7fcff87dc5c.

Reverted https://github.com/pytorch/pytorch/pull/143405 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:13 +00:00
3f6cfd0156 Revert "[compiled autograd] stop specializing on metadata during initial trace (#143417)"
This reverts commit 99dd1bf1b93bc26080e611af54497a73a618e02a.

Reverted https://github.com/pytorch/pytorch/pull/143417 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:12 +00:00
ab082863a1 Revert "[compiled autograd] support Tensor Subclasses in AOTBackward (#144115)"
This reverts commit 082c28c3c655984ce65c13336cff822db95ee470.

Reverted https://github.com/pytorch/pytorch/pull/144115 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:12 +00:00
0a310d7388 [dynamo] Log guard latency (#145132)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145132
Approved by: https://github.com/ezyang
ghstack dependencies: #145351, #145420
2025-01-23 23:30:07 +00:00
bf62222d81 Revert "[compiled_autograd] Rename interface to pyinterface (#145495)"
This reverts commit e1407f5aeb658c8c959d33158f465e975799a3d0.

Reverted https://github.com/pytorch/pytorch/pull/145495 on behalf of https://github.com/izaitsevfb due to reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145495#issuecomment-2611194932))
2025-01-23 23:07:17 +00:00
a8b7cb6a2d Add multi env variable support to configs (#145288)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145288
Approved by: https://github.com/c00w
2025-01-23 23:00:23 +00:00
dad9bc3461 Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit de945d78da9198e58df7c19c53b737d0f987ddff.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/izaitsevfb due to unused variables again :( ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2611182461))
2025-01-23 22:59:25 +00:00
cyy
42f4fda2eb Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143806
Approved by: https://github.com/kwen2501
2025-01-23 22:47:18 +00:00
6f07847efe Bail on checking internal overlap when dealing with unbacked symints (#145385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145385
Approved by: https://github.com/ezyang
2025-01-23 22:31:31 +00:00
e1407f5aeb [compiled_autograd] Rename interface to pyinterface (#145495)
Summary: interface is a reserved word in some MSVC variants.

Test Plan: build

Differential Revision: D68561379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145495
Approved by: https://github.com/xmfan
2025-01-23 21:40:59 +00:00
302b07f166 Implement deepcopy for AOTICompiledModel (#145423)
Summary:

Fix https://github.com/pytorch/pytorch/issues/145411

Support deepcopying AOTICompiledModel. The `loader` is shallow copied.

Test Plan:
```
buck2 run fbcode//mode/opt //caffe2/test/inductor:aot_inductor_package -- -r deepcopy
```

Differential Revision: D68524673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145423
Approved by: https://github.com/desertfire
2025-01-23 21:05:30 +00:00
e924ddbef1 [BE] [mps] Refactor UnaryConstants to be its own kernel. (#145230)
In preparation for using this file for inductor (for erfinv).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145230
Approved by: https://github.com/malfet
2025-01-23 20:58:43 +00:00
881eb86692 Fix staging for CPU tensors in OSS DCP async_save (#145408)

Summary:

As found in
https://github.com/pytorch/pytorch/issues/144657
for CPU tensors we accidentally skip the copy during staging because we use the offload-to-CPU helper, which is a no-op for CPU tensors. As a result, if the trainer changes the source CPU tensor after async save is launched but before writing/uploading to the destination begins, the writing/uploading logic picks up the latest state of the tensor instead of the dedicated copy it should have saved earlier. Replacing _offload_state_dict_to_cpu with _copy_state_dict fixes this bug.
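
The aliasing problem can be illustrated without PyTorch. This is a minimal pure-Python sketch, not the DCP implementation: `stage_by_reference` and `stage_by_copy` are hypothetical stand-ins for a no-op offload and a real copy, and a plain list stands in for a CPU tensor.

```python
import copy


def stage_by_reference(state_dict):
    # Hypothetical stand-in for an offload helper that is a no-op for data
    # already on the CPU: the staged dict still aliases the source values.
    return dict(state_dict)


def stage_by_copy(state_dict):
    # Hypothetical stand-in for a staging step that always makes its own copy.
    return copy.deepcopy(state_dict)


source = {"weight": [1.0]}
aliased = stage_by_reference(source)
copied = stage_by_copy(source)

# The trainer mutates the source in place before the background write starts.
source["weight"][0] = 2.0

print(aliased["weight"][0])  # the writer would see the later value
print(copied["weight"][0])   # the writer sees the value at save time
```

Only the copied staging area is isolated from later in-place updates, which is the property an asynchronous writer depends on.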

Test Plan:
Running the user script from the linked GitHub issue verifies the fix:
```
import os

import torch

import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_model_state_dict
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(1, 1))

    def forward(self, x):
        # Net defines no `layer` attribute, so use the weight directly.
        return x * self.weight

os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12345"
os.environ["WORLD_SIZE"] = "1"
os.environ["RANK"] = "0"

dist.init_process_group()

model = Net()
state_dict = get_model_state_dict(model)
pg = dist.new_group(backend="gloo")

try:
    steps = [10, 20, 30, 40, 50]
    future = None
    for step in steps:
        # simulate a training step, e.g. optimizer updating values
        with torch.no_grad():
            model.weight.data.fill_(step)

        if future is not None:
            future.result()
            future = None
        future = dcp.async_save(
            state_dict,
            checkpoint_id=f"outputs/{step}",
            process_group=pg,
        )

    future.result()

    for step in steps:
        dcp.load(
            state_dict,
            checkpoint_id=f"outputs/{step}",
            process_group=pg,
        )
        assert state_dict["weight"][0, 0] == step, f"got {state_dict['weight'][0, 0]=} on {step=}"
finally:
    dist.destroy_process_group(pg)
    dist.destroy_process_group()
```
The script passes all asserts with this fix. On trunk, without the fix, it fails the first assert.

Differential Revision: D68518689
2025-01-23 12:49:26 -08:00
6a44a61514 [BE] Bump TIMM pin (#145320)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145320
Approved by: https://github.com/Skylion007
2025-01-23 20:44:26 +00:00
99367ecbed [draft export] count how many times a data-dep error shows up (#145030)
Summary: maybe this is helpful?

Test Plan: draft_export

Differential Revision: D68303934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145030
Approved by: https://github.com/angelayi
2025-01-23 20:27:31 +00:00
5ebca3015d [BE]: Simplify set add with set update (#145152)
Simplifies the set update slightly to be more readable and efficient.
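
The kind of rewrite the title describes might look like the following. This is an illustrative sketch only; the variable names are hypothetical and are not taken from the PR's diff.

```python
items = ["a", "b", "a", "c"]

# Before: add elements one at a time in an explicit loop.
seen_loop = set()
for item in items:
    seen_loop.add(item)

# After: a single update call consumes the whole iterable.
seen_update = set()
seen_update.update(items)

print(seen_loop == seen_update)
```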
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145152
Approved by: https://github.com/XuehaiPan, https://github.com/albanD

Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
2025-01-23 20:18:13 +00:00
d7b6746470 Revert "Fix deprecated pytorch_sphinx_theme editable installation (#145347)"
This reverts commit c27dd9cf72265161f85a18c0b19f365097f7a1ac.

Reverted https://github.com/pytorch/pytorch/pull/145347 on behalf of https://github.com/huydhn due to Remove -e breaks the theme somehow ([comment](https://github.com/pytorch/pytorch/pull/145347#issuecomment-2610911258))
2025-01-23 20:06:07 +00:00
d53f2067fe [BE][export] add "+export" logging to de/serialization (#145283)
adds de/serialization debug logging to `TORCH_LOGS="+dynamic"`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145283
Approved by: https://github.com/ydwu4, https://github.com/angelayi
2025-01-23 19:47:48 +00:00
ce4a097bf7 Revert "Added swizzle searching, disabled fp16 accum, and enabled ping-pong for cutlass (#144829)"
This reverts commit 55084443cabbaf6c28d8c546d8988cf3ed0f3d1c.

Reverted https://github.com/pytorch/pytorch/pull/144829 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/144829#issuecomment-2610855579))
2025-01-23 19:37:54 +00:00
527101fa95 Move Windows arm64 scripts from pytorch/builder (#144317)
This PR moves the Windows Arm64 scripts from the builder repository to the main repository. The corresponding PR to pytorch/builder that removes them is here : https://github.com/pytorch/builder/pull/2058
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144317
Approved by: https://github.com/Skylion007, https://github.com/seemethere

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
2025-01-23 19:29:29 +00:00
66bf7da446 Enable sleef for Win Arm64 (#144876)
The Sleef module was disabled for Windows Arm64 in commit b021486405.
This PR re-enables it since the issue is no longer valid.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144876
Approved by: https://github.com/albanD, https://github.com/malfet

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
2025-01-23 19:22:58 +00:00
991a4b5925 [dynamo] Add --profile-details and --export-perfdoctor option (#144751)
Summary:
Add `--profile-details` option to add shapes and other details to the Kineto profile.

Add `--export-perfdoctor` to directly dump trace to perfdoctor for webview.

Test Plan:
```
$ buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench_internal -- --only mrs_video_watch_over --performance --training --amp --export-profiler-trace --backend=inductor --profile-details --export-perfdoctor
```

https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/pyper_traces/tree/traces/test/inductor_mrs_video_watch_over_rank_0_20250113_173817_6535183793.json.gz

Differential Revision: D68134547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144751
Approved by: https://github.com/drisspg
2025-01-23 19:09:40 +00:00
5b37249259 Enable fp16 linear layers in PyTorch via ACL (#144992)
This pull request aims to enable the use of linear layers with the fp16 data type through the ACL.

On a Graviton3 instance running with 16 threads, `torch.randn(2048, 4096, dtype=torch.half)` will take 50+% less time to complete compared with `torch.randn(2048, 4096, dtype=torch.float32)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144992
Approved by: https://github.com/ng-05, https://github.com/digantdesai, https://github.com/malfet
2025-01-23 19:07:54 +00:00
6d4f5f7688 [Utilization][Usage Log] Add data model for record (#145114)
Add a data model for consistency and to make future data model changes easier.

The data model will be used during the post-test-process pipeline.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145114
Approved by: https://github.com/huydhn
2025-01-23 19:04:41 +00:00
2f317bbdbc Missing autorelease in lstm_mps caused a ton of leaked memory (#145503)
The dictionary held onto the new MPSGraphTensorData objects and MPSNDArrays.  Regression caused by https://github.com/pytorch/pytorch/pull/95137

Fixes #145374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145503
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-01-23 18:54:30 +00:00
41b38f755c Revert "Reverting the PR adding Kleidiai-based int4 kernels (#145392)" (#145505)
https://github.com/pytorch/pytorch/pull/134124 was reverted by https://github.com/pytorch/pytorch/pull/145392 due to a KleidiAI clone issue.

1. This reverts commit 0940eb6d44f3cf69dd840db990245cbe1f78e770 (https://github.com/pytorch/pytorch/pull/145392) and fixes the KleidiAI mirror issue.
2. KleidiAI is now cloned from the GitHub mirror instead of the Arm GitLab.

Change-Id: I7d6eee7214cd117d3057d615936fcc3ee6052fa2

Fixes https://github.com/pytorch/pytorch/issues/145273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145505
Approved by: https://github.com/malfet
2025-01-23 18:50:59 +00:00
34b8d8b0c0 update compile time benchmarks to dump compile times to stdout and csv (#145447)
```python
# inductor.csv
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips,compilation_latency
cuda,cait_m36_384,8,pass,2510,1,0,0,0,0,0,87.705186
```

```python
loading model: 0it [01:27, ?it/s]
cuda eval  cait_m36_384
Compilation time (from dynamo_timed): 87.705186276  # <----------------
pass
TIMING: _recursive_pre_grad_passes:0.11023 pad_mm_benchmark:0.50341 _recursive_joint_graph_passes:3.88557 _recursive_post_grad_passes:6.71182 async_compile.wait:4.16914 code_gen:17.57586 inductor_compile:42.55769 backend_compile:72.47122 entire_frame_compile:87.70519 gc:0.00112 total_wall_time:87.70519
STATS: call_* op count: 2510 | FakeTensorMode.__torch_dispatch__:101743 | FakeTensor.__torch_dispatch__:12959 | ProxyTorchDispatchMode.__torch_dispatch__:41079
Dynamo produced 1 graphs covering 2510 ops with 0 graph breaks (0 unique)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145447
Approved by: https://github.com/ezyang
2025-01-23 18:49:19 +00:00
629fb1590c [BE] Type annotate pad_mm.py (#145409)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145409
Approved by: https://github.com/Skylion007
2025-01-23 18:34:24 +00:00
015c6d6fdb [dynamo][guards] Turn on profiling of guard manager (#145420)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145420
Approved by: https://github.com/ezyang
ghstack dependencies: #145351
2025-01-23 18:17:43 +00:00
fef92c9447 Fix IndentationError of code example (#145251)
Copying and pasting the example of inference with torch.compile raises an IndentationError.
This PR fixes the formatting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145251
Approved by: https://github.com/mikaylagawarecki

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-23 18:17:11 +00:00
9a5bc7b6dd [BE] Type annotate metrics.py (#145418)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145418
Approved by: https://github.com/Skylion007
2025-01-23 18:13:59 +00:00
bdc2c2a237 [be] fix flaky test aot_export_ cond caused by free symbol lifting and automatic dynamic shape (#145330)
Fixes https://github.com/pytorch/pytorch/issues/139998#issuecomment-2605908426.

It seems to be an issue caused by the interaction between dynamo-traced HOPs, automatic dynamic shapes, and auto-lifting of free symbols. The immediate error is that the assertExpectedInline check on the graph can sometimes differ, e.g. see https://hud.pytorch.org/flakytest?name=test_aot_export_with_torch_cond&suite=TestAOTExport&limit=100, where the shapes are sometimes lifted as inputs to the cond and sometimes not.

The root cause of the flakiness is that the two invocations of torch.cond trigger two torch.compile calls on the same code object ([code](https://github.com/pytorch/pytorch/blob/main/torch/_higher_order_ops/cond.py#L192)), which triggers automatic dynamic shapes because in test_aot_export_with_torch_cond, x has shape (3, 4) while the pre_dispatch one has shape (2, 2). Because we auto-lift free symbols for dynamically shaped inputs, cond sometimes has the shapes as arguments and sometimes does not.

This PR adds a simple fix: call _dynamo.reset before each torch.cond test. This fixes the error by not triggering automatic dynamic shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145330
Approved by: https://github.com/zou3519
2025-01-23 18:12:58 +00:00
3c247ee8c4 [hop][be] add utils for more comprehensive input alias and mutation (#145298)
This PR implements the idea from @zou3519 of checking input mutations through tensor versions and checking aliasing via storage. Previously, we relied on whether there was an in-place op that takes a placeholder input, which does not take views into account.

While writing the PR, I also noticed a bug in the previous input mutation checking logic: we were checking for mutating operators in functionalized_f, where all the mutating ops have already been replaced, so we could never detect anything.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145298
Approved by: https://github.com/zou3519
2025-01-23 18:12:28 +00:00
b0f3597133 Add fused rms_norm implementation for MPS backend (#145301)
Adding a fused rms_norm implementation for MPS backend. This eliminates most of the current CPU overhead, making this computation GPU bound and improving latency of rms_norm by **30x-40x** on MPS backend
The metal shader was adapted from MLX: e6a7ab9675/mlx/backend/metal/kernels/rms_norm.metal

The numbers below are averages over 1000 runs of RMSNorm, obtained on an M1 Pro.

Benchmarking Results (Before):
```
Device            :            MPS            |           CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :  140.5  | 171.0  | 170.4  |  10.9  |  13.3  |  13.5
```

Benchmarking Results (After):
```
Device            :            MPS            |           CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :    4.0  |   3.9  |   3.9  |  10.0  |  12.4  |  13.0
```

Profiling Results (Before):
```
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
               aten::rms_norm         2.35%       3.284ms       100.00%     140.038ms     140.038us          1000
                    aten::mul        33.61%      47.068ms        33.61%      47.068ms      23.534us          2000
                    aten::pow        17.04%      23.868ms        17.43%      24.402ms      24.402us          1000
                   aten::add_        16.52%      23.130ms        16.78%      23.497ms      23.497us          1000
                   aten::mean        15.82%      22.151ms        15.82%      22.151ms      22.151us          1000
                  aten::rsqrt        13.63%      19.085ms        13.71%      19.198ms      19.198us          1000
                   aten::item         0.46%     639.370us         0.56%     788.376us       0.394us          2000
                aten::type_as         0.21%     295.507us         0.27%     371.291us       0.371us          1000
                     aten::to         0.13%     177.742us         0.13%     177.742us       0.059us          3000
    aten::_local_scalar_dense         0.11%     149.006us         0.11%     149.006us       0.075us          2000
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 140.038ms
```

Profiling Results (After):
```
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------
                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------
         aten::rms_norm        63.21%     832.875us       100.00%       1.318ms       1.318us          1000
       aten::empty_like        16.06%     211.631us        36.79%     484.681us       0.485us          1000
    aten::empty_strided        20.72%     273.050us        20.72%     273.050us       0.273us          1000
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.318ms
```
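
For reference, the ops in the "before" profile (pow, mean, add_, rsqrt, mul) are consistent with the standard unfused RMSNorm computation, `y = x * rsqrt(mean(x^2) + eps) * w`. The following is a minimal pure-Python sketch of that formula over a single vector; `rms_norm_reference` is a hypothetical name, and this is neither the Metal kernel nor the ATen implementation.

```python
import math


def rms_norm_reference(x, weight, eps=1e-5):
    # Unfused reference: square, mean, add eps, reciprocal sqrt, then scale.
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


x = [3.0, 4.0]
y = rms_norm_reference(x, [1.0, 1.0], eps=0.0)
# The mean of squares is 12.5, so each element is divided by sqrt(12.5).
print(y)
```

Each of these steps is a separate op dispatch in the unfused path, which is where the CPU overhead in the "before" profile comes from; a fused kernel performs them in one launch.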

Benchmarking and profiling script:
```python
import torch
import torch.nn as nn
from torch.profiler import profile
import time

def benchmark(device, dtype):
    model = nn.RMSNorm(2048, device=device)

    # Create example inputs
    x = torch.randn(1, 1, 2048, requires_grad=False, device=device, dtype=dtype)
    w = torch.randn(2048, requires_grad=False, device=device, dtype=dtype)
    eps = 1e-5

    # Check output
    y = torch.ops.aten.rms_norm(x, [2048], w, eps)
    z = torch.ops.aten.rms_norm(x.cpu(), [2048], w.cpu(), eps)
    outputs_match = torch.allclose(y.cpu(), z)

    # Measure time manually
    start_time = time.time() * 1000
    for _ in range(1000):
        with torch.no_grad():
            y = model(x)
            torch.mps.synchronize()
    end_time = time.time() * 1000
    manual_delta = (end_time - start_time)
    average_time = f"{manual_delta:6.1f}"

    return outputs_match, average_time

outputs_match_list = []
average_time_list = []
for device in ["mps", "cpu"]:
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        outputs_match, average_time = benchmark(device, dtype)
        outputs_match_list.append(str(outputs_match))
        average_time_list.append(average_time)

print("\nBenchmarking Results:")
print("---------------------")
print("Device            :            MPS            |           CPU")
print("Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16")
print(f"Outputs Match     :  ", "  |  ".join(outputs_match_list))
print(f"Average Time (us) :", "  |".join(average_time_list))

device = "mps"
dtype = torch.float32
model = nn.RMSNorm(2048, device=device)
x = torch.randn(1, 1, 2048, requires_grad=False, device=device, dtype=dtype)

# Run and profile the model
with profile() as prof:
    with torch.no_grad():
        for _ in range(1000):
            y = model(x)
            torch.mps.synchronize()

# Print profiling results
print("\n\nProfiling Results (MPS/FP32):")
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145301
Approved by: https://github.com/malfet
2025-01-23 18:07:10 +00:00
a86fa779ce [BE] Fix edge case in translation validation bisector (#145414)
This patch fixes a small bug for the binary-search algorithm in
translation validation bisector. Fixes #131303.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145414
Approved by: https://github.com/ysiraichi, https://github.com/zou3519
2025-01-23 17:35:28 +00:00
045698653a [BE] Remove test_ops_gradients from FIXME_inductor_dont_reset_dynamo (#145308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145308
Approved by: https://github.com/zou3519
ghstack dependencies: #145306
2025-01-23 17:25:04 +00:00
3a8d3785f7 [ca][bug_fix] Fix ref counting of objects in the set_autograd_compiler function. (#145482)
PR #141153 exposed the option to collect sizes as dynamic. After this
change, the function set_autograd_compiler returns a PyTuple object that
is populated using PyTuple_SET_ITEM. However, PyTuple_SET_ITEM steals a
reference to the object and does not INCREF it. So currently we are
missing an INCREF on prior_compiler when it is Py_None and an INCREF on
prior_dynamic, which is either Py_False or Py_True. This bug may lead to
memory corruption.

@xmfan @jansel @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145482
Approved by: https://github.com/albanD, https://github.com/jansel
2025-01-23 17:13:56 +00:00
c6707734de Enable non power of 2 head_dim for FlexAttention (#133495)
# Summary
- Adds support for non-power-of-2 head_dim by launching blocks with head_dim rounded up to the next valid power.
- The other option I considered was building up the final dot products with smaller blocks (this would probably work, but for the sake of code complexity I am going with this option for now).
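
The rounding step can be sketched in plain Python, assuming "next valid power" means the next power of two. The helper name `next_power_of_2` is chosen for illustration, and the sketch does not show how the kernel handles the extra padded columns.

```python
def next_power_of_2(n: int) -> int:
    # Smallest power of two that is >= n, for n >= 1.
    return 1 << (n - 1).bit_length()


for head_dim in (64, 96, 100, 128):
    padded = next_power_of_2(head_dim)
    print(head_dim, "->", padded)
```

Values that are already powers of two are left unchanged, so existing head_dim configurations would launch the same block sizes as before.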

### Corollary
We had a bug in our backwards kernel where we were using index_k instead of index_v. This should have shown up for the qk_head_dim != v_head_dim cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133495
Approved by: https://github.com/Chillee
2025-01-23 17:05:38 +00:00
bf4f8919df Fix test_modules_can_be_imported (#145387)
`test_modules_can_be_imported` test is currently failing due to a few missing private modules and this PR gets it working before I start to clean up the public allow list
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145387
Approved by: https://github.com/albanD
2025-01-23 16:03:00 +00:00
768ad0886f Revert "Binary upload checksum (#144887)"
This reverts commit 2efa98d69d362e4ee6f15938ec8ded30bf5c40fd.

Reverted https://github.com/pytorch/pytorch/pull/144887 on behalf of https://github.com/atalman due to Broke nightly index ([comment](https://github.com/pytorch/pytorch/pull/144887#issuecomment-2610066277))
2025-01-23 15:10:42 +00:00
0802e78315 [CD] Disable Kineto for XPU Windows CD (#145255)
Due to issue #145155, temporarily disable Kineto for XPU Windows CD.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145255
Approved by: https://github.com/xuhancn, https://github.com/atalman
2025-01-23 14:09:52 +00:00
629840e038 Backout PEP585 use of Iterable (#145438)
Summary:
Importing Iterable from collections.abc here causes an internal product to fail
MRO discovery causing a collision between Iterable and Generic.

This fixes the failure on D68461304

Differential Revision: D68531443

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145438
Approved by: https://github.com/izaitsevfb
2025-01-23 11:45:37 +00:00
cyy
29f52e3972 [2/N] Remove unnecessary once flag usage (#145057)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145057
Approved by: https://github.com/albanD
2025-01-23 09:48:46 +00:00
b6941d4e42 [inductor] fix autotuning memory usage (#145410)
We use `cpu_tensor.copy_(gpu_tensor)` to clone mutated kernel arguments for autotuning. The purpose is to avoid increasing peak memory due to the clone. But if `gpu_tensor` is not contiguous, this `copy_` needs to allocate a temporary tensor on the GPU to store a contiguous copy of `gpu_tensor`:

6e53588789/aten/src/ATen/native/cuda/Copy.cu (L322-L334)

Here is a standalone script to illustrate this behavior: https://gist.github.com/shunting314/812a848dc67b1d674ae42415a7a462c8 . The script reports 6GB rather than 3GB of peak memory usage.

Note that, with all the following efforts
1. donated buffer
2. inplace padding
3. this PR

We save 3GB peak memory (18.6GB -> 15.5GB) for GPT2 model for torch.compile.

The peak memory of GPT2 is like a '...\_M\_...' shape. There are 2 places where we reach the peak. Donated buffer removes the first peak by computing grad_softmax in place, and in-place padding removes the second peak by not allocating an extra buffer for mm-padding.

Before all these optimizations, the peak memory is 18.6GB for GPT2 with torch.compile.
With 1 & 2, the peak memory is
1. 17.7GB with a cold cache
2. 15.5GB with a warm cache (since the autotuning overhead is skipped)

With 1 & 2 & 3, we save 3GB of peak memory (18.6GB -> 15.5GB) regardless of whether autotuning happens.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145410
Approved by: https://github.com/masnesral, https://github.com/jansel
ghstack dependencies: #140249, #145325
2025-01-23 09:34:23 +00:00
638903aeee Adapt Dynamo tests to HPUs using instantiate_device_type_tests (#144387)
**MOTIVATION**

We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.

Other accelerators can also extend the functionality by adding their device to the devices list (e.g. xpu).

**CHANGES**

- Create a separate class for test functions running on CUDA devices
- Extend the functionality of these tests to include HPUs
- Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances within the new classes
- Apply skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices

We had previously submitted some of these changes in https://github.com/pytorch/pytorch/pull/140131. However, we abandoned that PR due to merge conflicts and other issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144387
Approved by: https://github.com/ankurneog, https://github.com/EikanWang, https://github.com/yanboliang, https://github.com/guangyey
2025-01-23 09:24:42 +00:00
d3f196909d [inductor] let inplace-padding support cpp-wrapper (#145325)
Some context: in-place padding is an optimization that performs padding in place. For example, suppose a tensor has size [2048, 2047] and stride [2048, 1]. When we need to pad one extra element to the end of each row (e.g. during mm padding), we can just reuse the original tensor and do the padding in place. This saves memory and bandwidth. One caveat for this optimization is that PyTorch does not allocate 2048 elements for the last row of the original tensor; it allocates only 2047 elements. So assuming the last row has enough space for 2048 elements may be wrong and may cause out-of-bounds memory access (I have never seen this happen, maybe due to overallocation in the CUDACachingAllocator, but it should still be fixed).

The fix is that, when we allocate the tensor, instead of doing something like:
```
  buf0 = randn_strided([2048, 2047], [2048, 1])
```
we do a small overallocation:
```
  buf0 = randn_strided([2048, 2048], [2048, 1]).as_strided([2048, 2047], [2048, 1])
```

cpp_wrapper needs special handling since memory allocation goes through a different code path than in the Python wrapper.
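
The size of the shortfall can be checked with a little arithmetic. This is a minimal pure-Python sketch, assuming the allocation is exactly the minimum needed to address every element of the given size and stride; `min_storage_elements` is a hypothetical helper, not a PyTorch API.

```python
def min_storage_elements(size, stride):
    # Elements needed to address every index of a strided tensor:
    # the offset of the last element plus one (0 for an empty tensor).
    if any(n == 0 for n in size):
        return 0
    return 1 + sum((n - 1) * st for n, st in zip(size, stride))


original = min_storage_elements([2048, 2047], [2048, 1])
padded = min_storage_elements([2048, 2048], [2048, 1])
print(original, padded, padded - original)
```

Under that assumption the two layouts differ by a single element, the last slot of the last row, which is the element the overallocation reserves.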

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145325
Approved by: https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #140249
2025-01-23 09:22:38 +00:00
f52901a0a7 [ONNX] Remove LegacyDynamoStrategy (#145442)
It's legacy, so remove it. This shouldn't affect anything and will facilitate cleaning up our legacy code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145442
Approved by: https://github.com/titaiwangms
2025-01-23 07:56:04 +00:00
28c251dd0b [BE] Remove test_modules from FIXME_inductor_dont_reset_dynamo (#145306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145306
Approved by: https://github.com/zou3519
2025-01-23 06:37:21 +00:00
f56c638849 [c10/metal] Add a vectype variant for short/int/long (#145430)
Some of the kernels (exp_complex/atan_complex) need the specialization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145430
Approved by: https://github.com/malfet, https://github.com/jansel
2025-01-23 04:52:56 +00:00
c58198184b [dynamo][dicts] Insert LENGTH guard on an if condition on dict (#145432)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145432
Approved by: https://github.com/williamwen42, https://github.com/jansel
2025-01-23 04:40:56 +00:00
faa10faa2c [ROCm] CK SDPA - Move arch check to CK patch (#144777)
__gfxXXX__ should only be visible to device code. Move the check to the CK kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144777
Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell, https://github.com/jianyuh
2025-01-23 04:12:25 +00:00
5e6451ea78 [c10] catch c10 error and log message (#145413)
Summary:
Explicitly catch c10 error and log the error message only.

The standard exception `e.what()` below ends up logging the stack trace, which confuses users.
See S477887 for details.

Test Plan:
tested locally.
```
buck test caffe2/test/cpp/c10d:TCPStoreTest
buck2 daemon constraint mismatch: Version mismatch; killing daemon...
Starting new buck2 daemon...
Connected to new buck2 daemon.
File changed: fbcode//caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
File changed: fbsource//xplat/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
Watchman fresh instance: new mergebase, cleared graph state, cleared dep files
Soft Error: source_directory_includes_subpackage: Directory `v2.17.1-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.17.1-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.18.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.18.3-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.19.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.19.3-1/src/tests`.
Buck UI: https://www.internalfb.com/buck2/dbd34fa4-50ed-4eeb-800d-688f5a7bec68
Test UI: https://www.internalfb.com/intern/testinfra/testrun/281475375994918
Network: Up: 1.5GiB  Down: 4.7GiB  (reSessionID-d6b0568e-2347-4375-a2d9-2d03ca0c2161)
Loading targets.   Remaining      0/3024                                                                                                                                 69199 dirs read, 687558 targets declared
Analyzing targets. Remaining      0/31483                                                                                                                                1481904 actions, 1719048 artifacts declared
Executing actions. Remaining      0/250391                                                                                                                               77:11:29.7s exec time total
Command: test.     Finished 2031 local, 45445 remote, 51473 cache (52% hit)                                                                                              20:16:36.9s exec time cached (26%)
Time elapsed: 7:32.7s
Tests finished: Pass 8. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D68516080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145413
Approved by: https://github.com/fduwjj
2025-01-23 03:45:47 +00:00
719938c77f Generalize pin memory logic for accelerator when non blocking copy happened (#143783)
# Motivation
fix https://github.com/pytorch/pytorch/issues/143641
Generalize the pin-memory logic for accelerators when a non-blocking copy happens. Each accelerator has its own implementation of `empty_strided`. An accelerator that has no pin-memory mechanism can ignore or mimic it when pin_out is True.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143783
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #144959
2025-01-23 03:43:05 +00:00
28b6430823 Introduce a new API isAcceleratorExcluded (#144959)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144959
Approved by: https://github.com/albanD
2025-01-23 03:43:05 +00:00
5a18f1e1eb [dynamo] Support fx map_aggregate (#145351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145351
Approved by: https://github.com/zou3519
2025-01-23 03:19:30 +00:00
d95a6babcc Revert "Align CPU behavior with CUDA for ConvTranspose when out_channels=0 (#142859)"
This reverts commit 0bff37788043626ee5e472389f88cbbbf7add997.

Reverted https://github.com/pytorch/pytorch/pull/142859 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the XLA failures look legit ([comment](https://github.com/pytorch/pytorch/pull/142859#issuecomment-2608631019))
2025-01-23 01:10:31 +00:00
0d28188cc8 Move privateuse1 test out of test_utils and make them serial (#145380)
Fixes https://github.com/pytorch/pytorch/issues/132720

The reason is that changing the privateuse1 module is global and so can race when other tests happen to check if it is enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145380
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
2025-01-23 00:31:39 +00:00
c9e12d6a3b [ROCm] Update rocm.yml and add rocm-mi300.yml (#145398)
- Added another workflow to run the mi300 jobs post-merge.
- Updated rocm.yml to use mi200s instead of mi300s.
- Required to get an idea of how PRs are landing on our mi200s and mi300s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145398
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-23 00:07:50 +00:00
1e32842324 Improve softmax's perf in cuda (#144679)
Fixes #144645

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144679
Approved by: https://github.com/eqy
2025-01-23 00:02:57 +00:00
d0a2e11284 [BE][export] Change custom_op registration style (#145315)
Summary:
`test_unbacked_bindings_for_divisible_u_symint` has been flaky for a while due to

```
Tried to register an operator (mylib::foo(Tensor a, Tensor b) -> Tensor) with the same name and overload name multiple times.
```

This likely happens when all variants of this test (non-strict, retrace, serdes) are run together: in the later variants, the operator has already been registered.

In this diff, we change registration style.

Test Plan:
```
buck2 test mode/dev-nosan caffe2/test:test_export -- -r test_unbacked_bindings_for_divisible_u_symint
```

Differential Revision: D68465258

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145315
Approved by: https://github.com/zou3519
2025-01-22 23:46:51 +00:00
4803e20bc7 [S481486] Move MTIA dynamic library loading from __init__.py to a separate module (#145322)
Summary: As titled

Test Plan:
- Passed CI tests

```
buck2 test 'fbcode//mode/opt' fbcode//ai_infra/distributed_ai/pyper_local_run/tests/integration_tests:test_icvr_e2e_gpu -- --exact 'ai_infra/distributed_ai/pyper_local_run/tests/integration_tests:test_icvr_e2e_gpu - test_icvr_e2e_gpu (ai_infra.distributed_ai.pyper_local_run.tests.integration_tests.test_icvr_e2e_gpu.TestIcvrE2EGpu)' --run-disabled
```

https://www.internalfb.com/intern/testinfra/testconsole/testrun/9007199320480497/

Differential Revision: D68463242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145322
Approved by: https://github.com/yuhc, https://github.com/albanD
2025-01-22 23:39:43 +00:00
35c8c31f11 Fix for failure in D68425364 (#145304)
Summary: Back out change from #145166 which causes an internal model to fail.

Differential Revision: D68459095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145304
Approved by: https://github.com/izaitsevfb
2025-01-22 23:33:02 +00:00
e6a84be3d3 [PyTorch] Add backend aot_eager_decomp_partition_with_mode (#143250)
Summary:
## Why
To make it possible to run torch dispatch mode inside compiled modules. This is to enable running MemoryTrackerMode (in next diff) to collect memory usage of compiled modules.

## What
Add a backend aot_eager_decomp_partition_with_mode.
Add an enable_log to the backend to control the compilation logging (which can be very verbose and slow the run of mode)

Test Plan:
unittest

E2e tested in the next diff which shows the memory read from the mode passed to this backend is very close to the actual job's memory snapshot.

Differential Revision: D67227144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143250
Approved by: https://github.com/bdhirsh
2025-01-22 23:20:59 +00:00
f0a210bf5d Revert "Output of nonzero is transposed, fix fake tensor (#144695)"
This reverts commit 693d8c7e945cc494bd31ad694ae4f4b6ff13b82a.

Reverted https://github.com/pytorch/pytorch/pull/144695 on behalf of https://github.com/izaitsevfb due to breaking internal tests, see D68461259 ([comment](https://github.com/pytorch/pytorch/pull/144695#issuecomment-2608443589))
2025-01-22 23:04:50 +00:00
de945d78da [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee
2025-01-22 22:42:48 +00:00
6e53588789 Revert "[BE]: Simplify set add with set update (#145152)"
This reverts commit 0cb9b2284a31fa497d684dbc2f56398c1d1e3114.

Reverted https://github.com/pytorch/pytorch/pull/145152 on behalf of https://github.com/davidberard98 due to land race with https://github.com/pytorch/pytorch/pull/145165 broke lint ([comment](https://github.com/pytorch/pytorch/pull/145152#issuecomment-2608378172))
2025-01-22 22:14:26 +00:00
dddf52b1b9 Revert "Enable grep_linter to use -a (#144589)"
This reverts commit 3c55669b8814237e018a613a494564da5bea9f15.

Reverted https://github.com/pytorch/pytorch/pull/144589 on behalf of https://github.com/clee2000 due to the line parameter is kind of important and -a is not as important as I thought it was so I'm going to revert this ([comment](https://github.com/pytorch/pytorch/pull/144589#issuecomment-2608349155))
2025-01-22 21:55:27 +00:00
082c28c3c6 [compiled autograd] support Tensor Subclasses in AOTBackward (#144115)
Compiled autograd's initial trace traces through the AOTBackward
epilogue. The Tensor Subclass code is not traceable. This PR changes it
so that when we see Tensor Subclass constructors, we proxy nodes for
their construction into the graph.

Test Plan:
- New basic test with TwoTensor
- Existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144115
Approved by: https://github.com/jansel, https://github.com/xmfan, https://github.com/bdhirsh
ghstack dependencies: #143296, #143304, #143387, #143405, #143417
2025-01-22 21:51:07 +00:00
99dd1bf1b9 [compiled autograd] stop specializing on metadata during initial trace (#143417)
The previous PRs built up to this. We change compiled autograd's initial
trace to stop baking in metadata.

While tracing, we allocate some weirdly shaped tensors that we can put
proxies on. The initial trace should not be accessing any metadata of
these tensors (it will likely error out if it does because of how weird
the shapes are).

This involved fixing some various sites where we do specialize on the
metadata, like:
- we change CopySlices's apply_with_saved to proxy some calls
  into the graph (this change is fairly hard to split out by itself).
- we stop calling InputBuffer::add
- we delete the weird metadata from the graph so that no graph passes
  can make use of it.

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143417
Approved by: https://github.com/jansel, https://github.com/xmfan
ghstack dependencies: #143296, #143304, #143387, #143405
2025-01-22 21:51:07 +00:00
ec820fe57c [compiled autograd] Always proxy autograd.Function nodes; handle AOT backwards (#143405)
We will always proxy autograd.Function nodes in compiled autograd's
initial graph capture (previously there was an
option to proxy vs trace into the autograd.Function)

We have some requirements for the AOTBackward. Compiled Autograd runs
accumulate grad reordering passes on the AOTBackward graph directly
after the initial graph capture, so we can't just proxy a single node for it.

Instead, we:
- proxy the AOTBackward prologue function into the CA graph
- copy-paste the AOTBackward graph into the CA graph
- trace directly through the epilogue (the traced nodes go into the CA
  graph).

Tracing through the epilogue is safe (assuming no Tensor subclasses)
because the only thing the epilogue does is drop some outputs. The
Tensor subclass situation was already broken so this doesn't regress
anything but this PR sets it up to be fixed (in a followup, where we
will proxy "make_subclass" calls into the graph from the epilogue).

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143405
Approved by: https://github.com/jansel, https://github.com/xmfan
ghstack dependencies: #143296, #143304, #143387
2025-01-22 21:50:56 +00:00
784bb2127c [compiled autograd] Proxy nodes for user-defined C++ torch::autograd::Function (#143387)
We define a functional version of a C++ torch::autograd::Function. The
functional version reconstructs the ctx object and then calls
backward with it.

Some more details:
- we define how to pack/unpack ctx.saved_data into an IValue. It's a
  Dict[str, IValue], so it wasn't difficult.
- every call to CppNode::apply_with_saved binds a new function to
  Python. This is because we're unable to reuse a previously bound
  function (the schema may change depending on what the user
  actually puts into their Dict[str, IValue]).

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143387
Approved by: https://github.com/jansel, https://github.com/xmfan
ghstack dependencies: #143296, #143304
2025-01-22 21:50:47 +00:00
8c7c5f7bfc [compiled autograd] Proxy a node for CopyBackwards into the graph (#143304)
CopyBackwards is a manual C++ torch::autograd::Node; we update its
apply_with_saved to proxy a functional version of it into the graph instead
of inlining into it.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143304
Approved by: https://github.com/xmfan, https://github.com/jansel
ghstack dependencies: #143296
2025-01-22 21:50:37 +00:00
5531fafffe [compiled autograd] Proxy opaque nodes for built-in autograd nodes (#143296)
This PR is on the way to getting compiled autograd's initial capture to
stop specializing on Tensor metadata.

This PR changes compiled autograd's initial capture to proxy an opaque
(w.r.t. Dynamo) function into the graph for all built-in codegen'ed
autograd nodes and validate_outputs.

We changed each codegen'ed apply_with_saved (e.g.
MulBackward0::apply_with_saved) to call into Python to proxy a function
(compiled_autograd.ops.MulBackward0) into the graph. Then, we use the
node's InputMetadata to "guess" at the properties of the output Tensors
to create some new FakeTensors.

Some details:
- MulBackward0::apply_with_saved lives in libtorch_cpu, but needs to
  call into Python via libtorch_python. There is an indirection
  (PyCompilerInterface) to do this.
- MulBackward0::apply_with_saved passes a C++ function to Python. To make
  our lives easier, every codegen'ed apply_with_saved passes a C++
  function with the same signature
  `(variable_list, ivalue_list) -> variable_list`.
- We define how to pack arbitrary C++ types into IValue via a helper
  IValuePacker struct and codegen functional variants of each builtin
  C++ autograd node (e.g. MulBackward0_apply_functional_ivalue).

MulBackward0 before this PR:
https://gist.github.com/zou3519/a80381d5fa38e970e413fcd91b0530de

MulBackward0 after this PR:
https://gist.github.com/zou3519/0c2eee8b3d8d96232b51ef430b53c5b0

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143296
Approved by: https://github.com/jansel
2025-01-22 21:50:29 +00:00
0cb9b2284a [BE]: Simplify set add with set update (#145152)
Simplifies the set update slightly to be more readable and efficient.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145152
Approved by: https://github.com/XuehaiPan, https://github.com/albanD

Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
2025-01-22 21:31:13 +00:00
9f150786bb [dynamo] Fix numpy test accuracy error induced by randomness divergence (#145293)
Previously `TestGradient.test_second_order_accurate` was failing because
of a small tolerance error (0.03... which is above the 0.03 tolerance).

Upon investigating, `np.random.random` caused some divergence between
eager and compiled randomness because in compiled we are not using
`np.random`'s random seed, rather we end up using `torch`'s. This in
turn caused numerical divergence and aforementioned accuracy issue.

This patch fixes the failure by patching the test case with
`use_numpy_random_stream=True`, which forces a graph break on
`np.random.random()` and thereby falling back to eager to ensure
consistency of the numpy randomness.

Fixes #116746.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145293
Approved by: https://github.com/lezcano
2025-01-22 20:53:02 +00:00
2efa98d69d Binary upload checksum (#144887)
Equivalent to https://github.com/pytorch/test-infra/pull/6172 but for pytorch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144887
Approved by: https://github.com/atalman
2025-01-22 20:46:04 +00:00
a57133e3c7 [NVIDIA] Jetson Thor Blackwell Support codegen (#145395)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145395
Approved by: https://github.com/eqy, https://github.com/malfet
2025-01-22 20:13:19 +00:00
0940eb6d44 Reverting the PR adding Kleidiai-based int4 kernels (#145392)
Mitigation for https://github.com/pytorch/pytorch/issues/145273
Reverting https://github.com/pytorch/pytorch/pull/134124 and https://github.com/pytorch/pytorch/pull/144074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145392
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/atalman, https://github.com/digantdesai
2025-01-22 20:11:49 +00:00
95ff9f0340 [Doc] Add period at the end of the sentence (#145384)
Test plan: https://docs-preview.pytorch.org/pytorch/pytorch/145384/generated/torch.compiler.disable.html#torch-compiler-disable
Fixes https://github.com/pytorch/pytorch/issues/145365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145384
Approved by: https://github.com/huydhn, https://github.com/svekars, https://github.com/kit1980
2025-01-22 19:56:31 +00:00
3917053f63 [audio hash update] update the pinned audio hash (#145328)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145328
Approved by: https://github.com/pytorchbot
2025-01-22 19:39:03 +00:00
70ccbade83 [MPSInductor] Add gamma op (#145341)
By moving `gamma` and `log_gamma` implementation from `Gamma.metal` to `c10/metal/special_math.h`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145341
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #145309
2025-01-22 19:37:45 +00:00
b81209557b Fix tests broken by #145176 (#145393)
#145176 broke
test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_graph_break_on_jit_isinstance_dynamic_shapes
test/dynamo/test_repros.py::ReproTests::test_graph_break_on_jit_isinstance

this backs out the offending change until it can be fixed properly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145393
Approved by: https://github.com/ZainRizvi
2025-01-22 19:33:16 +00:00
e8e3c03f96 [Test][Inductor] Fix test_tma_graph_breaks (#145271)
Per title. Before these changes, below tests:
```
test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_False_after_create_desc_False
test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_False_after_create_desc_True
test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_True_after_create_desc_False
test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_True_after_create_desc_True
```

fail with the following message:
```
__________________________________________________________________ KernelTests.test_tma_graph_breaks_after_data_ptr_True_after_create_desc_True ___________________________________________________________________
Traceback (most recent call last):
  File "/usr/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/usr/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/testing/_internal/common_utils.py", line 3114, in wrapper
    method(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/testing/_internal/common_utils.py", line 557, in instantiated_test
    test(self, **param_kwargs)
  File "~/git/pytorch/test/inductor/test_triton_kernels.py", line 1760, in test_tma_graph_breaks
    eager_out = f(a, b)
                ^^^^^^^
  File "~/git/pytorch/test/inductor/test_triton_kernels.py", line 1740, in f
    t.element_size(),
    ^
UnboundLocalError: cannot access local variable 't' where it is not associated with a value

To execute this test, run the following from the base repo dir:
    python test/inductor/test_triton_kernels.py KernelTests.test_tma_graph_breaks_after_data_ptr_True_after_create_desc_True

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145271
Approved by: https://github.com/jansel
2025-01-22 19:18:59 +00:00
ac8ddf1150 [export][be] Clean up local imports from export [1/n] (#145287)
Summary: as title

Test Plan: CI

Differential Revision: D68449844

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145287
Approved by: https://github.com/pianpwk
2025-01-22 19:09:17 +00:00
30717d25fe Move Dynamo test to skip from expected_failures (#145390)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/116105
This test is consistently failing. It shouldn't be marked as a flaky
test in the CI using the disabled tests mechanism. I'm skipping the test for now.

Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145390
Approved by: https://github.com/williamwen42
2025-01-22 19:06:39 +00:00
0bff377880 Align CPU behavior with CUDA for ConvTranspose when out_channels=0 (#142859)
Fixes https://github.com/pytorch/pytorch/issues/142466.
Remove the `weight.numel() != 0` check to align the behavior with CUDA for `ConvTranspose` when `out_channels=0`. After removing this check, the existing code is already able to give an empty output in such case.

Test plan:
```
python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cpu_float32
python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cuda_float32
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142859
Approved by: https://github.com/mingfeima, https://github.com/malfet
2025-01-22 17:52:53 +00:00
698106951e [dynamo] Re-enable test_fs family for dynamo (#145302)
Fixes #91467.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145302
Approved by: https://github.com/zou3519
2025-01-22 17:50:05 +00:00
057d9aff39 [S481486] [MTIA] Correct mtia.device_count() API (#145338)
Summary:
Prev: Count the number of "general" accelerators

Curr: Count the number of MTIA devices by using the MTIA runtime API

Test Plan:
```
buck test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r  test_get_device_count
```

https://www.internalfb.com/intern/testinfra/testrun/8162774572631995

Reviewed By: BoyueZheng

Differential Revision: D68472668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145338
Approved by: https://github.com/BoyueZheng, https://github.com/egienvalue
2025-01-22 17:45:15 +00:00
c27dd9cf72 Fix deprecated pytorch_sphinx_theme editable installation (#145347)
Fixes https://github.com/pytorch/pytorch/issues/145221

Pip editable install is going to be deprecated soon https://github.com/pypa/pip/issues/11457.  The fix here is just to remove it and install `pytorch_sphinx_theme` normally.

### Testing

Doc build is working with the change:

* PR https://github.com/pytorch/pytorch/actions/runs/12901499736/job/35975042345?pr=145347
* Nightly https://github.com/pytorch/pytorch/actions/runs/12901500521/job/35975046289
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145347
Approved by: https://github.com/ZainRizvi
2025-01-22 17:28:16 +00:00
288f21cc11 [MPS][BE] Prepare Gamma funcs to be moved to headers (#145309)
----
- Use `float y = 1.0 + metal::frac(x)` instead of complex
```metal
float y = x;
int n = 0;
bool less_than_one = (y < 1.0);
// Add or subtract integers as necessary to bring y into (1,2)
if (less_than_one) {
  y += 1.0;
} else {
  n = static_cast<int>(floor(y)) - 1;
  y -= n;
}
```
- Declare them all as templates, to avoid instantiation
- Move global arrays to be local to the specific functions
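
The equivalence of the two range reductions can be checked with a short Python sketch. It assumes positive inputs and models the fractional part as `x - floor(x)`; the function names `reduce_branchy` and `reduce_frac` are made up for this illustration and do not appear in the Metal sources.

```python
import math


def reduce_branchy(x):
    """Original form: add or subtract integers to bring y into [1, 2)."""
    y = x
    if y < 1.0:
        y += 1.0
    else:
        n = math.floor(y) - 1
        y -= n
    return y


def reduce_frac(x):
    """Simplified form: 1 + fractional part of x (assumed to be x - floor(x))."""
    return 1.0 + (x - math.floor(x))


# Positive sample inputs only; negative inputs are outside this sketch.
samples = [0.25, 0.5, 1.0, 1.75, 2.5, 10.125, 170.5]
pairs = [(reduce_branchy(x), reduce_frac(x)) for x in samples]
for x, (old, new) in zip(samples, pairs):
    print(f"x={x}: branchy={old}, frac={new}")
```

For positive `x` both forms land in the same interval and agree, so the single expression can replace the branch.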
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145309
Approved by: https://github.com/dcci
2025-01-22 16:14:06 +00:00
c2b401933f [torchbench] Fix mobilenetv2 inductor freezing fail_accuracy (#145296)
Issue: https://github.com/pytorch/pytorch/issues/144891

inductor freezing effectively enables inductor conv-batchnorm fusion. This fusion increases the accuracy error.

More context about this: https://github.com/pytorch/pytorch/issues/120545
For Timm models that are run through benchmarks/dynamo/timm_models.py with TimmRunner the tolerance was increased here:
https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/timm_models.py#L367

If conv-batchnorm fusion is commented out, as Elias suggested in the context issue, the accuracy is restored.

=>
Increase the tolerance for mobilenetv2 to the same value by introducing a special tolerance configuration that applies to freezing only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145296
Approved by: https://github.com/eellison, https://github.com/zou3519
2025-01-22 15:54:09 +00:00
0dbff7e4be Add MKLDNN support for Half GELU (#145339)
Add MKLDNN support for Half GELU to align with BFloat16.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145339
Approved by: https://github.com/yanbing-j, https://github.com/leslie-fang-intel, https://github.com/Skylion007
2025-01-22 15:14:51 +00:00
0efa843392 Dynamic shape guards in C++ (#139899)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139899
Approved by: https://github.com/anijain2305, https://github.com/albanD, https://github.com/jansel
ghstack dependencies: #143385, #143164
2025-01-22 14:58:35 +00:00
fbaef0ac03 Add a language option for symbolic shape guards (#143164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143164
Approved by: https://github.com/ezyang
ghstack dependencies: #143385
2025-01-22 14:58:35 +00:00
4b77ff9784 Fix PythonMod printing for C++ (#143385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143385
Approved by: https://github.com/leslie-fang-intel, https://github.com/anijain2305
2025-01-22 14:58:35 +00:00
079a3e0f75 [BE] Add type annotations to cudagraph_utils.py and test_cases.py (#145291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145291
Approved by: https://github.com/Skylion007
2025-01-22 14:54:45 +00:00
31c2f36989 Fix triton masked loading for non-block tl.loads (#144782)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144782
Approved by: https://github.com/eellison
2025-01-22 14:30:56 +00:00
3cbc8c54fd [BE][export] Remove disabled floordiv test in export (#145292)
Summary:
Removing `test_slice_with_floordiv` as it doesn't raise the Runtime Error as expected and it has been disabled since the time it was added https://github.com/pytorch/pytorch/issues/131101

For the case that we expect to fail, it actually returns an empty tensor. This is consistent with the following snippet which prints an empty tensor

```
import torch

a = torch.ones(4)
print(a[5:])  # out-of-range slice yields an empty tensor, not an error
```

Test Plan: CI

Differential Revision: D68450650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145292
Approved by: https://github.com/pianpwk
2025-01-22 05:17:56 +00:00
99dbc5b0e2 PEP585 update - test (#145176)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145176
Approved by: https://github.com/bobrenjc93
2025-01-22 04:48:28 +00:00
40e27fbcf2 Refactor CPUReproTests to be more vector-length agnostic (#141245)
This changes the hardcoded assumptions of a `256-bit` vector length to querying from `cpu_vec_isa` and changes relevant tests to share the logic.

Also refactored the `config.cpp.simdlen != 1` into the assertion so we stop duplicating it throughout the test cases.

Fixes issues on `128-bit` machines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141245
Approved by: https://github.com/desertfire, https://github.com/malfet
2025-01-22 04:24:45 +00:00
dcd9de79e7 [dynamo] Re-enable an AOT-Dispatch test with Dynamo (#145299)
Fixes #124590.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145299
Approved by: https://github.com/zou3519
2025-01-22 03:47:05 +00:00
3a58512613 [Inductor] inplace padding (#140249)
https://github.com/pytorch/pytorch/issues/139865

This PR may change the semantics of constant_pad_nd from 'clone' to 'view'. I tried a few tests that do inplace updates. Thanks to functionalization, this appears to work fine.

Perf for `test_linear_and_cel`:
```
# TORCHINDUCTOR_INPLACE_PADDING=0 DO_PERF_TEST=1 python test/inductor/test_inplace_padding.py -k test_linear_and_cel
inductor_config.inplace_padding=False ms=83.311

# TORCHINDUCTOR_INPLACE_PADDING=1 DO_PERF_TEST=1 python test/inductor/test_inplace_padding.py -k test_linear_and_cel
inductor_config.inplace_padding=True ms=79.827
```

The saving is about 4ms (slightly less since we need to fill 0 for the padding area). Similar savings for llm.c.
- Without the feature: 182.151ms per batch, 180.9K tokens/s
- With the feature: 178.278ms per batch, 183.9K tokens/s, an increase of 3K tokens/s.
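
As a quick check of the figures above, the absolute and relative savings can be recomputed from the quoted measurements. The dictionary and function names are only for this illustration; the numbers are copied from the text.

```python
# Numbers copied from the measurements quoted above.
linear_cel_ms = {"off": 83.311, "on": 79.827}
llmc_ms = {"off": 182.151, "on": 178.278}
llmc_tokens_per_s = {"off": 180.9e3, "on": 183.9e3}


def saving(before, after):
    """Return (absolute saving, relative saving in percent)."""
    delta = before - after
    return delta, 100.0 * delta / before


linear_delta, linear_pct = saving(linear_cel_ms["off"], linear_cel_ms["on"])
llmc_delta, llmc_pct = saving(llmc_ms["off"], llmc_ms["on"])
token_gain = llmc_tokens_per_s["on"] - llmc_tokens_per_s["off"]

print(f"test_linear_and_cel: {linear_delta:.3f} ms saved ({linear_pct:.1f}%)")
print(f"llm.c: {llmc_delta:.3f} ms saved per batch ({llmc_pct:.1f}%)")
print(f"llm.c throughput gain: {token_gain:.0f} tokens/s")
```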

Perf test shows compilation time regression. I'm not sure if that's real. Will debug more. But a good thing is, there is no accuracy failure: [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Nov%202024%2020%3A23%3A22%20GMT&stopTime=Mon%2C%2011%20Nov%202024%2020%3A23%3A22%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/shunting314/186/head&lCommit=03fd924ff382958daf5055dc8425d279e4e10a1e&rBranch=main&rCommit=c03324de2dfbbf0006818c86b88c92a3378f46b7).

UPDATE: Perf test regression seems to be not real. Here is a rerun [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2007%20Nov%202024%2001%3A29%3A55%20GMT&stopTime=Thu%2C%2021%20Nov%202024%2001%3A29%3A55%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/shunting314/186/head&lCommit=7e2c8e5d9256ac06205e7cd5e740c9e20ce804d0&rBranch=main&rCommit=565a7942eee1ddc23067cdbae597443d0f2290a0). Our dashboard is not that reliable recently due to AWS migration.

Differential Revision: [D68340248](https://our.internmc.facebook.com/intern/diff/D68340248)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140249
Approved by: https://github.com/jansel, https://github.com/eellison
2025-01-22 03:37:06 +00:00
46851022ff [Inductor][CPU] Add auto-tuning support for da8w8 sym act sym wgt GEMM (#143187)
## Summary

Templated `int8xint8->int32` GEMM that uses AMX ISA (present on Intel Xeon Gen 4 & above). Any epilogues such as weight scale, activation scale, and bias are applied per output block in a fused manner.
Performs well for large values of `M` dimension (assuming canonical dimensions [`M, K`] and [`K, N`] for the activation & weight matrices'/tensors' sizes) when the activation is quantized per-token.
Also supports SmoothQuant GEMM pattern when activation is quantized per-tensor (scalar scale) or per-token (vector scale is applied as an epilogue in this case).
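
The arithmetic of such a fused GEMM can be sketched in plain Python. This is a reference-style illustration under stated assumptions, not the generated kernel: it works per output element rather than per output block, uses no AMX, and assumes a per-token activation scale and a per-output-channel weight scale. The function name `int8_gemm_fused` is hypothetical.

```python
def int8_gemm_fused(a_q, b_q, a_scale, w_scale, bias):
    """int8 x int8 -> int32 accumulate, then apply scales and bias.

    a_q: M x K int8 activations, one scale per row (per token).
    b_q: K x N int8 weights, one scale per column (assumed per channel).
    """
    m, k, n = len(a_q), len(b_q), len(b_q[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0  # plays the role of the int32 accumulator
            for p in range(k):
                acc += a_q[i][p] * b_q[p][j]
            # Epilogue: dequantize and add bias right after accumulation.
            out[i][j] = acc * a_scale[i] * w_scale[j] + bias[j]
    return out


a_q = [[1, -2], [3, 4]]   # symmetric quantization, so no zero points
b_q = [[5, 6], [7, -8]]
result = int8_gemm_fused(a_q, b_q, [0.5, 0.25], [2.0, 1.0], [1.0, -1.0])
print(result)
```

Applying the scales in the same loop that produces the accumulator is what "fused" refers to here: no separate pass over the int32 output is needed.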

Also increased coverage of GEMM template for uint8 activation, int8 weight GEMM UTs for when the activation zero point is a 1D tensor (the existing implementation only accepted 0D tensors). However, some of these UTs would have to be explicitly enabled with the `max-autotune` Inductor config.

## Performance data

The templated codegened fused GEMM with M=32, K=4096, N=14336 used in LLaMA3 exhibits more than 2x perf-gain compared to oneDNN qlinear + mul (for activation's scale) with 48 cores of one socket of Xeon SP 4th gen Platinum 8468 when per-token quantization is used.

For M=1, K=4096, N=14336, regardless of whether per-tensor quantization was used for activation or per-token, the perf gain was more than 3x.

Intel OpenMP & libtcmalloc had been preloaded. All cores used by the workload corresponded to distinct physical cores.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143187
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel, https://github.com/jgong5

Co-authored-by: Leslie Fang <leslie.fang@intel.com>
2025-01-22 02:27:53 +00:00
27598cd154 [fx] move DCE rand check to import time (#145118)
Mitigates the deterministic benchmark regression: https://github.com/pytorch/pytorch/issues/144775#issuecomment-2593411844, and maybe the dashboard issue.

fx.Node.is_impure is unexpectedly a hot spot. It gets called for every node in the graph whenever we invoke DCE, which should be okay, EXCEPT we invoke DCE on the full graph ~10 times at various stages of torch.compile, and an insane number of times (>O(parameters)) for the subgraphs traced by the pattern matcher.

I considered addressing this problem by reducing the number of times DCE is called, but I think we can only trim the calls from the pattern matcher, which would require a refactor or caching solution that I leave out of this PR.

torch.Tag.nondeterministic_seeded is provided by native_functions.yaml and is implemented as a list. Most of the time, it has <=2 elements, so it's not really worth it to turn it into a set for fast lookup.

Using the deterministic instruction count benchmarks
```python
# before
aotdispatcher_partitioner_cpu,compile_time_instruction_count,8914894946
aotdispatcher_partitioner_cpu,compile_time_instruction_count,8866669058
# after
aotdispatcher_partitioner_cpu,compile_time_instruction_count,8770562314
aotdispatcher_partitioner_cpu,compile_time_instruction_count,8779547794
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145118
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-01-22 02:23:02 +00:00
f2cfe8b59f PEP585 update - mostly toplevels (#145178)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145178
Approved by: https://github.com/bobrenjc93
2025-01-22 02:21:14 +00:00
1ce533867f Teach dynamo to handle GenericAlias without a graph break (#145240)
Dynamo wasn't handling the new PEP585 type annotations:
```
x = list[Foo]
```
Although this worked in py3.9 this was causing an `unimplemented` (Unexpected type in sourceless builder) in py3.12.

This fixes it to treat them as a BuiltinVariable.
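
For context, the snippet below shows what `list[Foo]` is at the plain Python level (3.9+). It uses only the standard library and says nothing about how Dynamo models the object internally; `Foo` is a placeholder class.

```python
import types
import typing


class Foo:
    pass


x = list[Foo]

# PEP 585 subscription of a builtin produces a types.GenericAlias object.
is_generic_alias = isinstance(x, types.GenericAlias)

# The alias remembers the builtin it came from and its type arguments.
origin = typing.get_origin(x)
args = typing.get_args(x)

# Calling the alias constructs an instance of the origin type.
made = x([1, 2])
```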

Fixes #145226

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145240
Approved by: https://github.com/anijain2305
2025-01-22 01:55:51 +00:00
2d1649bc2a Revert "[triton] Update triton pin to include warp specialization support (#145120)"
This reverts commit e261629dc85c061ee35f539ee8bd35aec9971215.

Reverted https://github.com/pytorch/pytorch/pull/145120 on behalf of https://github.com/ZainRizvi due to Reverting since the test failures are about not being able to find a version of triton to install, and this is breaking trunk as well ([comment](https://github.com/pytorch/pytorch/pull/145120#issuecomment-2606107792))
2025-01-22 01:52:36 +00:00
f2d7fe12d8 [BE][MPS] Mark gamma inputs as const (#145295)
Doubt it will change the perf, but it's good to correctly mark const inputs as const
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145295
Approved by: https://github.com/manuelcandales
ghstack dependencies: #145289
2025-01-22 01:00:53 +00:00
c106e9b4c6 [BE][MPS] Move Gamma kernels to its own file (#145289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145289
Approved by: https://github.com/manuelcandales, https://github.com/dcci
2025-01-22 01:00:53 +00:00
1908116ace [MPS][BE] Move vectypes from Quantized to utils (#145312)
That allows one to get appropriate vectorized types for templates using `c10::metal::vec2type_t<>` or `c10::metal::vec4type_t<>`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145312
Approved by: https://github.com/dcci
2025-01-22 00:37:28 +00:00
266fd35c58 Fix ExecuTorch, XLA, Triton hash updates (#145314)
Fix some stale hash updates https://github.com/pytorch/pytorch/pulls/pytorchupdatebot reported by @izaitsevfb

* XLA and ExecuTorch now wait for all jobs in pull instead of hardcoding the job names which are not correct anymore and the bot waits forever there
* Triton commit hash hasn't been updated automatically since 2023 and people have been updating the pin manually with their own testing from time to time, so I doubt that it would be a useful thing to keep.

The vision update failures look more complex though and I would need to take a closer look.  So, I will keep it in another PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145314
Approved by: https://github.com/izaitsevfb
2025-01-21 23:24:21 +00:00
1e8d6d6f0e [SkipFiles] New modules added to torch.* are inlined by default (#145279)
This PR:
- makes it so that new modules added to torch are inlined by default
- adds a list of the previously "skipped by default" modules to avoid
  regressing anything. This is a new MOD_SKIPLIST list that is consulted
  in trace_rules.check_file.
- Follow-up work will go through this list, one-by-one, and try to delete
  modules. I think we should be able to delete almost everything,
  except for torch._dynamo.
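
The default-inline-with-skiplist behaviour listed above can be sketched as follows. This is a minimal illustration, not the real `trace_rules.check_file` logic: the entries in `MOD_SKIPLIST` and the helper `torch_module_is_inlined` are hypothetical.

```python
# Hypothetical entries; the real MOD_SKIPLIST contents are not shown here.
MOD_SKIPLIST = frozenset({"torch._dynamo", "torch.some_legacy_module"})


def torch_module_is_inlined(module_name: str) -> bool:
    """Return True if a torch.* module would be inlined under this sketch."""
    for skipped in MOD_SKIPLIST:
        # A skiplisted module also covers its submodules.
        if module_name == skipped or module_name.startswith(skipped + "."):
            return False
    # Anything else under torch.* is inlined by default.
    return module_name == "torch" or module_name.startswith("torch.")
```

Under this scheme a newly added `torch.*` module needs no registration to be inlined, and shrinking the skiplist is the only follow-up work.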

Test Plan
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145279
Approved by: https://github.com/yanboliang
2025-01-21 23:24:12 +00:00
e261629dc8 [triton] Update triton pin to include warp specialization support (#145120)
The warp specialization work has been landed to the triton rc/3.2.x branch as b2684bf3b0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145120
Approved by: https://github.com/bertmaher
2025-01-21 22:14:13 +00:00
19c3ba44a2 Use TORCH_CHECK instead of std::runtime_error in stack.h and ivalue.h (#145280)
TORCH_CHECK will preserve the stacktrace for when TORCH_CPP_SHOW_STACKTRACES=1, whereas std::runtime_error will not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145280
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-01-21 21:58:59 +00:00
7dd9d1f243 Update clickhouse-connect to 0.8.14 (#144915)
Corresponds to https://github.com/pytorch/test-infra/pull/6177

I only tested the slow test script but I also did testing on the new version with scripts in https://github.com/pytorch/test-infra/pull/6177
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144915
Approved by: https://github.com/huydhn
2025-01-21 21:43:18 +00:00
35f5668f7e [NVIDIA] RTX50 Blackwell Support codegen (#145270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145270
Approved by: https://github.com/ezyang
2025-01-21 21:10:05 +00:00
895659cb41 Revert "Fix RMSNorm epsilon value type for BF16 or FP16 (#142848)"
This reverts commit 07e23653cd9ef8cfda01773d94d9f76e5072528d.

Reverted https://github.com/pytorch/pytorch/pull/142848 on behalf of https://github.com/izaitsevfb due to breaking internal tests, see D68355212 ([comment](https://github.com/pytorch/pytorch/pull/142848#issuecomment-2605734067))
2025-01-21 21:04:45 +00:00
bac62341eb PEP585 update - torch/_inductor (#145198)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145198
Approved by: https://github.com/bobrenjc93
2025-01-21 21:04:33 +00:00
2f9d378f7b PEP585 update - torch/utils (#145201)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145201
Approved by: https://github.com/bobrenjc93
2025-01-21 21:04:10 +00:00
693d8c7e94 Output of nonzero is transposed, fix fake tensor (#144695)
Needs this companion executorch PR: https://github.com/pytorch/executorch/pull/7657

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144695
Approved by: https://github.com/bobrenjc93, https://github.com/albanD
2025-01-21 20:50:09 +00:00
323fb4dad0 Unconditionally exclude upper bound in all size oblivious tests (#144867)
I was thinking about https://github.com/pytorch/pytorch/pull/144471 some more and I thought, "Hmm, why not just always exclude the constant upper bound." So here it is.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144867
Approved by: https://github.com/bobrenjc93
2025-01-21 20:44:09 +00:00
df67ac4c86 [CI][CUDA][Distributed][FSDP] Remove hardcoded world size of 2 (#145195)
as these unit tests would fail if run on a single GPU (i.e. skip_if_lt_x_gpu(2) seems to view world size as 2 even on platforms with 1 GPU).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145195
Approved by: https://github.com/Skylion007, https://github.com/atalman
2025-01-21 20:25:52 +00:00
505ade7471 [inductor] Simplify mode options, only apply CompilerBisector changes once (#145232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145232
Approved by: https://github.com/yanboliang
2025-01-21 19:25:46 +00:00
85811631d7 [Intel CPU] Fix issue #143489. (#145062)
Fix issue in https://github.com/pytorch/pytorch/issues/143489.
kernel_height * kernel_weight will cause Floating point exception, so we will divide by them one by one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145062
Approved by: https://github.com/soulitzer
2025-01-21 18:38:33 +00:00
128f3627b1 Implement backward for NJT matmul (#144587)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

This PR implements missing backward support for NJT matmul. Notably, for dense tensors, matmul dispatches to bmm. However, due to historical reasons related to NST, NJT handles matmul directly, and thus can't rely on the CompositeImplicit impl of matmul to get the derivative formula.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144587
Approved by: https://github.com/soulitzer
ghstack dependencies: #144586
2025-01-21 18:27:50 +00:00
af204135d8 Fix NJT fill.Scalar for contiguous inputs (#144586)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

This PR implements the missing `fill.Scalar` support, which works fine for contiguous inputs, but there is still some AOTAutograd debugging required to handle non-contiguous transposed NJTs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144586
Approved by: https://github.com/soulitzer
2025-01-21 18:22:08 +00:00
efa88e04e1 Don't overspecialize float when propagating cache guards to ShapeEnv (#145078)
Fixes https://github.com/pytorch/pytorch/issues/142507

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145078
Approved by: https://github.com/Skylion007
2025-01-21 18:05:43 +00:00
b3e90c8c33 Add support for torch function on dtype arguments (#145085)
Along the lines of https://github.com/pytorch/pytorch/issues/119194 although it doesn't actually address the FCD case.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145085
Approved by: https://github.com/vmoens, https://github.com/Skylion007
2025-01-21 17:44:47 +00:00
eb553ae3cf Fix broken gpt_fast micro benchmark after #144315 (#145235)
The benchmark is failing with the following error

```
  File "/var/lib/jenkins/workspace/benchmarks/gpt_fast/benchmark.py", line 333, in <module>
    main(output_file=args.output, only_model=args.only)
  File "/var/lib/jenkins/workspace/benchmarks/gpt_fast/benchmark.py", line 308, in main
    lst = func(device)
  File "/var/lib/jenkins/workspace/benchmarks/gpt_fast/benchmark.py", line 66, in run_mlp_layer_norm_gelu
    us_per_iter = benchmarker.benchmark(compiled_mod, (x,)) * 1000
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/_inductor/runtime/benchmarking.py", line 39, in wrapper
    return fn(self, *args, **kwargs)
TypeError: benchmark() missing 1 required positional argument: 'fn_kwargs'
```

An example error is https://github.com/pytorch/pytorch/actions/runs/12862761823/job/35858912555

I also assign `oncall: pt2` as the owner of this job going forward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145235
Approved by: https://github.com/nmacchioni
2025-01-21 17:42:24 +00:00
2cffbff7da Add 3.13t Windows and MacOS binary builds (#141806)
Related to: https://github.com/pytorch/pytorch/issues/130249

For conda uses approach described here:
https://conda-forge.org/blog/2024/09/26/python-313/

Create Python 3.13t conda env like so:
```
conda create -n py313 python=3.13 python-freethreading  -c conda-forge
```

For windows executable installation we need to pass additional parameter to enable 3.13t:
```
Include_freethreaded=1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141806
Approved by: https://github.com/albanD
2025-01-21 17:16:19 +00:00
0afd335174 PEP585 update - torch/nn torch/optim torch/package torch/profiler torch/serialization torch/sparse torch/xpu (#145175)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145175
Approved by: https://github.com/bobrenjc93
2025-01-21 16:57:27 +00:00
803017f3cb [inductor] fix MA on poor gpu (#145133)
Found this bug when debugging a MA issue in CI that can not be repro-ed on devgpu.

On GPU with less than 68 SMs (like NVidia L4 used in CI), running torch compile in max-autotune mode may result in the following confusing error https://gist.github.com/shunting314/370f42f547e3367a3773237942725a86 complaining about layout:
```
torch._inductor.exc.InductorError: LoweringException: AssertionError: convert FlexibleLayout to FixedLayout first
```
The reason is, even if we don't pick the Triton template, Inductor still returns a MultiTemplateBuffer for tuned addmm. MultiTemplateBuffer.get_reads, called from Reduction.num_splits, may index a FlexibleLayout, which results in the aforementioned error.

The issue does not appear on devgpu because we freeze the layout of addmm inputs when rendering triton templates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145133
Approved by: https://github.com/jansel
2025-01-21 09:31:34 +00:00
b5655d9821 PEP585 update - .ci android aten (#145177)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145177
Approved by: https://github.com/Skylion007
2025-01-21 06:31:26 +00:00
00ffeca1b1 PEP585 update - torch/distributed (#145164)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-21 04:23:29 +00:00
c6986ca2e1 Revert "[dcp] Add ZStandard transformer (#143360)"
This reverts commit 7b56b039afe2b4a4038c09d8b6cb1597823f3d5f.

Reverted https://github.com/pytorch/pytorch/pull/143360 on behalf of https://github.com/atalman due to Broke 3.13t builds please test with ciflow/binaries label attached ([comment](https://github.com/pytorch/pytorch/pull/143360#issuecomment-2603433066))
2025-01-21 01:10:16 +00:00
5fd881a5b6 Revert "PEP585 update - torch/nn torch/optim torch/package torch/profiler torch/serialization torch/sparse torch/xpu (#145175)"
This reverts commit 54a00af2c6026a830f40d9e6a659ff81d51f9bc6.

Reverted https://github.com/pytorch/pytorch/pull/145175 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some trunk tests ([comment](https://github.com/pytorch/pytorch/pull/145175#issuecomment-2603418267))
2025-01-21 00:49:55 +00:00
dea7ad3371 PEP585 update - torch/testing (#145200)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145200
Approved by: https://github.com/bobrenjc93
2025-01-20 22:42:42 +00:00
805c4b597a PEP585 update - torch/_higher_order_ops torch/_subclasses torch/backends torch/compiler torch/cuda torch/masked torch/mtia torch/nested (#145202)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145202
Approved by: https://github.com/bobrenjc93
2025-01-20 22:37:26 +00:00
54a00af2c6 PEP585 update - torch/nn torch/optim torch/package torch/profiler torch/serialization torch/sparse torch/xpu (#145175)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145175
Approved by: https://github.com/bobrenjc93
2025-01-20 22:32:59 +00:00
bd97ce0b45 PEP585 update - torch/ao (#145199)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145199
Approved by: https://github.com/bobrenjc93
2025-01-20 22:32:35 +00:00
cf05f6a134 [BE]: Improve typing for torch/fx/_pytree.py and torch/utils/_pytree.py (#145173)
Improve type inference in _pytree.py utility functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145173
Approved by: https://github.com/bobrenjc93
2025-01-20 22:18:19 +00:00
225a10febe [CI] Add xpu linux build into pull workflow (#145084)
To mitigate the XPU build failure risk introduced by non-XPU specific PRs. Refer #144967 & #143803
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145084
Approved by: https://github.com/huydhn, https://github.com/atalman
2025-01-20 19:31:48 +00:00
d0100050dd [aoti] Deduplicate "V.aot_compilation" and "V.graph.aot_mode" flags. [2/n] (#145091)
Summary: Following up D68122536 to remove configurable aot_mode for inner_compile

Test Plan: CI

Reviewed By: desertfire

Differential Revision: D68158512

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145091
Approved by: https://github.com/ydwu4
2025-01-20 19:09:10 +00:00
0b2a3687b9 PEP585 update - torch/fx (#145166)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145166
Approved by: https://github.com/bobrenjc93
2025-01-20 18:11:54 +00:00
6374332d33 Revert "PEP585 update - torch/distributed (#145164)"
This reverts commit 6cb186e279bc179a6bb63f0226e24ab42a07b394.

Reverted https://github.com/pytorch/pytorch/pull/145164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing an inductor test ([comment](https://github.com/pytorch/pytorch/pull/145164#issuecomment-2602875679))
2025-01-20 16:46:46 +00:00
57b2b64acf Fix always true scaled_mm test (#143912)
Looks like `out_fp8` should use matmul without scales and `out_fp8_s` with
Scales were optional arguments before PR https://github.com/pytorch/pytorch/pull/128683
Then test_float8_scale started comparing two identical results and lost its meaning
Reason of making scales required https://github.com/pytorch/pytorch/pull/128683#issuecomment-2169146402

This PR uses scale=1.0 to compare result with scaled matmul

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143912
Approved by: https://github.com/drisspg, https://github.com/malfet, https://github.com/pruthvistony
2025-01-20 16:17:46 +00:00
53e2408015 Improve cleanup of cancelled jobs on s390x for tests too (#144968)
Follow up to https://github.com/pytorch/pytorch/pull/144149
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144968
Approved by: https://github.com/huydhn
2025-01-20 12:56:07 +00:00
92b9da1fc2 fix torch.atan for torch.complex datatypes on CPU (#144749)
Fix https://github.com/pytorch/pytorch/issues/141487.
This issue is caused by the lack of special handling of the case where the real or imaginary part is 0/Inf/NaN in the vectorized implementation of `atan`. For correctness, I temporarily fall back from the vectorized implementation of `atan` to the scalar implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144749
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
2025-01-20 08:45:03 +00:00
ed669a9db7 fix torch.div for torch.complex datatypes on CPU (#140375)
Fix https://github.com/pytorch/pytorch/issues/135428.
Fix https://github.com/pytorch/pytorch/issues/106845.
These two issues are caused by the lack of special handling of the case where the real or imaginary part is 0/Inf/NaN in the vectorized implementation of `div`. For correctness, I temporarily fall back from the vectorized implementation of `div` to the scalar implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140375
Approved by: https://github.com/mingfeima
2025-01-20 08:34:29 +00:00
c922ccb7c4 fix sigmoid for torch.complex datatypes on CPU (#140391)
Fix https://github.com/pytorch/pytorch/issues/135777.
This issue is caused by the lack of special handling of the case where the real or imaginary part is 0/Inf/NaN in the vectorized implementation of `reciprocal`. For correctness, I temporarily fall back from the vectorized implementation of `reciprocal` to the scalar implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140391
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
ghstack dependencies: #140358
2025-01-20 08:23:58 +00:00
507bf65c6a fix torch.exp for torch.complex datatypes on CPU (#140358)
Fix https://github.com/pytorch/pytorch/issues/48010, https://github.com/pytorch/pytorch/issues/136063.
These two issues are caused by the lack of special handling of the case where the real or imaginary part is 0/Inf/NaN in the vectorized implementation of `exp`. For correctness, I temporarily fall back from the vectorized implementation of `exp` to the scalar implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140358
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
2025-01-20 08:03:17 +00:00
972d4a154d Add facility to run dynamo UTs for non-cuda devices (#140929)
This is in line with changes introduced with https://github.com/pytorch/pytorch/pull/130714, additional files are included to support non-cuda devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140929
Approved by: https://github.com/kwen2501, https://github.com/EikanWang, https://github.com/guangyey
2025-01-20 05:56:38 +00:00
2b809e58ad PEP585 update - torch/onnx (#145174)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145174
Approved by: https://github.com/justinchuby
2025-01-20 05:48:52 +00:00
19584b28fd [dynamo][dicts] Consolidate dict(..) construction (#144342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144342
Approved by: https://github.com/StrongerXi
2025-01-20 04:42:06 +00:00
980c75fe6e [MPSInductor] Add TrueDiv and Round[Int|Decimal] (#145160)
That fixes `test_builtins_round_float_ndigits_neg` and `test_builtins_round`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145160
Approved by: https://github.com/jansel, https://github.com/dcci
2025-01-20 04:29:42 +00:00
6cb186e279 PEP585 update - torch/distributed (#145164)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-20 00:19:01 +00:00
b6c5562c1f PEP585 update - torch/export (#145165)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145165
Approved by: https://github.com/bobrenjc93
2025-01-19 20:56:55 +00:00
316808e4e9 PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145163
Approved by: https://github.com/Skylion007
2025-01-19 20:55:59 +00:00
c64e657632 PEP585 update - torch/distributed/fsdp (#145162)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145162
Approved by: https://github.com/bobrenjc93
2025-01-19 20:04:05 +00:00
371a361db9 Enable bfloat16 testing on MacOS14+ (#145159)
As Metal-3.1 supports this dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145159
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #145157
2025-01-19 19:35:31 +00:00
97d4d3c40a PEP585 update - torch/_export (#145138)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145138
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #145154
2025-01-19 18:48:35 +00:00
cd8d0fa20c Tweak schema_check to handle annotated builtin types (#145154)
As of python 3.9 annotated lists can be written as `list[T]` and `List[T]` has been deprecated.  However schema_check was converting `list[T]` to simply be `list`. This change teaches it to handle `list[T]` the same as `List[T]`.
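
A minimal sketch of treating `list[T]` and `List[T]` uniformly, using only the standard `typing` helpers. The function `describe` is a hypothetical illustration and not the schema_check code.

```python
import typing


def describe(annotation) -> str:
    """Render an annotation so builtin and typing generics look the same."""
    origin = typing.get_origin(annotation)
    if origin is None:
        # Plain, unparameterized type such as int or list.
        return getattr(annotation, "__name__", repr(annotation))
    # Recurse into the type arguments so nested generics are kept.
    args = ", ".join(describe(a) for a in typing.get_args(annotation))
    return f"{origin.__name__}[{args}]"
```

Both spellings resolve to the same origin (`list`), so the element type survives instead of being collapsed to a bare `list`.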

A couple small drive-by changes I noticed as well:
- Path concatenation should use `os.path.join`, not `+`
- Spelling in error message

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145154
Approved by: https://github.com/bobrenjc93
2025-01-19 18:48:35 +00:00
9e0437a04a PEP585 update - torch/ao/quantization (#145140)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145140
Approved by: https://github.com/bobrenjc93
2025-01-19 10:20:00 +00:00
78bff1e8c1 PEP585 update - torch/_functorch (#145139)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145139
Approved by: https://github.com/bobrenjc93
2025-01-19 07:06:10 +00:00
10e4d3aebb [DCP] Fix fsspec fsync bug on .finish() (#144753)
Fixes #144752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144753
Approved by: https://github.com/Skylion007, https://github.com/saumishr
2025-01-19 03:21:00 +00:00
8cc415774f [mps/inductor] Introduce a metal approx for erf() and use it. (#145161)
Probably we can do better, but this is a start.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145161
Approved by: https://github.com/malfet
2025-01-19 02:29:05 +00:00
893ca1dfe1 PEP585 update - torch/_inductor/[_-i]* (#145137)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145137
Approved by: https://github.com/bobrenjc93
2025-01-19 01:22:47 +00:00
cede43e06b [MPSInductor][BE] NaN-propagating min/max to header (#145157)
May be reused later from the eager op as well.

Also, didn't know that Metal already has type_traits.
Also use `metal::isunordered(a, b)` instead of `metal::isnan(a + b)`, as it is defined as a function equivalent to `a != a || b != b`, but I suspect it might have a better native implementation for the specific architecture.
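
The NaN-propagating behaviour can be illustrated in plain Python. This is a sketch of the idea only, not the Metal header; `is_unordered`, `nan_propagating_max`, and `nan_propagating_min` are names chosen for this example.

```python
import math


def is_unordered(a: float, b: float) -> bool:
    # NaN is the only float that compares unequal to itself.
    return a != a or b != b


def nan_propagating_max(a: float, b: float) -> float:
    # Return NaN if either operand is NaN, otherwise the larger value.
    if is_unordered(a, b):
        return math.nan
    return a if a > b else b


def nan_propagating_min(a: float, b: float) -> float:
    # Return NaN if either operand is NaN, otherwise the smaller value.
    if is_unordered(a, b):
        return math.nan
    return a if a < b else b
```

Python's builtin `max` shows why the explicit check matters: `max(1.0, math.nan)` returns `1.0` while `max(math.nan, 1.0)` returns NaN, so the result depends on argument order.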

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145157
Approved by: https://github.com/dcci
2025-01-18 22:52:44 +00:00
5b5766665d PEP585 update - torch/_C torch/_decomp torch/_lazy torch/_library torch/_numpy torch/_prims torch/_refs torch/_strobelight (#145102)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145102
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #145105
2025-01-18 20:47:12 +00:00
a79100ab11 PEP585 update - torch/_dynamo (#145105)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145105
Approved by: https://github.com/bobrenjc93
2025-01-18 20:47:11 +00:00
c95efc37ba PEP585 update - torch/distributed/tensor (#145141)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145141
Approved by: https://github.com/bobrenjc93
2025-01-18 20:01:59 +00:00
4f8237dbad [mps/inductor] Skip "double" tests as 64-bits FP is not supported. (#145123)
257 tests failed (before) -> 242 tests failed (after)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145123
Approved by: https://github.com/malfet, https://github.com/jansel

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-18 19:13:34 +00:00
5802be698e Revert "parametrized test name handles class arguments (#133546)"
This reverts commit 4e4b8592a32f701b4974679ab1381ba7cccd4844.

Reverted https://github.com/pytorch/pytorch/pull/133546 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but trying to disable the new tests doesn't seem to fully cover all the cases and some are still failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/133546#issuecomment-2599814339))
2025-01-18 18:12:18 +00:00
b63b81410c Fix NJT frexp() to handle both outputs (#144585)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

Before this PR, `frexp()` for NJT was handled via the unary pointwise fallback. The op returns a tuple, however, and the fallback doesn't handle that. This PR defines an explicit impl for `frexp()` that wraps both returned `(mantissa, exponent)` as NJTs.
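
The mismatch between a single-output fallback and a tuple-returning op can be shown with the standard library's `math.frexp`. The `Wrapped` class and both helpers are hypothetical stand-ins for illustration, not the NJT implementation.

```python
import math


class Wrapped:
    """Hypothetical wrapper holding a flat list of values."""

    def __init__(self, values):
        self.values = list(values)


def unary_fallback(fn, x: Wrapped) -> Wrapped:
    # Assumes fn yields exactly one output per element, so a tuple-returning
    # op ends up as one wrapper full of tuples.
    return Wrapped(fn(v) for v in x.values)


def frexp_wrapped(x: Wrapped):
    # Explicit handling: split the pairs and wrap each output separately.
    pairs = [math.frexp(v) for v in x.values]
    mantissa = Wrapped(m for m, _ in pairs)
    exponent = Wrapped(e for _, e in pairs)
    return mantissa, exponent
```

With the fallback, the caller receives one wrapper of `(mantissa, exponent)` tuples; the explicit version returns a tuple of two wrappers, matching the op's signature.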
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144585
Approved by: https://github.com/soulitzer
ghstack dependencies: #144582, #144583, #144584
2025-01-18 15:59:56 +00:00
3ee531f8b9 Support NJT chunk() backward on batch dim (#144584)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

Implements `chunk()` backward on the batch dim, which was left out before. This PR unbinds the components and invokes `copy_()` on these to pass along the appropriate gradients.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144584
Approved by: https://github.com/soulitzer
ghstack dependencies: #144582, #144583
2025-01-18 15:58:24 +00:00
8a57234033 [MPSInductor] Implement i0 and i1 ops (#145092)
Using shared definitions with eager op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145092
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #145023, #145087
2025-01-18 15:41:02 +00:00
1d9fc9df38 Downgrade ignored guard to info level (#145075)
Fixes https://github.com/pytorch/pytorch/issues/101265

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145075
Approved by: https://github.com/Skylion007
2025-01-18 15:30:01 +00:00
5e4cf3e6ad Moved .all() checks for distributions to _is_all_true (#145029)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145029
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2025-01-18 07:55:48 +00:00
2bf772d1ba PEP585 update - torch/_inductor/codegen (#145106)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145106
Approved by: https://github.com/bobrenjc93
2025-01-18 06:56:03 +00:00
4bf29f44b7 [aoti] Remove torch.ops.aten._assert_tensor_metadata.default in post_grad_pass (#145028)
Summary:
Remove torch.ops.aten._assert_tensor_metadata.default in post_grad_pass because this op is blocking fusion.

This should not have any effect on the result, because the op would not show up in the final aoti compiled model anyway (the assertion has no effect).

A real example where this improves performance:

In the example below, the post grad graph would contain `torch.ops.aten._assert_tensor_metadata.default`, because of PR  https://github.com/pytorch/pytorch/pull/142420. This op is added when functionalizing aten.to.

We want the `add` node from `linear` to be fused with the rest of the pointwise ops, instead of fused with the `mm` from `linear`.

```python
import torch
import torch.nn as nn


class Model(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(Model, self).__init__()
        self.linear = nn.Linear(input_dim, hidden_dim).half()
        self.rms_norm = nn.RMSNorm(hidden_dim)

    def forward(self, x):
        linear_458 = self.linear(x)  # Linear layer with weights
        # mimic the torchtune rms norm: /torchtune/torchtune/modules/rms_norm.py
        linear_458 = linear_458.to(torch.float32)
        rms_norm_34 = self.rms_norm(linear_458)  # RMS Normalization
        sigmoid_168 = torch.sigmoid(rms_norm_34)  # Sigmoid activation function
        mul_168 = sigmoid_168 * rms_norm_34  # Element-wise multiplication

        return mul_168

def main():
    with torch.no_grad():
        input_dim = 512
        hidden_dim = 256
        batch_size = 32
        model = Model(input_dim, hidden_dim).to("cuda")
        example_inputs = (
            torch.randn(batch_size, input_dim).to("cuda").to(torch.float16),
        )
        ep = torch.export.export(model, example_inputs)
        package_path = torch._inductor.aoti_compile_and_package(ep)


if __name__ == "__main__":
    main()
```

Test Plan:
CI

Differential Revision: D68303114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145028
Approved by: https://github.com/angelayi
2025-01-18 06:06:25 +00:00
dc9b77cc55 [MPS] Support includes in metal objects (#145087)
Useful for code reuse in Metal shader builds, both for eager mode and MPSInductor, but it requires implementing a `_cpp_embed_headers` tool that, as the name suggests, preprocesses the shader and embeds its headers so that the result can be used in dynamic compilation.
Test using:
 -  `TestMetalLibrary.test_metal_include`
 - Moving `i0`/`i1` implementation to `c10/util/metal_special_math.h` and call it from `SpecialOps.metal` shader, which now looks much more compact:
 ```metal
template <typename T, typename Tout = T>
void kernel
i0(constant T* input,
   device Tout* output,
   uint index [[thread_position_in_grid]]) {
  output[index] = c10::i0(static_cast<Tout>(input[index]));
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145087
Approved by: https://github.com/dcci
ghstack dependencies: #145023
2025-01-18 05:35:22 +00:00
2859b11bdb [pytorch/ncclx] Remove Alltoallv specialization for PTD all_to_all (#145045)
Summary:
PTD all_to_all uses a list of tensors, while ncclAllToAllv (provided
by NCCLX and RCCL) assumes that a single contiguous buffer is used.
These are fundamentally mismatched.  The list of tensors might not be
contiguous or even ordered (buffer addresses might not be in
increasing order).

This patch removes the ncclAllToAllv specialization for PTD
all_to_all, and instead lets it directly call ncclSend/ncclRecv.

Co-authored by @pavanbalaji
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145045
Approved by: https://github.com/pavanbalaji, https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/ezyang
2025-01-18 05:26:55 +00:00
07669ed960 PEP585 update - benchmarks tools torchgen (#145101)
This is one of a series of PRs to update us to PEP585 (changing Dict -> dict, List -> list, etc).  Most of the PRs were completely automated with RUFF as follows:

Since RUFF UP006 is considered an "unsafe" fix, we first need to enable unsafe fixes:

```
--- a/tools/linter/adapters/ruff_linter.py
+++ b/tools/linter/adapters/ruff_linter.py
@@ -313,6 +313,7 @@
                     "ruff",
                     "check",
                     "--fix-only",
+                    "--unsafe-fixes",
                     "--exit-zero",
                     *([f"--config={config}"] if config else []),
                     "--stdin-filename",
```

Then we need to tell RUFF to allow UP006 (as a final PR once all of these have landed this will be made permanent):

```
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -40,7 +40,7 @@

 [tool.ruff]
-target-version = "py38"
+target-version = "py39"
 line-length = 88
 src = ["caffe2", "torch", "torchgen", "functorch", "test"]

@@ -87,7 +87,6 @@
     "SIM116", # Disable Use a dictionary instead of consecutive `if` statements
     "SIM117",
     "SIM118",
-    "UP006", # keep-runtime-typing
     "UP007", # keep-runtime-typing
 ]
 select = [
```

Finally running `lintrunner -a --take RUFF` will fix up the deprecated uses.
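
As a rough illustration of what the UP006 fix does, the sketch below shows the same function before and after the rewrite. The function names are hypothetical and not taken from the PyTorch tree; the only point is that the `typing` aliases are replaced by the builtin generics that PEP 585 made subscriptable in Python 3.9.

```python
from typing import Dict, List  # legacy aliases, kept only for the "before" form


# Before: pre-PEP 585 annotations, the style that RUFF UP006 flags.
def count_tokens_old(tokens: List[str]) -> Dict[str, int]:
    counts: Dict[str, int] = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    return counts


# After: builtin generics, valid at runtime on Python 3.9+.
def count_tokens_new(tokens: list[str]) -> dict[str, int]:
    counts: dict[str, int] = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    return counts
```

Only the annotations change; the runtime behaviour of the two functions is identical, which is why the bulk of the series could be automated.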

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145101
Approved by: https://github.com/bobrenjc93
2025-01-18 05:05:07 +00:00
2c4281d7da Make MultiProcContinuousTest timeout configurable (#145099)
Allows test classes using MPCT to set their own timeout as a class
property, which is good enough since the processgroup is shared across
test instances and the timeout is set at processgroup init.

Also sets a default timeout of 2 minutes, which is probably (?) long
enough for reasonable tests, but can be changed if it causes flakiness.
It's preferable to have as short a default timeout as possible, since
getting a timeout quickly helps when debugging tests.
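
A minimal sketch of the class-property pattern described above. `ContinuousTestBase`, `init_process_group_kwargs`, and `SlowCollectiveTest` are hypothetical names chosen for illustration, not the actual MPCT implementation:

```python
from datetime import timedelta


class ContinuousTestBase:
    """Hypothetical stand-in for a test base class that shares one process group."""

    # Default of 2 minutes, matching the description above.
    timeout: timedelta = timedelta(seconds=120)

    @classmethod
    def init_process_group_kwargs(cls) -> dict:
        # The timeout is read once, at process-group init, so a
        # class-level attribute is enough to configure it.
        return {"timeout": cls.timeout}


class SlowCollectiveTest(ContinuousTestBase):
    # A test class that needs more headroom overrides the attribute.
    timeout = timedelta(minutes=10)
```

Because the process group is created once per class, a per-test override would have no effect in this scheme, which is why a class attribute is the natural granularity.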
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145099
Approved by: https://github.com/d4l3k, https://github.com/fduwjj
ghstack dependencies: #145010, #145011
2025-01-18 04:37:12 +00:00
bdfeda5c9a composability test cleanup (#145011)
minor changes to test public PP api instead of internal/private one and
also save a few lines of code for microbatch splitting in the process

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145011
Approved by: https://github.com/H-Huang, https://github.com/fduwjj
ghstack dependencies: #145010
2025-01-18 04:37:12 +00:00
4eea2f7496 [inductor] Fix ignored options for torch.compile (#145131)
#139833 broke `torch.compile(options=...)` so that many (all?) options passed in get completely ignored.  @alexreinking pointed this out when `options={"cpu_backend":"halide"}` did nothing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145131
Approved by: https://github.com/exclamaforte
2025-01-18 03:39:49 +00:00
668fb7dfba [ca] Use aot_eager on flex attention test (#145097)
FIXES https://github.com/pytorch/pytorch/issues/144912

The flex attention lowering incompatibilities are covered by https://github.com/pytorch/pytorch/blob/main/test/inductor/test_flex_attention.py. For the CA + flex integration, we don't actually need to test the lowering, only the frontend graph capture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145097
Approved by: https://github.com/drisspg
2025-01-18 02:47:13 +00:00
55084443ca Added swizzle searching, disabled fp16 accum, and enabled ping-pong for cutlass (#144829)
Summary:

Test Plan:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144829
Approved by: https://github.com/Chillee
2025-01-18 02:39:22 +00:00
2f51d06210 basic InductorBenchmarker (#133058)
This PR adds the most basic custom benchmarker (i.e. a benchmarker that is not provided by Triton), which we call `InductorBenchmarker`. This new benchmarker is very basic in principle and closely follows Triton's `do_bench` implementation, with slight changes: flushing the exact L2 cache size (Triton defaults to 256mb), using a buffer zero for warmup (Triton uses the benchmarked kernel itself; I found that buffer zeroes are more consistent), and returning the min runtime (Triton can return min, among other things; currently Inductor picks median).
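
The "return the min runtime" choice can be sketched with a CPU-only analogue. This is an assumption-laden illustration, not the `InductorBenchmarker` code: it uses wall-clock timing instead of GPU events, does not flush any cache, and warms up by calling the function itself rather than zeroing a buffer. `benchmark_min` is a hypothetical name.

```python
import time
from typing import Callable


def benchmark_min(fn: Callable[[], object], warmup: int = 3, repeats: int = 20) -> float:
    """Return the minimum wall-clock runtime of `fn` in milliseconds."""
    # Warmup runs are executed but not timed.
    for _ in range(warmup):
        fn()
    timings_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1e3)
    # The minimum is the sample least disturbed by system noise.
    return min(timings_ms)
```

Taking the minimum rather than the median favours the least noisy sample, at the cost of ignoring how often slower runs occur.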

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133058
Approved by: https://github.com/eellison
ghstack dependencies: #144315
2025-01-18 02:35:00 +00:00
ee3e89190a refactor benchmarking to use dynamo_timed (#144315)
use dynamo_timed for all our wrapped calls, instead of our custom timer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144315
Approved by: https://github.com/eellison
2025-01-18 02:35:00 +00:00
17c3a10cbd PEP585 update - torch/_inductor/fx_passes (#145107)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145107
Approved by: https://github.com/oulgen, https://github.com/bobrenjc93
2025-01-18 02:04:29 +00:00
8e4539245e Update ci_expected_accuracy for TIMM levit_128 for further investigation (#145112)
TSIA, it looks like an upstream change, but I'm not sure from where yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145112
Approved by: https://github.com/izaitsevfb, https://github.com/malfet
2025-01-18 01:55:34 +00:00
0b151f260f [AOTI] Add an option to skip optimizing generated wrapper code (#144866)
Summary: In some cases, generated wrapper code faces a long cpp compilation time. As an alleviation, this PR adds an option to skip cpp compiler optimizers for the generated main wrapper function body.

D68174038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144866
Approved by: https://github.com/chenyang78, https://github.com/hl475
2025-01-18 01:44:21 +00:00
7c1fb9b1ae [inductor] Refactor CachingAutotuner so that it can pickle (#144044)
These are refactors needed for #144288

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144044
Approved by: https://github.com/eellison
2025-01-18 01:44:16 +00:00
02385ed625 [Break XPU][Inductor UT] Fix broken XPU CI introduced by community changes (#145058)
As title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145058
Approved by: https://github.com/jansel
2025-01-18 01:30:24 +00:00
c434a64f31 Delete torch._library.register_functional_op (#145110)
Fixes #117816, #117834, #117871

This has been superseded by auto_functionalized_v2. There are no
internal usages and this is private API so it is safe to delete.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145110
Approved by: https://github.com/williamwen42
ghstack dependencies: #145109
2025-01-18 00:58:25 +00:00
712d9882d2 Skip test responsible for causing flakiness (#145109)
Investigation is a separate issue. For now I want to get the CI back up
and running on the other tests. The problem seems to be that
IncludeDispatchKeyGuard doesn't actually reset the state, which seems
very, very wrong.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145109
Approved by: https://github.com/williamwen42
2025-01-18 00:58:25 +00:00
c338dda6be fix test_rng bisector test (#143662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143662
Approved by: https://github.com/zou3519
2025-01-18 00:15:38 +00:00
d02c396fbb add fp8 support to index_cuda (#144747)
Fixes #133605

**Summary**

This PR adds support for FP8 data types to the `index_cuda` op.

It uses `AT_DISPATCH_V2` which is a new macro that can handle arbitrary number of dtypes, as opposed to the old implementations which had a separate macro for each possible number of dtype arguments (e.g. `AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND{2,3,4,5...}`).

**Test plan**

Updated test `index_cuda_with_cpu` in `test/test_fake_tensor.py` to have cases for all dtypes handled by `index_cuda`, including fp8 dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144747
Approved by: https://github.com/vkuzo
2025-01-17 22:53:23 +00:00
4e4b8592a3 parametrized test name handles class arguments (#133546)
Previously, parametrized tests with class arguments, for example

```
@parametrize("this_cls", (Foo, Bar))
```

would create parametrized tests with names `test_foo_this_cls0` and `test_foo_this_cls1`. With this change, we instead should get `test_foo_this_cls_Foo` and `test_foo_this_cls_Bar`
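
The naming rule can be sketched as follows. This is an illustrative reimplementation under the assumption that class-valued arguments contribute their `__name__` and everything else falls back to the positional index; `param_suffix` and `parametrized_names` are hypothetical helpers, not the functions in PyTorch's test utilities.

```python
def param_suffix(arg_name: str, value: object, index: int) -> str:
    """Build the per-parameter part of a generated test name."""
    if isinstance(value, type):
        # Class arguments get a readable, stable name.
        return f"{arg_name}_{value.__name__}"
    # Fallback: the positional index, as in the old behaviour.
    return f"{arg_name}{index}"


def parametrized_names(base: str, arg_name: str, values) -> list[str]:
    return [f"{base}_{param_suffix(arg_name, v, i)}" for i, v in enumerate(values)]


class Foo:
    pass


class Bar:
    pass
```

With these definitions, `parametrized_names("test_foo", "this_cls", (Foo, Bar))` yields `test_foo_this_cls_Foo` and `test_foo_this_cls_Bar`.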

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133546
Approved by: https://github.com/eellison
2025-01-17 22:48:38 +00:00
64e54d5af6 [Pipelining] Relax scale_grads assert (#145010)
The assert felt morally valid: if no gradients are scaled, then something
is definitely wrong with the setup.  In one instance, PP +
optimizer-in-backward (in torchtitan) resulted in grad=None after
running .backward() and before scaling grads.

On the other hand, the existing assert is too restrictive.  It's
possible that a model used with pipelining would have some parameters
that do not receive gradients, and we shouldn't hard-error in these
cases.  (E.g. if the parameter is literally not used, or is frozen).
In the extreme case, the whole stage could be frozen.  So we do not
complain if no grads are scaled.
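
A minimal sketch of the relaxed behaviour, using a stand-in parameter type instead of real tensors. `FakeParam` and this `scale_grads` signature are assumptions made for illustration and do not mirror the pipelining code:

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeParam:
    """Stand-in for a parameter; `grad` is None when no gradient was produced."""
    grad: Optional[float]


def scale_grads(params: Iterable[FakeParam], divisor: float) -> int:
    """Divide every available gradient by `divisor`; return how many were scaled."""
    scaled = 0
    for p in params:
        if p.grad is None:
            # Unused or frozen parameter: skip it instead of asserting.
            continue
        p.grad /= divisor
        scaled += 1
    # Zero scaled grads is allowed, e.g. when the whole stage is frozen.
    return scaled
```

Returning the count rather than asserting on it leaves the caller free to log or check it where a stricter policy is wanted.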

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145010
Approved by: https://github.com/mori360, https://github.com/tianyu-l
2025-01-17 21:33:28 +00:00
07e23653cd Fix RMSNorm epsilon value type for BF16 or FP16 (#142848)
Fixes #140092

Here's what this PR does:

Previously, we declared a `scalar_t eps_val;` variable, while `eps` is usually a double scalar passed in from the Python frontend, such as 1e-6.

When we assign `eps_val = std::numeric_limits<at::scalar_value_type<scalar_t>::type>::epsilon();` or `eps_val = eps.value();`, the epsilon is downcast to the input tensor's dtype (`scalar_t`); in the case of BFloat16, the 1e-6 double would be cast to `1.00136e-05`.

However, when we compute `auto rqrst_input = rsqrt(at::pow(upcasted_input, 2).mean(dims_to_reduce_ref, /*keepdim=*/true).add_(eps_val));`, `eps_val` is upcast again to match `opmath_t`. The round trip between these two dtypes is unnecessary, so we can simply declare `opmath_t eps_val` instead of `scalar_t`.
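
The precision loss from storing epsilon in a low-precision type can be reproduced without PyTorch. The following is a minimal sketch, assuming the usual bfloat16 conversion (round-to-nearest-even of a float32 down to its upper 16 bits); `round_to_bfloat16` is a hypothetical helper, not a PyTorch function, and NaN and overflow handling are omitted.

```python
import struct


def round_to_bfloat16(x: float) -> float:
    """Round a Python float to the nearest bfloat16 value (returned as a float)."""
    # Reinterpret the float32 encoding of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits that will be dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


eps_1e5 = round_to_bfloat16(1e-5)
eps_1e6 = round_to_bfloat16(1e-6)
```

Under this rounding, `eps_1e5` is about 1.00136e-05 and `eps_1e6` is about 9.98e-07; in both cases the stored epsilon differs from the requested double, which is the drift that keeping `eps_val` in `opmath_t` avoids.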

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142848
Approved by: https://github.com/mikaylagawarecki
2025-01-17 21:30:54 +00:00
a8ef423fed Fix NJT min / max backward() for non-ragged reductions (#144583)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

`value_selecting_reduction_backward()` is used in the backward for min / max, so this PR implements it for NJT. Notably, this isn't enough for reducing over the ragged dim, since that results in a dense tensor and thus NJT's torch_dispatch will not be called for this op. We need factory function support for nested ints to fix that case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144583
Approved by: https://github.com/soulitzer
ghstack dependencies: #144582
2025-01-17 20:57:11 +00:00
cac10b8190 Fix NJT OpInfo entry for nn.functional.prelu (#144582)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

The OpInfo entry for prelu was wrong before this PR; `weight` needs to be passed as well. The op isn't fully implemented yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144582
Approved by: https://github.com/soulitzer
2025-01-17 20:36:15 +00:00
eaef613688 Fix issue with test/nn/test_convolution:TestConvolutionNNDeviceTypeCUDA.test_conv_large_batch_1_cuda (#145067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145067
Approved by: https://github.com/Skylion007, https://github.com/nWEIdia

Co-authored-by: Wei Wang <143543872+nWEIdia@users.noreply.github.com>
2025-01-17 20:31:25 +00:00
0eda02a94c Prevent legacy_load when weights_only=True (correctly) (#145020)
Only prevent `legacy_load` (.tar format removed in https://github.com/pytorch/pytorch/pull/713), not the whole of `_legacy_load` (.tar format + _use_new_zipfile_serialization=False)

Differential Revision: [D68301405](https://our.internmc.facebook.com/intern/diff/D68301405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145020
Approved by: https://github.com/kit1980, https://github.com/albanD
2025-01-17 20:10:22 +00:00
2ef7b68666 [inductor] fix TORCH_LOGS="benchmarking" (#144997)
Saw this error with TORCH_LOGS="benchmarking"
```
  File "/data/users/colinpeppler/pytorch/torch/_inductor/runtime/benchmarking.py", line 37, in wrapper
    result = fn(*args, **kwargs)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/runtime/benchmarking.py", line 66, in wrapper
    return fn(self, *args, **kwargs)
torch._inductor.exc.InductorError: TypeError: Benchmarker.benchmark() missing 1 required positional argument: 'fn_kwargs'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144997
Approved by: https://github.com/eellison, https://github.com/nmacchioni
2025-01-17 19:41:18 +00:00
d996d7ec13 upgrade to sccache 0.9.1 - dealing with nvcc -E correctly (#145012)
sccache 0.9.1 should be dealing with `nvcc -E` correctly

see https://github.com/mozilla/sccache/pull/2300

If this works as expected, we can get rid of this code:
https://github.com/pytorch/pytorch/pull/142813/files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145012
Approved by: https://github.com/malfet
2025-01-17 19:26:01 +00:00
46fbd63405 Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2025-01-17 18:21:22 +00:00
18638b91fe Adding more compile time logging in pad_mm (#144884)
Summary: As title

Test Plan:
[midin@6262.od /data/sandcastle/boxes/fbsource/fbcode (99e64d2e4)]$ tlp buck run mode/opt caffe2/test/inductor:pad_mm -- -r test_exclude_padding

https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2F.tmpiJLgXX%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2F.tmpiJLgXX%2Fchromium_events.json&local_cache_key

 {F1974355662}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144884
Approved by: https://github.com/oulgen
2025-01-17 17:35:55 +00:00
567552b98b fix typo in doc and import for torch._library.triton (#144882)
Previously, the doc's suggested `from torch._library.triton import wrap_triton, triton_op` didn't work because wrap_triton was not imported in torch/_library/__init__.py, while `from torch.library import wrap_triton` did work. This PR imports wrap_triton and fixes the doc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144882
Approved by: https://github.com/zou3519
2025-01-17 17:32:12 +00:00
18eba9575f [Accelerator] Use uniform GetAllocator for devices in new_qtensor function (#144849)
Fixes #144848
This PR is intended to use a uniform `GetAllocator()` call for all the accelerators for `new_qtensor` function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144849
Approved by: https://github.com/guangyey, https://github.com/albanD
2025-01-17 16:37:37 +00:00
a215e174a1 [BE] Remove conda from scripts and build files Part 2 (#145015)
Continuation of https://github.com/pytorch/pytorch/pull/144870

Remove conda logic from scripts:

1. Remove conda build from triton build script
2. Remove conda checks from setup.py
3. Remove conda from release scripts
4. Script read_conda_versions.sh is not used (checked via git grep)

Related to: https://github.com/pytorch/pytorch/issues/138506
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145015
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-01-17 16:26:24 +00:00
b7af210d8d Add SM89 support for f8f8bf16_rowwise() (#144348)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144348
Approved by: https://github.com/drisspg
2025-01-17 15:12:35 +00:00
f522502b97 Revert "patch for block-wise quantization + pt2e (#144492)"
This reverts commit 1d43b8150852cdfcbe754edcf027d6e25f33ac63.

Reverted https://github.com/pytorch/pytorch/pull/144492 on behalf of https://github.com/albanD due to Broke a few things in trunk ([comment](https://github.com/pytorch/pytorch/pull/144492#issuecomment-2598485291))
2025-01-17 14:27:53 +00:00
dbed747aae Add Intel GPU specific CMake files to merge rules (#135110)
As the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135110
Approved by: https://github.com/atalman
2025-01-17 09:44:13 +00:00
a0d2c09115 Add flop formula for _scaled_mm (#144973)
This will make it work correctly with the partitioner's AutoAC
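
The commit message does not spell out the formula. As a hedged sketch, assuming `_scaled_mm` is counted like an ordinary dense matrix multiply and the scaling itself is ignored, the standard count for an (M, K) by (K, N) product is 2 * M * K * N floating point operations. `matmul_flops` is a hypothetical helper name:

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m, k) @ (k, n) product: one multiply and one add per term."""
    return 2 * m * k * n


# Example: a (32, 512) input times a (512, 256) weight.
flops = matmul_flops(32, 512, 256)
```

An estimate of this kind is what lets a cost-based partitioner compare recomputing a matmul against saving its output.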
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144973
Approved by: https://github.com/jeffdaily
2025-01-17 09:38:30 +00:00
96c0dbbe97 Enhance running pr time benchmarks locally experience. (#144838)
Summary: title

Test Plan: NA

Differential Revision: D68195894

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144838
Approved by: https://github.com/huydhn
2025-01-17 07:57:40 +00:00
465a1cfe2e update get start xpu (#143183)
- Support new Intel client GPU on Windows [Intel® Arc™ B-Series graphics](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/desktop/b-series/overview.html) and [Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics](https://www.intel.com/content/www/us/en/products/details/processors/core-ultra.html)
- Support vision/audio prebuilt wheels on Windows
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143183
Approved by: https://github.com/EikanWang, https://github.com/leslie-fang-intel, https://github.com/atalman, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-17 06:31:40 +00:00
fd8e0e3e10 [mps/inductor] Introduce is_mps_backend/skip_if_mps decorators. (#145035)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145035
Approved by: https://github.com/jansel
2025-01-17 05:36:38 +00:00
cfd9cc19a3 [executorch hash update] update the pinned executorch hash (#145022)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145022
Approved by: https://github.com/pytorchbot
2025-01-17 04:51:56 +00:00
f13c864eda Fuzzer Improvements (#144952)
Added more tests and cleaned up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144952
Approved by: https://github.com/masnesral
2025-01-17 04:46:58 +00:00
1d43b81508 patch for block-wise quantization + pt2e (#144492)
Summary: As title, needed to enable the qcom block-wise quantization kernel

Test Plan: local test

Differential Revision: D67985303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144492
Approved by: https://github.com/angelayi, https://github.com/billmguo
2025-01-17 04:10:49 +00:00
adbbcd87d9 OpenReg: Split Allocator (#144843)
Split the Allocator into HostAllocator and DeviceAllocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144843
Approved by: https://github.com/albanD
2025-01-17 03:38:15 +00:00
43a00d73b3 [Trace Python Dispatcher] Support FuncTorchInterpreter (#144444)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144444
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #144439
2025-01-17 02:26:37 +00:00
5d02575aa1 [Trace Python dispatcher] Support torch.DispatchKey & torch.DispatchKeySet (#144439)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144439
Approved by: https://github.com/zou3519
2025-01-17 02:26:36 +00:00
3a50aba7d3 [dynamo] add option to not skip on empty graph (#144885)
Temporary fix to https://github.com/pytorch/pytorch/issues/144360.

Turning the config on globally will cause a bunch of tests to fail, which needs to be addressed in followups.

I had a previous attempt at https://github.com/pytorch/pytorch/pull/144712, but this is a more complicated change and will likely be absorbed into work to refactor Dynamo's exception handling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144885
Approved by: https://github.com/jansel
2025-01-17 02:12:20 +00:00
7b56b039af [dcp] Add ZStandard transformer (#143360)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143360
Approved by: https://github.com/saumishr
ghstack dependencies: #143358, #143359
2025-01-17 01:51:37 +00:00
9c909bf3bb [dcp] Integrate stream extensions into DCP impl (#143359)
Summary: Updates FileSystemReader/Writer, Planner, DefaultLoad/SavePlanner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143359
Approved by: https://github.com/saumishr
ghstack dependencies: #143358
2025-01-17 01:51:37 +00:00
ba3f1c29ee [dcp] Add extension mechanism (#143358)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143358
Approved by: https://github.com/saumishr
2025-01-17 01:51:37 +00:00
176cde6240 Use torch with statement in torch distributed module (#144951)
# Motivation
In https://github.com/pytorch/pytorch/pull/137678, we used the device-agnostic APIs to generalize the distributed module. As this [comment](https://github.com/pytorch/pytorch/pull/137678#discussion_r1828645683) said, we will use the with statement of `torch.Stream` once https://github.com/pytorch/pytorch/pull/140138 has landed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144951
Approved by: https://github.com/kwen2501, https://github.com/albanD
2025-01-17 01:49:28 +00:00
a61a65ff82 [MPSInductor] Add Worker.current_device method (#145023)
That just returns 0, as multi-gpu is not currently supported by MPS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145023
Approved by: https://github.com/dcci
2025-01-17 01:41:01 +00:00
55b0819bee Revert "Add tests for different dtypes with max autotune (#144721)"
This reverts commit d2a77f48c9dc6df056051de270ce5875d8d2edd0.

Reverted https://github.com/pytorch/pytorch/pull/144721 on behalf of https://github.com/kit1980 due to breaking internal builds, max autotune tests are failing, see D68297606 ([comment](https://github.com/pytorch/pytorch/pull/144721#issuecomment-2597250605))
2025-01-17 01:36:14 +00:00
45e6647268 [FSDP2] Make post-backward condition more robust (#144781)
Fixes https://github.com/pytorch/pytorch/issues/144755

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144781
Approved by: https://github.com/fegin
2025-01-17 01:28:56 +00:00
6077102415 [DSD][BE] Rewrite some tests to remove with_comms (#143241)
Summary:
This saves ~ 1 minute test time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143241
Approved by: https://github.com/mori360, https://github.com/XilunWu
ghstack dependencies: #143240
2025-01-17 01:15:55 +00:00
5d54e7b812 [Pipelining] move scale_grads to base class, add docs (#144833)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144833
Approved by: https://github.com/H-Huang
2025-01-17 01:07:12 +00:00
3afc5170d4 [Submodule] Upgrade to Cutlass 3.6 part deux (#144911)
# Summary
Take 2 of [D67866269](https://www.internalfb.com/diff/D67866269)
Main change is that we identified and fixed the FA2 regression. More details can be found here https://github.com/pytorch/pytorch/issues/144729 and have landed that before this here: [D68194635](https://www.internalfb.com/diff/D68194635)

Differential Revision: D68194470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144911
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-01-17 00:53:42 +00:00
6c713ccb5e Revert "Make functionalization ViewMeta serializable with pickle. (#143712)"
This reverts commit b8abdaa286fd161af48af57a675827f4f849914d.

Reverted https://github.com/pytorch/pytorch/pull/143712 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/143712#issuecomment-2597205261))
2025-01-17 00:52:50 +00:00
42c64bd35c [MPSInductor] More is_dtype_supported gating (#144981)
This makes `GPUTest.test_scalar_cpu_tensor_arg_mps` pass

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144981
Approved by: https://github.com/dcci
ghstack dependencies: #144971
2025-01-17 00:48:02 +00:00
94c0f15302 Revert "cpp_wrapper: Move #includes to per-device header files (#143909)"
This reverts commit d62b3979dadfa4928ec1c76e850f874d49803125.

Reverted https://github.com/pytorch/pytorch/pull/143909 on behalf of https://github.com/kit1980 due to breaking internal builds because of removal of torch‎/_inductor‎/codegen‎/aoti_runtime‎/implementation.cpp‎ ([comment](https://github.com/pytorch/pytorch/pull/143909#issuecomment-2597188669))
2025-01-17 00:36:38 +00:00
5e6e6200bf Revert "[dynamo][dicts] Consolidate dict(..) construction (#144342)"
This reverts commit a54a784b8207617d2b99fbded9bb34c94fb6dd23.

Reverted https://github.com/pytorch/pytorch/pull/144342 on behalf of https://github.com/kit1980 due to breaking internal builds, see D68125388 ([comment](https://github.com/pytorch/pytorch/pull/144342#issuecomment-2597184167))
2025-01-17 00:32:09 +00:00
cyy
2ea394ba29 Modernize C++ code (#144603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144603
Approved by: https://github.com/malfet
2025-01-17 00:25:18 +00:00
c3fcb3606d Profile compile_inner instead of _compile_inner (#144930)
Summary: title

Test Plan: NA

Reviewed By: jamesjwu

Differential Revision: D67990492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144930
Approved by: https://github.com/jamesjwu
2025-01-16 23:59:27 +00:00
573fc42f25 [BE][CP] Use run_subtests instead of parametrize (#143240)
Summary:
This provides a 15X increase in test performance speed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143240
Approved by: https://github.com/XilunWu
2025-01-16 23:55:05 +00:00
fea9d18d5a [Utilization Log] Concurrently collect aggregate data during the output interval (#143235)
# overview
Add worker to collect metrics in short intervals
1. Worker: add a worker to collect usage metrics every 500ms by default; this interval is configurable.
2. Calculate the avg and max as data points every 5 seconds by default.
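
The two steps above can be sketched as a pure aggregation over already-collected samples. This is an illustration only: the function names are hypothetical, the defaults simply mirror the 500ms and 5 second intervals mentioned above, and the real monitor collects samples concurrently rather than from a list.

```python
from statistics import mean


def aggregate(samples: list[float]) -> dict[str, float]:
    """Reduce one output window of raw samples to its avg and max."""
    return {"avg": mean(samples), "max": max(samples)}


def aggregate_windows(
    samples: list[float],
    collect_interval_s: float = 0.5,
    output_interval_s: float = 5.0,
) -> list[dict[str, float]]:
    """Group samples taken every `collect_interval_s` into output windows."""
    per_window = int(round(output_interval_s / collect_interval_s))
    return [
        aggregate(samples[i : i + per_window])
        for i in range(0, len(samples), per_window)
    ]
```

With the default intervals each data point summarizes ten raw samples, so short utilization spikes still show up in `max` even when `avg` smooths them out.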

# Other
Clean up the log format to keep only what is needed; currently we do not need to track GPU processes etc., or all pids from psutil.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143235
Approved by: https://github.com/huydhn
2025-01-16 23:52:43 +00:00
288d67d6c2 [inductor] [bug fix] align avg_pool with eager when handling uint (#144313)
Fixes #144310

~~We just need to add a check in lowering~~

updated: we add the error checking in `meta registration`

### UT
```
 pytest -s -v test/inductor/test_torchinductor.py -k test_avg_pool_errors_with_uint
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144313
Approved by: https://github.com/jansel, https://github.com/jgong5
2025-01-16 23:37:51 +00:00
d2a77f48c9 Add tests for different dtypes with max autotune (#144721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144721
Approved by: https://github.com/cpuhrsch, https://github.com/etaf
2025-01-16 23:04:56 +00:00
clr
171fb7f358 easy: Fix missing tab in test/dynamo/test_compile.py (#145013)
It turns out that if you request a merge on a pytorch PR, and then push a fix for a bad rebase, and the test is
relatively new, the merge will go through with the previous commit and not notice the test break.

Explicitly running the test now passes vs failing, and this is just the last missing commit from https://github.com/pytorch/pytorch/pull/144817

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145013
Approved by: https://github.com/masnesral, https://github.com/jansel
2025-01-16 22:51:51 +00:00
181d93b4f2 [BE] Move is_device_supported to helper function (#144971)
And extend `test_inf` to check half (explicitly instead of check_lowp) and bfloat16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144971
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/jansel
2025-01-16 22:44:03 +00:00
a33e02cb26 [executorch hash update] update the pinned executorch hash (#144813)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144813
Approved by: https://github.com/pytorchbot, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-01-16 22:39:00 +00:00
7c7bcb1e33 update IS_JETSON check (#144725)
update IS_JETSON check to include the latest SM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144725
Approved by: https://github.com/eqy
2025-01-16 22:34:48 +00:00
95c363cc9b dynamo: Don't crash with internal error if getattr on a tensor fails (#144817)
This prevents crashes when getattr is called on a tensor for something
which doesn't exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144817
Approved by: https://github.com/williamwen42, https://github.com/jansel
2025-01-16 22:04:06 +00:00
0e6d44df3f Add heuristic to fail block pointer match early (#144681)
This PR adds a heuristic to potentially fail the block pointer match early. Expressions like below take a long time to match using sympy (e.g. > 100 seconds)
```python
# torch._inductor.config.triton.use_block_ptr = True
# torch._inductor.config.triton.prefer_nd_tiling = True
# Expression from pytest -k test_max_pool2d1_dynamic_shapes_cuda:
 ((xindex//ps1))*((s2 - 3//2))**2 + 2*((xindex//ps1))*((s2 - 3//2)) + ((xindex//ps1)) + ((s2 - 3//2))*(ModularIndexing(xindex, ps0, ps0)) + (ModularIndexing(xindex, 1, ps0)) + (ModularIndexing(xindex, ps0, ps0))
```
Additionally, the heuristic for the number of dimensions based on the indexing expression is refined to only add dimensions for FloorDiv(index, denom) and ModularIndexing(index, denom, modulo) instead of including FloorDiv/ModularIndexing expressions that don't involve the index.
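The refined dimension count can be illustrated on a toy expression tree. This is only a sketch: the nested-tuple representation and the `count_index_dims` helper are hypothetical stand-ins for the sympy expressions Inductor works with, the sample expression is loosely modelled on the one above, and the real heuristic may differ in detail.

```python
# Toy stand-in for sympy expressions (hypothetical representation):
# ("FloorDiv", a, b), ("ModularIndexing", a, b, c), ("Add", ...), ("Mul", ...),
# or a plain string/int leaf.

def count_index_dims(expr, index):
    """Count FloorDiv/ModularIndexing nodes whose first argument is `index`."""
    if not isinstance(expr, tuple):
        return 0
    op, *args = expr
    # Only nodes that divide or wrap the index itself add a dimension.
    hit = op in ("FloorDiv", "ModularIndexing") and args[0] == index
    return int(hit) + sum(count_index_dims(a, index) for a in args)


expr = (
    "Add",
    # FloorDiv(xindex, ps1) counts; FloorDiv(s2 - 3, 2) does not involve xindex.
    ("Mul", ("FloorDiv", "xindex", "ps1"), ("FloorDiv", ("Add", "s2", -3), 2)),
    ("ModularIndexing", "xindex", 1, "ps0"),
    ("ModularIndexing", "xindex", "ps0", "ps0"),
)
num_dims = count_index_dims(expr, "xindex")
```

Under these assumptions, the `FloorDiv` over `s2 - 3` is skipped because its first argument is not the index, which is the refinement described above.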

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144681
Approved by: https://github.com/jansel
2025-01-16 21:57:30 +00:00
46b92c025d Revert "Cholesky mps implementation (#144193)"
This reverts commit 727ae1331820bb3d83d70e9cd3c9d3cd4c79ff56.

Reverted https://github.com/pytorch/pytorch/pull/144193 on behalf of https://github.com/malfet due to Alas, inductor changes broke inductor tests, see aa4a1ff027/1 ([comment](https://github.com/pytorch/pytorch/pull/144193#issuecomment-2596938163))
2025-01-16 21:37:32 +00:00
aa4a1ff027 Revert "Prevent _legacy_load with weights_only=True (#144914)"
This reverts commit 7c3aa1da1c97812af54d41f3f0eff2ef922c0f32.

Reverted https://github.com/pytorch/pytorch/pull/144914 on behalf of https://github.com/izaitsevfb due to breaking inductor on trunk ([comment](https://github.com/pytorch/pytorch/pull/144914#issuecomment-2596922781))
2025-01-16 21:29:50 +00:00
4ea189422d Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit a6763b7b81cd1a55c8316dfdb5bca19819a1429a.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2596895865))
2025-01-16 21:12:41 +00:00
3a5bf0bc36 expose extra torch_python apis (#144746)
Fixes #144302
After checking the code of my third-party devices, I think these APIs are also relied on by us, so I exposed them according to the discussion in the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144746
Approved by: https://github.com/albanD
2025-01-16 20:50:31 +00:00
577708e6de Unskipped multiple inductor tests for ROCm (#143581)
All of them should be fine to run now after the triton fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143581
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-16 20:46:06 +00:00
a9bfc5f70c Fix boundary conditions for hardswish backward (#143899)
Fixes #136345.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143899
Approved by: https://github.com/jgong5, https://github.com/ezyang
2025-01-16 20:26:27 +00:00
aad5f600ff [mps] Massage test_full_truncation to work only on the supported dtypes. (#144877)
Converted a first one to make sure the pattern was the one we wanted -- if we're OK with this, I'll probably adjust all the other failing ones in a batch or two. Let me know.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144877
Approved by: https://github.com/jansel, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-16 19:51:45 +00:00
3908be676c Fix loading older state_dict into AdamW after refactor (#144972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144972
Approved by: https://github.com/albanD
2025-01-16 19:50:31 +00:00
b8abdaa286 Make functionalization ViewMeta serializable with pickle. (#143712)
Fix: #141974

This PR makes `ViewMeta` sequence, present in functional tensors,
serializable with pickle. In order to accomplish that, it makes
`ViewMeta` an abstract class with overridable `forward` and `reverse`
functions. In this context, each operation that once instantiated
`ViewMeta` should now create a new specialized class that inherits from
`ViewMeta`. Therefore, this PR also uses codegen for creating these
specializations.

In summary, these are the changes this PR introduces:

- `ViewMeta` is turned into an abstract class (see
  _FunctionalStorageImpl.cpp_). `forward` and `reverse` are pure virtual
  functions that need to be implemented. `to_out_index` should be
  implemented by operations that might return more than 1 output.

- New `ViewMeta` specializations for `resize_` and `_unsafe_view` are
  created (see _FunctionalizeFallbackKernel.h_).

- New templates _ViewMetaClasses.{cpp,h}_ are created. They hold the
  declaration and definition of the `ViewMeta` specializations, which
  are automatically generated in the ATen codegen (see _gen.py_).

- New `_functionalization` Python sub-module is created (see
  _Module.cpp_). It serves as namespace for the `ViewMeta`
  specializations and `InverseReturnMode` enum.

- New template _ViewMetaClassesPythonBinding.cpp_ is created. It holds
  the automatically generated Python bindings for the `ViewMeta`
  specialization, which are generated in the torch codegen (see
  _generate_code.py_).

Note that this PR makes use of codegen at 2 different moments:

- ATen codegen (_gen.py_): generates the `ViewMeta` specialized classes.
- Torch codegen (_generate_code.py_): generates the Python bindings for
  them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143712
Approved by: https://github.com/bdhirsh
2025-01-16 19:41:41 +00:00
7c3aa1da1c Prevent _legacy_load with weights_only=True (#144914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144914
Approved by: https://github.com/malfet, https://github.com/albanD
2025-01-16 19:33:46 +00:00
cf28d613f1 Allow ROCm runner to upload benchmark results if found (#144710)
https://github.com/pytorch/pytorch/wiki/How-to-integrate-with-PyTorch-OSS-benchmark-database. This will unblock AMD when they try to run MI300 benchmarks on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144710
Approved by: https://github.com/kit1980
2025-01-16 19:31:45 +00:00
31a73eb712 fix acquire pattern in topk (#144945)
Similar to #128455, topk needs another threadfence to complete acquire pattern.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144945
Approved by: https://github.com/Skylion007
2025-01-16 19:20:43 +00:00
3004b657f0 [Inductor][FlexAttention] Supports dynamic shapes with custom kernel options (#144938)
Fixes #144815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144938
Approved by: https://github.com/drisspg
2025-01-16 19:02:35 +00:00
e32d2bf853 Document decoupled_weight_decay for Adam for consistency with N/RAdam (#144984)
Followup from #144972 and #143710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144984
Approved by: https://github.com/albanD
2025-01-16 18:58:29 +00:00
ad15436db6 Fix pt2-bug-report.yml formatting (#144987)
This is a 2nd regression caused by https://github.com/pytorch/pytorch/pull/144574

Test plan: `python3 -c "import yaml; foo=yaml.safe_load(open('pt2-bug-report.yml'));print(foo['body'][0])"`
Before it printed
```
% python3 -c "import yaml; foo=yaml.safe_load(open('pt2-bug-report.yml'));print(foo['body'][0])"
{'type': 'markdown', 'attributes': {'value': ''}}
```
After
```
% python3 -c "import yaml; foo=yaml.safe_load(open('pt2-bug-report.yml'));print(foo['body'][0])"
{'type': 'markdown', 'attributes': {'value': '#### Note: Please write your bug report in English to ensure it can be understood and addressed by the development team.\n'}}
```

Fixes https://github.com/pytorch/pytorch/issues/144970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144987
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2025-01-16 18:58:07 +00:00
829c4570ca Revert "[mps] Massage test_full_truncation to work only on the supported dtypes. (#144877)"
This reverts commit 1b34665767fcc35ae4a8f211945a24701c79df79.

Reverted https://github.com/pytorch/pytorch/pull/144877 on behalf of https://github.com/malfet due to Actually no, lint is red ([comment](https://github.com/pytorch/pytorch/pull/144877#issuecomment-2596385712))
2025-01-16 18:10:37 +00:00
13d35ea67a [BE] Add missing throw of std::runtime_error in scrc/cuda/utils.cpp (#144962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144962
Approved by: https://github.com/amjames, https://github.com/Skylion007, https://github.com/malfet
2025-01-16 17:35:39 +00:00
53256edff9 [export] Support module inputs for non strict mode. (#143925)
Summary:
Add experimental support for torch.nn.Module as input types.

Before this change, we didn't support module inputs, but recently we saw some interesting use cases like gpt-fast https://github.com/pytorch-labs/gpt-fast/blob/main/generate.py#L68 where a module is passed in directly as an input for different variants of the same model.

Since we don't really care about non-param or non-buffer states in non-strict mode, we ignore them for module inputs as well and treat them as plain constants during tracing. We treat any module input like a nested container of tensors, and each time we automatically register a pytree handler for these module types to flatten their state dict into a group of tensors. We inline any module method call during tracing, like we did for the `self` module in export_for_training. This makes an input module's behavior very similar to the training module in the typical case, except that we don't record the inputs as parameters or buffers but rather as plain user inputs.

Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_module_input

Differential Revision: D67680827

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143925
Approved by: https://github.com/tugsbayasgalan
2025-01-16 17:30:36 +00:00
519269a415 [BE] - Remove conda test and upload scripts and env variables from Workflows Part 1 (#144870)
Remove conda test and upload scripts and env variables from Workflows

Related to: https://github.com/pytorch/pytorch/issues/138506
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144870
Approved by: https://github.com/malfet
2025-01-16 17:20:14 +00:00
727ae13318 Cholesky mps implementation (#144193)
Requested in #77764

PR is still in draft because it needs some cleanups and optimizations to at least reach CPU performance. Tasks:
- [x] Make `upper=True` work, only `upper=False` works now
- [x] Code cleanup
- [x] Optimizations(Though might need some help on this)(tried my best, maybe there is still some more to squeeze out)
- [x] Checks for positive definite input
- [x] Support for (*, N, N) input, currently only supports (B, N, N) input
- [x] Support other dtypes(float16, bfloat16)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144193
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-16 16:26:46 +00:00
1b34665767 [mps] Massage test_full_truncation to work only on the supported dtypes. (#144877)
Converted a first one to make sure the pattern was the one we wanted -- if we're OK with this, I'll probably adjust all the other failing ones in a batch or two. Let me know.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144877
Approved by: https://github.com/jansel, https://github.com/malfet
2025-01-16 16:22:06 +00:00
3d29de3ac8 [aoti] Deduplicate "V.aot_compilation" and "V.graph.aot_mode" flags. [1/n] (#144709)
Summary:
According to angelayi, these two flags indicated different things when we had two-pass codegen, but since we now basically keep the two flags the same, we should merge them.

This can prevent bugs (e.g. changing the value of aot_mode without covering branches like `if V.aot_compilation is True`) when we add different code paths that tweak the value of aot_mode in the future.

Test Plan: CI

Differential Revision: D68122536

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144709
Approved by: https://github.com/angelayi, https://github.com/desertfire
2025-01-16 16:02:18 +00:00
241a8a101b Fix erroneous at_vreinterpretq_u16_bf16 call (#144883)
Here, `mask` is definitely a `uint16x8_t`, not an `at_bfloat16x8_t`, so we shouldn't be reinterpreting it. Candidate fix for #144818.

Differential Revision: [D68224128](https://our.internmc.facebook.com/intern/diff/D68224128/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144883
Approved by: https://github.com/tinglvv, https://github.com/Skylion007, https://github.com/malfet
2025-01-16 15:16:28 +00:00
6559374494 Revert "Add flop formula for _scaled_mm (#144872)"
This reverts commit f31452268bf9f7e395f263cd8a9d693633ea75ce.

Reverted https://github.com/pytorch/pytorch/pull/144872 on behalf of https://github.com/lw due to Breaks ROCm jobs on main ([comment](https://github.com/pytorch/pytorch/pull/144872#issuecomment-2595994134))
2025-01-16 15:16:18 +00:00
6470b0ea6f Update torch-xpu-ops commit pin (#144739)
Update the torch-xpu-ops commit to [22cc419e4e60f469341712a5a103fa309a7dfd48](22cc419e4e), includes:

- Fix building issue https://github.com/intel/torch-xpu-ops/issues/1279
- Aten operator coverage improvement

Note: the new torch-xpu-ops commit doesn't support bundle 0.5.3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144739
Approved by: https://github.com/EikanWang, https://github.com/malfet
2025-01-16 15:12:37 +00:00
f31452268b Add flop formula for _scaled_mm (#144872)
This will make it work correctly with the partitioner's AutoAC
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144872
Approved by: https://github.com/vkuzo
2025-01-16 13:57:54 +00:00
1c290912e4 Revert "Add tests for different dtypes with max autotune (#144721)"
This reverts commit 9e568cbaa22df89b77e112f1a373d82acb2e6219.

Reverted https://github.com/pytorch/pytorch/pull/144721 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/144721#issuecomment-2595210355))
2025-01-16 10:59:30 +00:00
0c0583254e [inductor] fix index.Tensor fallback (#144736)
The original issue is an accuracy problem in a meta internal model [meta internal link](https://fb.workplace.com/groups/1075192433118967/posts/1567334737238065/).  The debugging was hard but the root cause is relatively simple: the model has mixed-device inputs for index.Tensor, which causes Inductor to fall back, and the meta kernel for index.Tensor returns a tensor whose strides are inconsistent with the eager kernel's.

The following code snippet
```
import torch
from torch._subclasses import FakeTensorMode

device = "cuda"

x = torch.randn((24, 16, 32, 32), device=device).to(memory_format=torch.channels_last)
x = x.view(2, 12, 16, 32, 32)

i1 = torch.arange(2).unsqueeze(-1)
i2 = torch.argsort(torch.rand(2, 12), dim=-1)[:, :3]

print(f"Eager stride: {x[i1, i2].stride()}")

mode = FakeTensorMode()
with mode:
    f_x = mode.from_tensor(x)
    f_i1 = mode.from_tensor(i1)
    f_i2 = mode.from_tensor(i2)
    f_out = f_x[f_i1, f_i2]
    print(f"Meta stride: {f_out.stride()}")
```

would output:
```
Eager stride: (49152, 16384, 1, 512, 16)
Meta stride: (49152, 16384, 1024, 32, 1)
```

In this PR, I fix the problem by running the eager kernel to get the index.Tensor fallback's output layout. A better solution would be to change the meta/eager kernel implementations so that their output layouts match, but I'm not sure how to properly do that.
In the index.Tensor meta kernel, we always produce dense output:  6d56277682/torch/_meta_registrations.py (L3184) . While the eager kernel seems to leverage TensorIteratorBase to decide some dimension permutation: 6d56277682/aten/src/ATen/TensorIterator.cpp (L232-L308) .  We can duplicate this logic in the meta kernel implementation if we really want meta to match eager. I can follow up on this if people have a strong opinion about doing this.

And here is an issue https://github.com/pytorch/pytorch/issues/144717 for asserting size/strides for fallback kernels. With that, the issue debugged here would be much easier to root cause.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144736
Approved by: https://github.com/jansel
2025-01-16 09:38:29 +00:00
57d5659c3b XFAIL test_save_load_checkpoint (#144927)
Fixes https://github.com/pytorch/pytorch/issues/137771

The issue keeps showing up and rerunning the disabled tests couldn't reproduce it.  So, XFAIL it while waiting for the distributed team to investigate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144927
Approved by: https://github.com/kit1980, https://github.com/malfet
2025-01-16 07:31:56 +00:00
7d8c087e24 [Pipelining] Improve shape inference debug logging (#144929)
Remove log that just said "running forward" since that is not so useful
in itself, replace with somewhat equivalent log that reports both input
and output shapes after running forward.

Note: enabled by `TORCH_LOGS=+pp`

Example:
```
[rank0]:V0115 13:28:58.282000 3908366 torch/distributed/pipelining/stage.py:1400] Shape inference: stage 0 inputs (tensor(..., device='meta', size=(1, 64), dtype=torch.int64),), outputs (tensor(..., device='meta', size=(1, 64, 256), dtype=torch.bfloat16),)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144929
Approved by: https://github.com/H-Huang
2025-01-16 07:30:11 +00:00
0b17c09893 restore rng generation for fbcode (#144819)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144819
Approved by: https://github.com/malfet, https://github.com/kit1980
2025-01-16 06:46:26 +00:00
49bdc418be Add strict kwarg to nn.Module.set_submodule and fix bug for non dot delineated strings (#143455)
Before fixing set_submodule, it used to create leaf modules when the target was not a dot-delimited string. After the fix it will not create a new attribute if target is a non-dot-delimited string. If you want to create leaf nodes of `nn.Module` parent nodes, you can use `replace_or_create_new_leaf_module`.

Fixes https://github.com/pytorch/pytorch/issues/143441

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143455
Approved by: https://github.com/mikaylagawarecki
2025-01-16 05:06:33 +00:00
e3c4d1b7d6 [c10d][fr] Fix the bug when we still mark mismatch when there are match case (#144916)
When we introduced partial match, we accidentally also marked the full match case as a mismatch. This is wrong and this PR fixes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144916
Approved by: https://github.com/c-p-i-o
2025-01-16 04:36:30 +00:00
9e568cbaa2 Add tests for different dtypes with max autotune (#144721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144721
Approved by: https://github.com/cpuhrsch, https://github.com/etaf
2025-01-16 04:29:44 +00:00
52a620845b OpenReg: Use device agnostic API (#144840)
Use `torch.accelerator.device_count()` to get the number of devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144840
Approved by: https://github.com/albanD
2025-01-16 03:31:52 +00:00
1230de4c1b [Quant][Inductor][X86] Separate binary post op fusion and lowering for qconv (#144318)
**Summary**
The current implementation fuses quantized ops and their post ops and lowers the fused op to the cpp backend in the same pass. It is better to separate post op fusion and lowering because
- it looks better in terms of design
- we need the post op fusion pass for PT2E quantization eager mode

As one of a series of PRs which do the separation, this PR moves binary post op fusion of qconv out of the lowering pass to after the weight-prepack pass. The workflow is
1. Weight prepack for qlinear so that `dq - conv` patterns are replaced by `onednn.qconv2d_pointwise`
2. Fuse `onednn.qconv2d_pointwise` and post ops
3. Lower to cpp backend

This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused.

**Test plan**
It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144318
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
ghstack dependencies: #144224, #144312
2025-01-16 03:30:36 +00:00
cyy
843627b7b1 Remove unnecessary once flag usage (#143255)
Static variables in C++11 are guaranteed to be initialised exactly once, as mentioned [here](https://en.cppreference.com/w/cpp/language/storage_duration)
```
If multiple threads attempt to initialize the same static local variable concurrently,
the initialization occurs exactly once
(similar behavior can be obtained for arbitrary functions with std::call_once.
Usual implementations of this feature use variants
of the double-checked locking pattern,
which reduces runtime overhead for already-initialized local statics
 to a single non-atomic boolean comparison.
```
Given that a static c10::once_flag was used before, why not just use the associated function to initialise the related static variables? That is the motivation behind this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143255
Approved by: https://github.com/albanD
2025-01-16 02:36:11 +00:00
41ec2e8d3e [MPSInductor] Fix codegen regression (#144924)
Caused by https://github.com/pytorch/pytorch/pull/144649

Do not try to insert anything into the header if wrapper is not ready yet

Fixes `test_sort_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144924
Approved by: https://github.com/dcci
ghstack dependencies: #144827, #144917
2025-01-16 02:12:42 +00:00
05505771a0 [MPSInductor] Properly convert index (#144917)
By calling `self.index_to_str` from `load`, `store` and `check_bounds` in order to properly handle sizevars variable renames

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144917
Approved by: https://github.com/dcci
ghstack dependencies: #144827
2025-01-16 02:12:41 +00:00
d595b96059 Revert "restore rng generation for fbcode (#144819)"
This reverts commit 2bc18a905544f4e25cfbd354351418b36a0f5fc1.

Reverted https://github.com/pytorch/pytorch/pull/144819 on behalf of https://github.com/ngimel due to internal failure ([comment](https://github.com/pytorch/pytorch/pull/144819#issuecomment-2594298941))
2025-01-16 01:52:29 +00:00
6492851125 symbolic_convert: Don't fail when we hit a undefined name (#144784)
We're using a python builtin NameError here,
instead of throwing an Unsupported exception. This causes the
NameError to get wrapped in an InternalTorchDynamoError
instead of just causing a graph break, and letting the user code fail
directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144784
Approved by: https://github.com/williamwen42, https://github.com/jansel
2025-01-16 01:47:48 +00:00
c8bcb22e5f Default Copies are not vectorized in v3.6.0 of cutlass (#144837)
Summary:
FlashAttentionV2 perf was tanked in v3.6.0, See: https://github.com/pytorch/pytorch/issues/144729 for more details.

This PR makes it possible to land the v3.6.0 update and fixes the perf regression. See: https://github.com/pytorch/pytorch/issues/144729#issuecomment-2591644076 for analysis; we also have various internal tests to verify.

Differential Revision: D68194635

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144837
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-01-16 01:12:46 +00:00
926f9056a9 speculation_log: Raise a unique error for divergence issues (#144785)
This is primarily sent for discussion and to see what tests fail due to
this. The idea is that rather than capturing this as a regex on the
fail_reason, just give it a unique failure type
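The idea can be sketched in a few lines. The exception class and helper names below are hypothetical and are not the names used in this PR; the point is only that a type check keeps working when the message is reworded, while a regex on the message does not.

```python
import re


class SpeculationDivergenceError(Exception):
    """Hypothetical dedicated failure type for divergence issues."""


def is_divergence_by_regex(exc):
    # Fragile: depends on the exact wording of the failure message.
    return re.search(r"diverge", str(exc)) is not None


def is_divergence_by_type(exc):
    # Robust: the message can change without breaking callers.
    return isinstance(exc, SpeculationDivergenceError)


# A reworded message still carries its type.
reworded = SpeculationDivergenceError("speculation log mismatch")
```

With the reworded message, `is_divergence_by_type(reworded)` still recognises the failure while the regex check silently stops matching.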

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144785
Approved by: https://github.com/ezyang
2025-01-16 00:49:43 +00:00
b90231a189 [inductor][BE] don't try/except ImportError for AttrsDescriptor versions (#144807)
motivation: Ed's advice to avoid `except ImportError` (i.e. based on the fact that your target module/class might in fact exist, but you might run into some different ImportError whose stacktrace you now ignore).

additional motivation: I'm going to add some more cases to this list, and would like to avoid this pattern:
```
try:
   ...
except ImportError:
    try:
        ...
    except ImportError:
        try:
            ...
```

suggestions on better ways to do this would be appreciated!
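One possible alternative, offered only as a sketch: probe for each candidate module with `importlib.util.find_spec` before importing it, so that an `ImportError` raised while an existing module is being loaded is not swallowed. The `first_available` helper is hypothetical, and the stdlib module names in the example are stand-ins for the Triton locations.

```python
import importlib
import importlib.util


def first_available(candidates):
    """Return `attr` from the first module in `candidates` that exists and has it.

    `candidates` is a list of (module_name, attr_name) pairs.
    """
    for module_name, attr_name in candidates:
        try:
            spec = importlib.util.find_spec(module_name)
        except ModuleNotFoundError:
            # Raised when a parent package of a dotted name cannot be found.
            spec = None
        if spec is None:
            continue
        # Any ImportError raised here is a real failure and propagates.
        module = importlib.import_module(module_name)
        if hasattr(module, attr_name):
            return getattr(module, attr_name)
    raise ImportError(f"none of {candidates} is available")


# Stand-in candidates: the first location does not exist, the second does.
loads = first_available([("no_such_pkg.sub", "loads"), ("json", "loads")])
```

The candidate list stays flat as more locations are added, instead of growing one nesting level per fallback.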

test: ran with triton commit e5be006a (last working commit) and 34a6a2ff8 (in June, when AttrsDescriptor was still in triton.compiler.compiler)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144807
Approved by: https://github.com/ezyang
2025-01-16 00:32:29 +00:00
cyy
ee97d80be2 Apply Ruff fixes and pyupgrade to torch/jit (#144208)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144208
Approved by: https://github.com/davidberard98
2025-01-16 00:28:50 +00:00
774f21a370 [export] handle buffer/input mutations for joint-graph (#144806)
Summary: previous construction of GraphSignature output specs didn't consider buffer/user input mutations

Test Plan: test_experimental

Differential Revision: D68177409

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144806
Approved by: https://github.com/zhxchen17, https://github.com/avikchaudhuri
2025-01-16 00:22:16 +00:00
d7f45fc575 dynamic shape support for interpolate(antialias=True) backward (#141198)
Fixes https://github.com/pytorch/pytorch/issues/141187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141198
Approved by: https://github.com/ezyang, https://github.com/Chillee
ghstack dependencies: #141161
2025-01-16 00:08:25 +00:00
4831f89790 support numbers as tensors for aten.copy(Tensor, Tensor) (#141161)
Fixes https://github.com/pytorch/pytorch/issues/141149. `aten.copy_` supports numbers as tensors in the python arg parser. So we need to give the same treatment to `aten.copy`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141161
Approved by: https://github.com/ezyang
2025-01-16 00:08:25 +00:00
2645fc45b1 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-15 23:43:41 +00:00
fb4b5a9299 [ONNX] Use python_dispatcher in type promotion (#144801)
Fix #143118

Use python_dispatcher in the type promotion pass to preserve symbolic shapes according to @angelayi 's suggestions. (Thanks!)

Tested locally. I wasn't able to create a minimal repro except for using the full model
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144801
Approved by: https://github.com/titaiwangms
2025-01-15 23:25:19 +00:00
7265dc0622 Enable s8s8s8 for qlinear with mkl-dnn (#139887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139887
Approved by: https://github.com/huydhn
2025-01-15 23:20:10 +00:00
4e1834f5f3 use cooperative schedule in scaled_mm for fast_accum=false (#144809)
This improves perf for large matrices by more than 2x, more detailed benchmark coming.
On master
![image](https://github.com/user-attachments/assets/fc6a0987-5b82-475d-a2ff-b46641bb17dc)
On this branch
<img width="601" alt="image" src="https://github.com/user-attachments/assets/7f55152b-1110-45e4-b2ea-6f274d543869" />
A plot similar to https://github.com/pytorch/ao/pull/1325#discussion_r1868193786
<details>
  <summary>Benchmarking code:</summary>

```python
import torch
from triton.testing import do_bench
import itertools

def fn_aten_scales(a, b, scale_a, scale_b, use_fast_accum=False):
    return torch._scaled_mm(a, b.t(), scale_a.view(-1, 1), scale_b.view(1, -1), use_fast_accum=use_fast_accum, out_dtype=torch.bfloat16)

def fn_aten(a, b, scale, use_fast_accum=False):
    return torch._scaled_mm(a, b.t(), scale, scale, use_fast_accum=use_fast_accum, out_dtype=torch.bfloat16)

for i,j,k in itertools.product(range(9, 15), range(9, 15), range(9, 15)):
    m = 2**i
    n = 2**j
    k = 2**k

    a=torch.randn(m, k, device="cuda").to(dtype=torch.float8_e4m3fn)
    b=torch.randn(n, k, device="cuda").to(dtype=torch.float8_e4m3fn)
    scale_a = torch.randint(1, 11, (a.shape[0],), device="cuda", dtype=torch.float32)
    scale_b = torch.randint(1, 11, (b.shape[0],), device="cuda", dtype=torch.float32)
    scale_0 = torch.randn((), device="cuda", dtype=torch.float32)

    ms_rowwise_fast = do_bench(lambda: fn_aten_scales(a, b, scale_a, scale_b, use_fast_accum=True), warmup=25, rep=50)
    ms_rowwise_slow = do_bench(lambda: fn_aten_scales(a, b, scale_a, scale_b, use_fast_accum=False), warmup=25, rep=50)

    ms_tensor_fast = do_bench(lambda: fn_aten(a, b, scale_0, use_fast_accum=True), warmup=25, rep=50)
    ms_tensor_slow = do_bench(lambda: fn_aten(a, b, scale_0, use_fast_accum=False), warmup=25, rep=50)

    print(f"m={m}, n={n}, k={k}, fast={ms_rowwise_fast}, slow={ms_rowwise_slow}, ratio_tw={ms_tensor_slow /ms_tensor_fast}, ratio_rw={ms_rowwise_slow / ms_rowwise_fast}")

```
</details>

Higher N/K values still have about 40% penalty, perhaps some additional heuristics tweaks would be useful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144809
Approved by: https://github.com/drisspg
2025-01-15 23:04:14 +00:00
0f051eaf66 Revert "Fix global namespace pollution in ATen/Dispatch.h (#138626)"
This reverts commit 326c7cae28783f29c577b5a5d3ac38a3b61188bd.

Reverted https://github.com/pytorch/pytorch/pull/138626 on behalf of https://github.com/malfet due to This broke inductor tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor_torchbench%2C%202%2C%202 ([comment](https://github.com/pytorch/pytorch/pull/138626#issuecomment-2594021436))
2025-01-15 21:59:04 +00:00
Sam
c7b2f7dd14 Add generator parameter to rand*_like functions (#136780)
Fixes #128786
Fixes #101974
Fixes #27072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136780
Approved by: https://github.com/Chillee, https://github.com/ezyang
2025-01-15 21:16:52 +00:00
d62b3979da cpp_wrapper: Move #includes to per-device header files (#143909)
This prepares us for the next PR in the stack, where we introduce pre-compiled per-device header files to save compilation time.

Differential Revision: [D67938955](https://our.internmc.facebook.com/intern/diff/D67938955)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143909
Approved by: https://github.com/desertfire
2025-01-15 21:14:02 +00:00
05095a45f2 Fix the wrong artifact in remaining workflows (#144812)
I missed them in https://github.com/pytorch/pytorch/pull/144694 as they weren't run often.  But they are still failing nonetheless, i.e. https://github.com/pytorch/pytorch/actions/runs/12762640334/job/35578870178

The issue was from https://github.com/pytorch/pytorch/pull/125401 where it added `use-gha: ${{ inputs.use-gha }}` to linux_test workflow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144812
Approved by: https://github.com/clee2000
2025-01-15 20:36:40 +00:00
b88dcb4835 dynamo: Don't crash when tracing a missing attr on a constant. (#144593)
This throws an InternalTorchDynamoError: AttributeError: 'NoneType' object has no attribute 'max'
instead of just skipping the bad call when tracing, and throwing a
normal AttributeError instead.

There are two questions that I would love reviewer comment on.
1) Is throwing unimplemented the right thing here? or should I throw
   something like ObservedAttributeError
2) Do we need to worry about performance with this code? In particular,
   should we just catch the exception? Or maybe cache the lookup result?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144593
Approved by: https://github.com/jansel
2025-01-15 20:23:43 +00:00
d812fdd490 fix as_bool serde (#144791)
Differential Revision: [D68167701](https://our.internmc.facebook.com/intern/diff/D68167701/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144791
Approved by: https://github.com/pianpwk
2025-01-15 20:22:26 +00:00
904641769e [MPSInductor] Implement pow() (#144827)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144827
Approved by: https://github.com/dcci, https://github.com/jansel
2025-01-15 20:11:34 +00:00
b410378d93 Register nonzero for meta device for FBLSim (#144727)
Summary:
Fix `nonzero is not registered to meta` issue:
```
"NotImplementedError: aten::nonzero: attempted to run this operator with Meta tensors, but there was no fake impl or Meta kernel registered".
```

Reviewed By: ezyang

Differential Revision: D66525640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144727
Approved by: https://github.com/ezyang
2025-01-15 19:40:42 +00:00
834086c023 [export] Load side info about pos/kw argument kind for serialization. (#144686)
Summary:
Fixing issue of nodes like
```
torch.ops.aten.linear.default(x, w, b)
```
being deserialized as
```
torch.ops.aten.linear.default(x, w, bias=b)
```
which breaks roundtripping.
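
A minimal sketch of the idea, not the actual `torch.export` serializer: if the serialized form records whether each argument was positional or keyword, the call can be rebuilt exactly as written. The `ArgKind` enum and the `serialize_call` / `deserialize_call` helpers are hypothetical names used only for illustration.

```python
from enum import Enum


class ArgKind(Enum):
    # Hypothetical tag recording how an argument was passed at the call site.
    POSITIONAL = "positional"
    KEYWORD = "keyword"


def serialize_call(args, kwargs):
    """Flatten a call into (name, value, kind) records, keeping the kind as side info."""
    records = [(None, value, ArgKind.POSITIONAL.value) for value in args]
    records += [(name, value, ArgKind.KEYWORD.value) for name, value in kwargs.items()]
    return records


def deserialize_call(records):
    """Rebuild (args, kwargs) from the recorded kind instead of guessing from a schema."""
    args, kwargs = [], {}
    for name, value, kind in records:
        if ArgKind(kind) is ArgKind.POSITIONAL:
            args.append(value)
        else:
            kwargs[name] = value
    return tuple(args), kwargs
```

With the kind stored alongside each value, `linear(x, w, b)` and `linear(x, w, bias=b)` deserialize to distinct calls, so neither is silently turned into the other.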

Test Plan:
buck test mode/opt caffe2/test:test_export -- -r TestDeserialize
buck test mode/opt caffe2/test:test_export -- -r TestSerialize

Differential Revision: D67991410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144686
Approved by: https://github.com/angelayi
2025-01-15 19:08:38 +00:00
898a90c6bb [dynamo][hop] Introduce FlexAttentionBackwardHighOrderVariable (#144533)
FIXES https://github.com/pytorch/pytorch/issues/143180

This PR adds a new variable mapping to SourcelessBuilder to represent the flex attention intermediates. The variable proxies a call to HOP, and carries over the graph state (subgraphs represented as UnspecializedNNModuleVariable) to the dynamo output graph. This is safe to do because the nn modules used in flex attention have either been speculated on before, or are outputs of make_fx of the forward.

tlparse of `TestCompiledAutograd.test_flex_attention`: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpiWendk/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

```python
class GraphModule(torch.nn.Module):
    def forward(self, L_inputs_ : list):
         ...
         # File: /data/users/xmfan/core/b/pytorch/torch/_dynamo/compiled_autograd.py:832 in set_node_origin, code: CompiledFunctionBackward0 (NodeCall 1)
        ...
        fw_graph0_0 = self.fw_graph0_0
        joint_graph0_0 = self.joint_graph0_0
        mask_graph0_0 = self.mask_graph0_0
        flex_attention_backward = torch.ops.higher_order.flex_attention_backward(aot0_primals_1, aot0_primals_1, aot0_primals_1, aot0_detach_3, aot0_detach_5, aot0_expand_5, aot0_zeros_1, fw_graph0_0, joint_graph0_0, (1, 1, aot0_ones, aot0_zeros, None, None, aot0__to_copy_1, aot0__to_copy_2, None, None, 1073741824, 1073741824, mask_graph0_0), 0.125, {'PRESCALE_QK': False, 'ROWS_GUARANTEED_SAFE': False, 'BLOCKS_ARE_CONTIGUOUS': False, 'WRITE_DQ': True, 'OUTPUT_LOGSUMEXP': True}, (), ());  aot0_primals_1 = aot0_detach_3 = aot0_detach_5 = aot0_expand_5 = aot0_zeros_1 = fw_graph0_0 = joint_graph0_0 = aot0_ones = aot0_zeros = aot0__to_copy_1 = aot0__to_copy_2 = mask_graph0_0 = None
        aot0_getitem_4: "bf16[1, 1, s0, s1][s0*s1, s0*s1, s1, 1]cuda:0" = flex_attention_backward[0]
        aot0_getitem_5: "bf16[1, 1, s0, s1][s0*s1, s0*s1, s1, 1]cuda:0" = flex_attention_backward[1]
        aot0_getitem_6: "bf16[1, 1, s0, s1][s0*s1, s0*s1, s1, 1]cuda:0" = flex_attention_backward[2];  flex_attention_backward = None
        ...

    class fw_graph0_0(torch.nn.Module):
        def forward(self, arg0_1: "bf16[][]cuda:0", arg1_1: "i32[][]cuda:0", arg2_1: "i32[][]cuda:0", arg3_1: "i32[][]cuda:0", arg4_1: "i32[][]cuda:0"):
            return arg0_1

    class joint_graph0_0(torch.nn.Module):
        def forward(self, arg0_1: "bf16[][]cuda:0", arg1_1: "i32[][]cuda:0", arg2_1: "i32[][]cuda:0", arg3_1: "i32[][]cuda:0", arg4_1: "i32[][]cuda:0", arg5_1: "bf16[][]cuda:0"):
            return [arg5_1, None, None, None, None]

    class mask_graph0_0(torch.nn.Module):
        def forward(self, arg0_1: "i32[][]cuda:0", arg1_1: "i32[][]cuda:0", arg2_1: "i32[][]cuda:0", arg3_1: "i32[][]cuda:0"):
             # File: /data/users/xmfan/core/b/pytorch/torch/_dynamo/compiled_autograd.py:832 in set_node_origin, code: CompiledFunctionBackward0 (NodeCall 1)
            new_ones: "b8[][]cuda:0" = torch.ops.aten.new_ones.default(arg0_1, [], dtype = torch.bool, device = device(type='cuda', index=0), pin_memory = False);  arg0_1 = None
            return new_ones

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144533
Approved by: https://github.com/zou3519
2025-01-15 18:40:57 +00:00
eqy
a6763b7b81 [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee
2025-01-15 18:37:55 +00:00
6ac0616504 [ROCm] hipblaslt rowwise f8 gemm (#144432)
hipblaslt added rowwise f8 gemm support.  Integrate with scaled_mm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144432
Approved by: https://github.com/drisspg
2025-01-15 18:23:44 +00:00
069419569d [PagedAttention] Support different input position for each batch index (#144693)
In LLM inference, each request usually has different prefill length, leading to different input position for each batch index. This PR adds such support for paged attention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144693
Approved by: https://github.com/drisspg
2025-01-15 18:03:52 +00:00
7e80758efc [CUDAGraph][Docs] add cuda to torch.randn (#144793)
Previous doc example created `torch.randn` tensor on cpu so CUDAGraph was skipped.

Fixes #144386

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144793
Approved by: https://github.com/eellison
2025-01-15 18:02:10 +00:00
ee8f833d13 Undo leading underscore on ctx for breakpoint (#144864)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144864
Approved by: https://github.com/Skylion007
2025-01-15 18:00:58 +00:00
443de667b1 Revert "Enable s8s8s8 for qlinear with mkl-dnn (#139887)"
This reverts commit dc8692b0eb093d5af150ae0f3a29a0957c3e4c0d.

Reverted https://github.com/pytorch/pytorch/pull/139887 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to have broken trunk. See here for more details: [GH job link](https://github.com/pytorch/pytorch/actions/runs/12788709683/job/35651699934) [HUD commit link](dc8692b0eb) ([comment](https://github.com/pytorch/pytorch/pull/139887#issuecomment-2593597977))
2025-01-15 17:58:33 +00:00
d065e8a9de [ez] add lint commits to .git-blame-ignore-revs (#144576)
Test Plan: Ran git blame on .lintrunner.toml and github's linter (+ manual testing) shows all commits exist
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144576
Approved by: https://github.com/janeyx99
2025-01-15 17:39:29 +00:00
c07dc64017 Update pin memory related APIs to not pass 'device' argument (#131858)
Based on https://github.com/pytorch/pytorch/pull/126376, this PR tries to update all PT callers (e.g., `Tensor.is_pinned()`, `Tensor.pin_memory()`) to not pass `device` argument.
As for `storage/untyped_storage.is_pinned()/pin_memory()`, we keep the `device` argument but passing `device` is discouraged. And if not given, the default `device` is still 'cuda' for BC.
Additionally, based on device-agnostic pin_memory, the `pin_memory_device` argument of `torch.utils.data.DataLoader` is now discouraged. For BC, explicitly passing this argument is still effective. If not given, the default `device` will be the current accelerator.

Fixes #124908
Relates https://github.com/pytorch/pytorch/pull/126376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131858
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-01-15 17:23:35 +00:00
0dca756832 Revert "Upload METADATA file with whl binaries (#143677)" (#144706)
This reverts commit 3eb3f4ed5580010a7961d996ccc6ee19c7ccbb5e.

Also reverts https://github.com/pytorch/pytorch/pull/144164

Manual revert because the above causes merge conflicts

Reverting in favor of https://github.com/pytorch/test-infra/pull/6159
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144706
Approved by: https://github.com/janeyx99, https://github.com/atalman, https://github.com/malfet
2025-01-15 17:20:21 +00:00
d782e46a36 [BE] typing for decorators - library (#138969)
Test Plan: unit tests

Differential Revision: D62302678

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138969
Approved by: https://github.com/zou3519
2025-01-15 17:08:55 +00:00
c7a9599100 Handle meta tensors in FX quantization (#144726)
Summary:
D66895899 got reverted in D67565250 because of pytorch OSS linter failure.
Adding back with the format the linter suggested
https://github.com/pytorch/pytorch/actions/runs/12443655335/job/34743090791

Test Plan: buck run fbcode//mode/dev-nosan fbcode//torchrec/fb/quant/tests:test_embedding_modules

Reviewed By: emlin

Differential Revision: D68132568

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144726
Approved by: https://github.com/iamzainhuda, https://github.com/janeyx99
2025-01-15 16:49:43 +00:00
2bc18a9055 restore rng generation for fbcode (#144819)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144819
Approved by: https://github.com/malfet, https://github.com/kit1980
2025-01-15 16:34:25 +00:00
154185dcd0 Revert "Removed unused _RequiredParameter (#144771)"
This reverts commit 6a5f895e549665a6895c84881a35736677071048.

Reverted https://github.com/pytorch/pytorch/pull/144771 on behalf of https://github.com/malfet due to It broke number of cpuinductor tests ([comment](https://github.com/pytorch/pytorch/pull/144771#issuecomment-2593293542))
2025-01-15 15:51:33 +00:00
7c52c97a65 Expose several APIs to public (torch python APIs) (#144525)
Fixes #144302
Try to expose several APIs to public for privateuse1 scenario.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144525
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-01-15 14:34:45 +00:00
dc8692b0eb Enable s8s8s8 for qlinear with mkl-dnn (#139887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139887
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168, https://github.com/ng-05, https://github.com/digantdesai
2025-01-15 12:51:21 +00:00
7e1c1e65eb Graph freezing preparation for non-Inductor backends (#139902)
Enable preparing module named parameters and buffers in tracing context for non-Inductor backends to implement graph freezing.

Fixes #139272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139902
Approved by: https://github.com/eellison, https://github.com/masnesral, https://github.com/gujinghui
2025-01-15 11:25:04 +00:00
62ce3e6e84 refresh benchmarks results after recent regressions (#143075)
refresh data after !5 regression by https://github.com/pytorch/pytorch/pull/144319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143075
Approved by: https://github.com/bobrenjc93, https://github.com/huydhn
2025-01-15 09:11:57 +00:00
e263f0af23 [BE] Make a SymbolInfo NamedTuple (#144745)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144745
Approved by: https://github.com/avikchaudhuri, https://github.com/Skylion007
2025-01-15 08:59:27 +00:00
d9d7cca009 make eval_frame safe (#141357)
Fixes #108942

this PR converts eval_frame.c's static extension types to heap types, making it thread and sub-interpreter safe.

the current modification only showcases one state variable being lifted, but there are opportunities for other variables that can be addressed in this PR

todo / suggestions:

1. uplift `eval_frame_callback_key` to module state
2. define `.m_slots` to module definition so initialization is within python's module lifecycle rather than an explicit `torch_c_dynamo_eval_frame_init`
3. define configurations for module allowing sub-interpreters or not

```c
static int module_exec(PyObject *m) {}

static PyModuleDef_Slot module_slots[] = {
    {Py_mod_exec, module_exec},
    {0, NULL}
};

static struct PyModuleDef module = {
    PyModuleDef_HEAD_INIT,
     ....
    .m_slots = module_slots
};
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141357
Approved by: https://github.com/jansel

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2025-01-15 07:37:50 +00:00
6ba53a5f1c [AMD] De-noise tf32 warnings (#144797)
Summary: This is way too noisy especially during unit tests. So just log once.

Test Plan: OSS CI. Tested on a unit test and now I only see one line (hard to notice :) ).

Differential Revision: D68167633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144797
Approved by: https://github.com/jianyuh, https://github.com/leitian, https://github.com/yoyoyocmu
2025-01-15 07:10:38 +00:00
69b883d7ac Remove C10_EMBEDDED (#144808)
I added this to support code sharing with ExecuTorch, but the operator<< overrides are load-bearing for builds -- we have other code that attempts to pretty-print Half/BFloat16, and implicit conversions can't be used to make that work because there are *multiple* implicit conversions from Half/BFloat16 to primitive types, so which one to select is ambiguous. Also, we don't actually seem to need it now in ExecuTorch core because we have `include <ostream>` in there at the moment anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144808
Approved by: https://github.com/janeyx99, https://github.com/malfet
2025-01-15 06:08:53 +00:00
b801210035 Restore support for other types of async_compile pools (spawn, fork) (#144491)
Summary: https://github.com/pytorch/pytorch/pull/142001 removed support for process pools other than "subprocess", but some OSS users still find it useful; put it back.

Test Plan: New unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144491
Approved by: https://github.com/jansel, https://github.com/haifeng-jin
2025-01-15 06:04:49 +00:00
326c7cae28 Fix global namespace pollution in ATen/Dispatch.h (#138626)
Summary:

Was it a typo? Since we already have `at::detail::record_kernel_function_dtype()` in `ATen/Dispatch.h`

Test Plan: just build

Differential Revision: D64642080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138626
Approved by: https://github.com/malfet
2025-01-15 05:43:54 +00:00
7d71ddbe5d Add non_c_binding torch functions to allowlist for AOTAutogradCache, confirm no special handlers for them (#144802)
Differential Revision: [D68173093](https://our.internmc.facebook.com/intern/diff/D68173093/)

This diff allows any function in torch_non_c_binding_in_graph_functions to be safe to cache. These functions should be safe to cache because they are part of the torch API, and do not save global state (or if they do, dynamo creates unique guards around the constants they return).
A function that's allowed in a dynamo graph is safe to cache for AOTAutograd purposes as long as:
- It's functional (i.e. does not access global state);
- or its value is constant folded away (and guarded against by dynamo)

The tricky cases are functions that dynamo uses special handlers to track. These special handlers can sometimes close over stuff that's safe for dynamo locally, but isn't encoded anywhere when cached across processes. An example of this is `DTensor.from_local`, where various DeviceMesh information doesn't change in the same dynamo process, but can change across multiple processes. The handler for `DTensor.from_local` closes over these and dynamo creates a proxy for the function call. This is not safe to cache.

That said, most special handlers are in fact functional and safe. So I add a unit test to test_trace_rules.py that confirms that any function with special handlers in dynamo added to this list needs to be audited to be safe to cache.

The list of safe handlers there either:
- Don't access global state;
- Guard on global state; or
- Always return a constant that never changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144802
Approved by: https://github.com/bdhirsh
2025-01-15 05:41:36 +00:00
79312ddb73 [PP] Don't allow for num_microbatches > num_stages for single stage schedules (#144702)
There is an edge case where `Schedule1F1B` will hang when num_microbatches=1 (https://github.com/pytorch/torchtitan/issues/775). For validation it makes sense to check that the number of stages should be >= number of microbatches otherwise there will be an even larger bubble.

This check can be removed once the single stage schedules use an IR and are updated to run with the schedule runtime (issue tracker https://github.com/pytorch/pytorch/issues/144701)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144702
Approved by: https://github.com/kwen2501
2025-01-15 05:35:29 +00:00
ae7df51232 [c10d] Fix CudaEventCache for dangling references (#144496)
Reported in https://github.com/pytorch/pytorch/issues/143470, there is a dangling reference in `CudaEventCache`. This PR fixes it:
1. We add a unit test to reproduce the problem described in the issue.
2. Instead of converting variables to shared pointers as suggested in the issue, we make the cache itself a shared pointer. So if the thread that creates the cache dies before all events are recycled, the cache stays alive until the last CudaEvent is deleted. (Thanks to @kwen2501 for the suggestion.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144496
Approved by: https://github.com/kwen2501
2025-01-15 05:11:48 +00:00
9cd6f46130 [ca] raise error message on AOT Autograd caching (#144595)
FIXES https://github.com/pytorch/pytorch/issues/144175, bandaid

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144595
Approved by: https://github.com/bdhirsh
2025-01-15 05:09:42 +00:00
e0bbff6019 [c10d][ez] Add comments to the end of Macro for better readability (#144789)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144789
Approved by: https://github.com/c-p-i-o
2025-01-15 05:06:41 +00:00
d2ca8163c0 [MPSInductor] Support abs in MetalPrintExpr (#144826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144826
Approved by: https://github.com/dcci
ghstack dependencies: #144509, #144798, #144795, #144796
2025-01-15 05:01:25 +00:00
9610a22e94 Fix FakeTensor device creation for MPS (#144796)
By promoting torch.device("mps") to `torch.device("mps:0")`, but skipping `is_initialized` check, as MPS does not really support multi-GPU right now

This fixes `GPUTests.test_remove_no_ops_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144796
Approved by: https://github.com/ezyang
ghstack dependencies: #144509, #144798, #144795
2025-01-15 05:01:25 +00:00
18786c65e5 [BE] Extend test_remove_no_ops (#144795)
----

- Use `is_dtype_supported` to skip dtype promotions portion of the test on unsupported device
- Extend it to use `torch.float16` so promotions could be checked there
- Implement `CpuInterface.is_bfloat16_supported` that returns true (which looks like the case, even if it's supported via emulation)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144795
Approved by: https://github.com/Skylion007
ghstack dependencies: #144509, #144798
2025-01-15 05:00:26 +00:00
48f7e7c378 [torch][ao][EASY] Change print to log in numeric debugger to avoid large output (#144790)
Summary:
This print statement was spewing a bunch of data in logs by default, but it should
be silenceable.
Use `log.debug` instead.
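
As a rough illustration of why this makes the output silenceable (a generic `logging` sketch, not the numeric debugger code; the logger name and the `report` helper are hypothetical):

```python
import logging

log = logging.getLogger("numeric_debugger_sketch")  # hypothetical logger name


def report(values):
    # With print(), this output would always appear. With log.debug it is
    # emitted only when the logger's effective level is DEBUG or lower.
    log.debug("numeric debug values: %s", values)
    return log.isEnabledFor(logging.DEBUG)
```

Callers who want the data can opt in with `log.setLevel(logging.DEBUG)`; everyone else sees nothing.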

Differential Revision: D68166823

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144790
Approved by: https://github.com/tarun292
2025-01-15 04:58:56 +00:00
6a5f895e54 Removed unused _RequiredParameter (#144771)
As per this [discussion](https://discuss.pytorch.org/t/a-question-about-requiredparameter/137977), I figured that `_RequiredParameter` is no longer used.

The `required` object was initially introduced in this [PR](4db6667923) as the `SGD` optimizer did not offer a default value for the learning rate. However there isn't a single place in the code base using `_RequiredParameter`, nor `required`. I am therefore removing unused `_RequiredParameter` and `required`.

Everything not included in this PR is Not a Contribution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144771
Approved by: https://github.com/janeyx99
2025-01-15 04:11:17 +00:00
cyy
d87aad6877 [5/N] Apply Ruff fixes and pyupgrade to Python 3.9 (#144205)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144205
Approved by: https://github.com/albanD
2025-01-15 04:00:47 +00:00
db787181b5 Back out "[Submodule] Upgrade to Cutlass 3.6" (#144738)
Summary: Revert due to perf regressions see: https://github.com/pytorch/pytorch/issues/144729

Test Plan: sand castle

Differential Revision: D68137326

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144738
Approved by: https://github.com/huydhn
2025-01-15 02:57:14 +00:00
e2251fffbb [MPSInductor] Add min/max to MetalExprPrinter (#144798)
After that, `GPUTests::test_avg_pool2d8_mps` and `GPUTests::test_avg_pool2d5_mps` pass

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144798
Approved by: https://github.com/dcci
ghstack dependencies: #144509
2025-01-15 01:43:42 +00:00
9199c79a9c [Quant][Inductor][X86] Separate unary post op fusion and lowering for qconv (#144312)
**Summary**
The current implementation fuses quantized ops and their post ops and lowers the fused op to the cpp backend in the same pass. It is better to separate post op fusion and lowering because
- it looks better in terms of design
- we need the post op fusion pass for PT2E quantization eager mode

As one of a series of PRs which do the separation, this PR moves unary post op fusion of qconv out of the lowering pass to after the weight-prepack pass. The workflow is
1. Weight prepack for qlinear so that `dq - conv` patterns are replaced by `onednn.qconv2d_pointwise`
2. Fuse `onednn.qconv2d_pointwise` and post ops
3. Lower to cpp backend

This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused.

**Test plan**
It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144312
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
ghstack dependencies: #144224
2025-01-15 00:50:54 +00:00
825fe15024 EZ fix to make sure local pytest run succeeds in export (#144764)
Previously run_tests() was protected under the IS_FBCODE flag so that the following works:
```
python test/export/test_export_legacy.py
```

But it fails on:
```
pytest test/export/test_export_legacy.py
```

This is because pytest doesn't seem to get triggered through run_tests().

Differential Revision: [D68152737](https://our.internmc.facebook.com/intern/diff/D68152737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144764
Approved by: https://github.com/avikchaudhuri
2025-01-15 00:43:40 +00:00
8c2aa0c533 [cutlass backend] cexpr the arg before writing to cpp file (#144714)
Summary: For certain shapes (see the unit test), one of the dimensions is an expression like `s0 // 2`. With the cutlass backend, that expression is written verbatim to the C++ file, which leads to a C++ compilation error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144714
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78, https://github.com/desertfire
2025-01-14 23:09:44 +00:00
8ad37ed710 Stop ignoring mypy errors in torch/testing/_internal/common_utils.py (#144483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144483
Approved by: https://github.com/Skylion007
2025-01-14 22:32:51 +00:00
ea3395e4f2 [ROCm] Improvements for vectorized elementwise kernels (#143269)
*  Compute io_size as the minimum of the input and output sizes, rather than the sum of all sizes
   * For example, for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
   * But elems_per_thread = 8 works better on half dtypes for AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD gpus by using vector size of 8 and 16 respectively
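
The arithmetic in the first bullet can be sketched as follows. This is an illustrative model only: the function names are hypothetical, and the `elems_per_thread` thresholds are an assumption chosen to reproduce the numbers quoted above (io_size 6 gives 4 elements, io_size 2 gives 8), not a copy of the kernel code.

```python
def io_size_sum(input_sizes, output_size):
    # Old behaviour as described: add up the byte sizes of all inputs and the output.
    return sum(input_sizes) + output_size


def io_size_min(input_sizes, output_size):
    # New behaviour as described: take the smallest byte size among inputs and output.
    return min(min(input_sizes), output_size)


def elems_per_thread(io_size):
    # Assumed mapping, for illustration only: smaller per-element I/O
    # allows more elements per thread.
    if io_size == 1:
        return 16
    if io_size < 4:
        return 8
    return 4
```

For an add on two 2-byte inputs with a 2-byte output, the summed size is 6 and the minimum is 2, which moves the kernel from the 4-element to the 8-element bucket under this assumed mapping.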

Co-author: @akadutta

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
2025-01-14 22:09:21 +00:00
c000214826 Allow GradientEdge as torch.autograd.backward outputs (#144744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144744
Approved by: https://github.com/albanD
2025-01-14 21:31:44 +00:00
64829b356a [PrivateUse1] Support parseDispatchKey with modified PrivateUse1 (#144325)
PyTorch now supports many PrivateUse1 backend names such as `AutogradPrivateUse1` and `QuantizedPrivateUse1`, in addition to the original `PrivateUse1` backend.

However, users who implement `PrivateUse1` functionality often rename the backend by calling `torch.utils.rename_privateuse1_backend("my_backend")`. In that case, none of the `PrivateUse1` backend strings can be found when other functions that depend on them are called. For example, when using `torch.library` to register custom functions for the new backend, we pass "my_backend" as the backend name instead of "PrivateUse1", and the following error is thrown:
```
could not parse dispatch key 'my_backend'
```

So, this PR changes the function `c10::DispatchKey parseDispatchKey(const std::string& k)` to check whether `PrivateUse1` has been renamed; if so, it adjusts `k` for the new backend name and retries the lookup.
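
The retry logic can be sketched in plain Python. This is a simplified model under stated assumptions, not the C++ implementation: `KNOWN_KEYS` is a small hypothetical subset of key names, and `parse_dispatch_key` and its `renamed_backend` parameter are illustrative only.

```python
# Hypothetical subset of dispatch key names, for illustration only.
KNOWN_KEYS = {"CPU", "CUDA", "PrivateUse1", "AutogradPrivateUse1", "QuantizedPrivateUse1"}


def parse_dispatch_key(key, renamed_backend=None):
    """Look up `key`; on a miss, map the renamed backend back to PrivateUse1 and retry."""
    if key in KNOWN_KEYS:
        return key
    if renamed_backend and renamed_backend in key:
        candidate = key.replace(renamed_backend, "PrivateUse1")
        if candidate in KNOWN_KEYS:
            return candidate
    raise ValueError(f"could not parse dispatch key '{key}'")
```

The first lookup keeps the common path unchanged; the substitution only runs after a miss, so keys that already parse are unaffected by a rename.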

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144325
Approved by: https://github.com/albanD
2025-01-14 21:21:29 +00:00
130452dad6 [Pipelining] fix test_schedule.py (missing destroy_process_group (#144734)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144734
Approved by: https://github.com/H-Huang
ghstack dependencies: #144352, #144596
2025-01-14 21:16:09 +00:00
aa57f0c663 [Pipelining] Refactor common utils from test_pp_dp (#144596)
Split test_pp_dp into pp_ddp and pp_fsdp so it's a bit more
concise and easier to add CP to the FSDP one.

Realize that 'use_new_runtime' parametrization was not even being used,
removing it saves a bunch of test time. We should migrate schedules to
the new runtime and have them be covered that way.  (And
test_schedule*.py are testing new runtime too).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144596
Approved by: https://github.com/H-Huang
ghstack dependencies: #144352
2025-01-14 20:13:17 +00:00
6f5dce3035 [Pipelining] Fix PP grad scaling (#144352)
Adds a grad-scaling method `perform_pp_grad_scaling()` which divides grads by num_microbatches.

Enables grad scaling by default, unless disabled due to using a loss function that sums instead of averaging losses.
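
Why dividing by num_microbatches is needed for averaged losses can be shown with a small pure-Python sketch. It uses a toy scalar model with a mean squared error loss and is not the pipelining code; `mean_loss_grad` and `accumulated_grad` are hypothetical helpers.

```python
def mean_loss_grad(w, xs, ys):
    # d/dw of mean((w*x - y)^2) over the given samples.
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, num_microbatches, scale=True):
    # Split the batch into equal microbatches and sum their gradients, as
    # gradient accumulation does; optionally divide by num_microbatches.
    size = len(xs) // num_microbatches
    total = 0.0
    for i in range(num_microbatches):
        chunk = slice(i * size, (i + 1) * size)
        total += mean_loss_grad(w, xs[chunk], ys[chunk])
    return total / num_microbatches if scale else total
```

With equal-sized microbatches, the scaled result matches the full-batch gradient, while the unscaled sum is num_microbatches times too large. A loss that sums rather than averages does not have this mismatch, which is why scaling can be disabled in that case.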

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144352
Approved by: https://github.com/H-Huang
2025-01-14 20:13:17 +00:00
9157a748a6 [MPSInductor] Add dummy properties (#144509)
For compute capability (which is an empty string, same as CPU).
And for multicore count, return 8, as this is the smallest number of GPU cores on Apple silicon.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144509
Approved by: https://github.com/jansel
2025-01-14 20:12:38 +00:00
bdd942efd7 Revert "Increase C10_COMPILE_TIME_MAX_GPUS to 128 (#144138)"
This reverts commit 6cfc08167595e27ee9a5701c6426a7a8a7e387ef.

Reverted https://github.com/pytorch/pytorch/pull/144138 on behalf of https://github.com/albanD due to This seems to impact the caffe2 code ([comment](https://github.com/pytorch/pytorch/pull/144138#issuecomment-2590891200))
2025-01-14 19:04:12 +00:00
b4b4e57469 [CD] Enable profiling for XPU Windows nightly wheels (#144316)
PR https://github.com/pytorch/pytorch/pull/144034 added profiling support for torch XPU Windows binary, enable it in PyTorch XPU Windows CD
Works for https://github.com/pytorch/pytorch/issues/114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144316
Approved by: https://github.com/xuhancn, https://github.com/atalman
2025-01-14 19:01:27 +00:00
2683691237 [AOTI] Add a boxed_run API (#142213)
Summary: Fixes https://github.com/pytorch/pytorch/issues/141696. Add a new C++ runner API (boxed_run) following dynamo's boxed calling convention, which steals tensors' ownership from the input tensor list.
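
The boxed convention can be illustrated in plain Python. This sketch is not the C++ API added by the PR; `boxed_run` here is a hypothetical stand-in that shows only the ownership transfer, with a trivial computation in place of the compiled model.

```python
def boxed_run(inputs):
    """Take ownership of the items in `inputs` and leave the caller's list empty."""
    owned = list(inputs)  # move the references into a local list
    inputs.clear()        # the caller's list no longer keeps the inputs alive
    return [x * 2 for x in owned]  # stand-in for running the compiled model
```

Because the callee holds the only remaining references, it can release each input as soon as it is no longer needed, instead of keeping all inputs alive until the call returns.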

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142213
Approved by: https://github.com/ezyang
2025-01-14 18:47:42 +00:00
e2891d43a8 [codemod] Remove unused-variable in caffe2/aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp +1 (#144783)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144783
Approved by: https://github.com/albanD, https://github.com/malfet
2025-01-14 18:34:54 +00:00
ec1c3ab3b2 [inductor][triton] skip test_data_type_propagation if triton (#142054)
Non-cpp inductor backends don't have a `DataTypePropagation` pass on the scheduler nodes so skip the test. CUDA only passes because the device is currently not changed to "cuda" in the test body.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142054
Approved by: https://github.com/eellison
2025-01-14 18:03:00 +00:00
e666807653 [Fix]: Enable support for Arm Neon & SVE support for FP32 Gemm Wrapper (#144327)
**Performance Improvements**:
Linear Layer [ 1x512 * 512x512 ] ->  2x - 4x
Linear Layer [ 3x512 * 512x512 ] -> 2x - 4x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144327
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/cfRod, https://github.com/malfet

Co-authored-by: Crefeda Rodrigues <crefeda.Rodrigues@arm.com>
2025-01-14 17:52:00 +00:00
eee7a47e94 Support FunctionalTensor subclass in is_fake and maybe_get_fake_mode (#144719)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144719
Approved by: https://github.com/bdhirsh
2025-01-14 17:49:11 +00:00
d21738f24a Revert "Fix torch.normal ignores default_device (#144070)"
This reverts commit 184549b2d7e59acfc6e47d121e9ebb50648945b3.

Reverted https://github.com/pytorch/pytorch/pull/144070 on behalf of https://github.com/ezyang due to broken a specific use case ([comment](https://github.com/pytorch/pytorch/pull/144070#issuecomment-2590681953))
2025-01-14 17:41:58 +00:00
7977a3638e [executorch hash update] update the pinned executorch hash (#140769)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140769
Approved by: https://github.com/pytorchbot
2025-01-14 17:38:07 +00:00
f2975717f3 [CD] Fix slim-wheel nvjit-link import problem (#141063)
When another toolkit (say CUDA-12.3) is installed and `LD_LIBRARY_PATH` points to it, `import torch` fails with
```
ImportError: /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12
```
This cannot be worked around by tweaking rpath, as it also depends on the library load order, which is not guaranteed by any linker. Instead, solve this by preloading `nvjitlink` right after global deps are loaded, by running something along the lines of the following
```python
        if version.cuda in ["12.4", "12.6"]:
            with open("/proc/self/maps") as f:
                _maps = f.read()
            # libtorch_global_deps.so always depends in cudart, check if its installed via wheel
            if "nvidia/cuda_runtime/lib/libcudart.so" in _maps:
                # If all abovementioned conditions are met, preload nvjitlink
                _preload_cuda_deps("nvjitlink", "libnvJitLink.so.*[0-9]")
```

Fixes https://github.com/pytorch/pytorch/issues/140797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141063
Approved by: https://github.com/kit1980

Co-authored-by: Sergii Dymchenko <sdym@meta.com>
2025-01-14 17:33:07 +00:00
5c727d5679 [minifier] Fix config generator for callables (#144518)
Summary:
When config contains callables, the current configs generated cannot be run:

```
torch._dynamo.config.reorderable_logging_functions = {<built-in function print>, <function warning at 0x7f774c595630>, <function log at 0x7f774c595870>, <function error at 0x7f774c595510>, <function info at 0x7f774c595750>, <built-in function warn>, <function exception at 0x7f774c5955a0>, <function debug at 0x7f774c5957e0>, <function critical at 0x7f774c5953f0>}
```

We fix the config to generate the right string, so the config is runnable, like below

```
import logging
import warnings
torch._dynamo.config.reorderable_logging_functions = { warnings.warn, logging.warn, print }
```
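A minimal sketch of the idea, not the minifier's actual code: a callable is rendered as an importable `module.name` expression instead of its `repr`. The helper names `callable_to_source` and `render_callable_set` are hypothetical, and the sketch assumes each callable is a top-level function whose `__module__` can be imported directly.

```python
import logging


def callable_to_source(fn):
    """Return (import_line, expression) that re-creates `fn` when executed."""
    module = getattr(fn, "__module__", None)
    name = getattr(fn, "__qualname__", None)
    # Lambdas and nested functions have names like "<lambda>" or "f.<locals>.g"
    # and cannot be referenced from generated source.
    if name is None or "<" in name:
        raise ValueError(f"cannot generate source for {fn!r}")
    if module in (None, "builtins"):
        return None, name  # builtins such as print need no import
    return f"import {module}", f"{module}.{name}"


def render_callable_set(config_name, fns):
    """Render `config_name = { ... }` plus the imports it needs."""
    imports, exprs = set(), []
    for fn in fns:
        import_line, expr = callable_to_source(fn)
        if import_line is not None:
            imports.add(import_line)
        exprs.append(expr)
    body = ", ".join(sorted(exprs))
    return "\n".join(sorted(imports) + [f"{config_name} = {{ {body} }}"])
```

Calling `render_callable_set("cfg", [print, logging.info])` yields source text that can be passed to `exec`, which is the property the fix is after.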

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:utils -- -r test_codegen_config
```

Differential Revision: D67998703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144518
Approved by: https://github.com/desertfire
2025-01-14 17:18:13 +00:00
cbb1ed2966 [1/N] OpenReg: Replace open_registration_extension.cpp with openreg (#141815)
As described in OpenReg [next-steps](https://github.com/pytorch/pytorch/blob/main/test/cpp_extensions/open_registration_extension/README.md#next-steps), here we replace the current `open_registration_extension.cpp` test in PyTorch CI with openreg.

The current `open_registration_extension.cpp` contains two parts:
1. Implementations to support `PrivateUse1` backend.
2. Helper functions used for UTs in `test_cpp_extensions_open_device_registration.py` and `test_transformers.py`.

For the first part, we'll replace it with openreg. For the second part, we'll migrate the helpers to the UT files step by step.

@albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141815
Approved by: https://github.com/albanD
2025-01-14 15:59:00 +00:00
347a74b8f5 Mark CUDA-12.6 as experimental for 2.6 release (#144769)
Because that's the first time we are trying to release it, and it also is the first release to use manylinux2_28
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144769
Approved by: https://github.com/atalman
2025-01-14 15:30:00 +00:00
60d2e32fa4 [BE] Remove lambda from str (#144743)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144743
Approved by: https://github.com/avikchaudhuri, https://github.com/Skylion007
ghstack dependencies: #144471
2025-01-14 15:10:57 +00:00
ffb3f32693 Add max kwarg to torch._check with alternate size oblivious semantics (#144471)
Fixes https://github.com/pytorch/pytorch/issues/120288 for the static bound case

I had been tying myself in knots in the original issue about the fact that we can't really do symbolic bounds like u0 < s0. But then I realized, "Wait, but the static bounds are easy!" So this makes it so you can also exclude a specific upper bound when doing size oblivious tests, which is enough to solve https://github.com/pytorch/pytorch/issues/123592#issuecomment-2574556708

It's written very dirtily, maybe there's some cleanup. Bikeshed on the public API name also welcome.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144471
Approved by: https://github.com/avikchaudhuri
2025-01-14 15:10:57 +00:00
95b41d2aa4 Tests Generalization for multiple accelerator devices (#139749)
Motivation: Generalize unit tests so that they can be executed on both CUDA and non-CUDA devices.
Changes: There are general changes in the common_dtensor module for device type generalization so that tests can be executed on non-CUDA devices too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139749
Approved by: https://github.com/kwen2501
2025-01-14 08:52:46 +00:00
1800f5f461 Enable coalescing path on XPU and dispatch to XPU tensor barrier if XCCL backend is specified. (#143735)
**Motivation:**

- Enable coalescing path on XPU for `batch_isend_irecv`.
- If the XCCL backend is specified, construct an XPU tensor to ensure `barrier` dispatches to the XCCL backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143735
Approved by: https://github.com/kwen2501
2025-01-14 08:37:48 +00:00
21cbee5d9b Drop unused num_elements variable (#144723)
Summary:
With the recent enforcement of unused variable as an error in D67329035, certain tests like
https://www.internalfb.com/intern/test/562950135258426?ref_report_id=0
can't build citing:
```
Action failed: fbcode//caffe2:libtorch_cuda (cfg:linux-x86_64-fbcode-platform010-clang17-no-san#2a7259832b2f5c67) (cxx_compile torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (pic))
Remote command returned non-zero exit code 1
Remote action, reproduce with: `frecli cas download-action a95a6625d2b071a782a7a8ea2882f4adccf103b023df5ccb596f48c506101754:145`
Stdout: <empty>
Stderr:
fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3757:16: error: unused variable 'num_elements' [-Werror,-Wunused-variable]
 3757 |         size_t num_elements = output.numel();
      |                ^~~~~~~~~~~~
1 error generated.
```
This causes Sandcastle to turn off these tests, decreasing protection from other bad diffs. Clean up the unused variable to unblock.

Test Plan:
```
buck2 build --config hpc_comms.use_ncclx=dev --flagfile fbcode//mode/opt fbcode//ftar:ftar_py_e2e_test
```

https://www.internalfb.com/buck2/888dfc68-07eb-4ba1-add5-b38c12d52b33

Reviewed By: c-p-i-o

Differential Revision: D68126236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144723
Approved by: https://github.com/fduwjj, https://github.com/Skylion007

Co-authored-by: Daulet Askarov <dauleta@meta.com>
2025-01-14 08:29:01 +00:00
80eff6e720 [MPS] fix triangular for >3D tensors (#144545)
The old implementation produced incorrect output because it only handled 3D tensors (B, M, N) and did not account for any additional batch dimensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144545
Approved by: https://github.com/malfet
2025-01-14 08:25:01 +00:00
8436a5c2cb [Quant][Inductor][X86] Separate binary post op fusion and lowering for qlinear (#144224)
**Summary**
The current implementation fuses quantized ops and their post ops and lowers the fused op to cpp backend in the same pass. It is better to separate post op fusion and lowering because
- it looks better in terms of design
- we need the post op fusion pass for PT2E quantization eager mode

As one of a series of PRs which do the separation, this PR moves binary post op fusion of qlinear out of the lowering pass to after the weight-prepack pass. The workflow is
1. Weight prepack for qlinear so that `dq - linear` patterns are replaced by `onednn.qlinear_pointwise`
2. Fuse `onednn.qlinear_pointwise` and post ops
3. Lower to cpp backend

This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused.

**Test plan**
It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144224
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2025-01-14 06:46:38 +00:00
c031defe0b [RELAND] Generalize at::manual_seed for all accelerators (#144370)
# Additional Context
This is a reland PR originated from eeb57394f93d720bca498c3fa9d167fc7b9cca46

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144370
Approved by: https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
2025-01-14 06:09:36 +00:00
9d98b66e7b [Inductor][CPP] Enable Epilogue Fusion for Grouped GEMM Template (#143897)
**Summary**
In this PR, we enable epilogue fusion and code generation for Grouped GEMM. Here is a high-level description of how we implement it.

**Fusion**

- The Grouped GEMM Template produces a `Template Buffer` with a `MultiOutputLayout` and a set of `MultiOutput Buffers`, where each buffer corresponds to a specific GEMM.
- During the initial round of fusion, the `Template Buffer` and all associated `MultiOutput Buffers` are fused into a `FusedSchedulerNode` by extending the existing fusion design.
- In subsequent fusion rounds, this `FusedSchedulerNode` can further fuse with its epilogues, following the original fusion design principles.

**Code Gen**
We maintain a list of epilogues and generate code for them one by one.

- If any of the GEMMs has a bias, we create an extra `bias_add` epilogue and prepend it to the epilogue list.
- If any of the GEMMs has no epilogue, we create a `to_bf16` copy epilogue and append it to the end of the epilogue list.
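The two rules above can be sketched as follows. This is a simplified illustration, not the Inductor implementation: the function name `build_epilogue_list`, the dict-based GEMM description, and the flat list of epilogue names are all assumptions made for the example.

```python
def build_epilogue_list(gemms):
    """Build the ordered epilogue list for a group of GEMMs.

    `gemms` is a list of dicts with two hypothetical keys:
      "bias":      True if this GEMM has a bias
      "epilogues": list of epilogue names already fused with this GEMM
    """
    epilogues = []
    for gemm in gemms:
        epilogues.extend(gemm["epilogues"])
    # Rule 1: any bias means an extra bias_add epilogue goes first.
    if any(gemm["bias"] for gemm in gemms):
        epilogues.insert(0, "bias_add")
    # Rule 2: any GEMM without an epilogue needs a to_bf16 copy at the end.
    if any(not gemm["epilogues"] for gemm in gemms):
        epilogues.append("to_bf16")
    return epilogues
```

For example, one GEMM with a bias and a `relu` epilogue plus one plain GEMM gives `bias_add`, then `relu`, then `to_bf16`.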

**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear_epilogue
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143897
Approved by: https://github.com/jansel, https://github.com/jgong5
ghstack dependencies: #143796
2025-01-14 06:07:50 +00:00
25de671ea8 [Inductor][CPP] Enable Grouped GEMM Template (#143796)
**Summary**
Enable the CPP Grouped GEMM Fusion, lowering and Grouped GEMM Template following the RFC: https://github.com/pytorch/pytorch/issues/144012

- Support flexible number of GEMMs
- Share activation across GEMMs
  - The Grouped GEMM Template supports independent activations
  - However, the pattern matcher requires an anchor node, which is the shared activation across GEMMs
- Each GEMM can have a unique weight but same sizes
- Each GEMM can have a unique bias or None
  - Current PR does not yet support biases; this will be addressed in a follow-up epilogue fusion PR
- Each GEMM has its own epilogues
  - Epilogue fusion is not yet supported in this PR and will be enabled in an upcoming follow-up epilogue fusion PR

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear_invalid
python -u -m pytest -s -v test/inductor/test_cpu_cpp_wrapper.py -k test_grouped_linear
```

**Example**
Here is the example and generated code
```
import torch

batch_size = 4
in_features = 512
out_features = 1024
dtype = torch.bfloat16
bias = False  # assumed value: biases are not yet supported in this PR

class M(torch.nn.Module):
    def __init__(self, bias):
        super().__init__()
        self.linear0 = torch.nn.Linear(in_features, out_features, bias=False)
        self.linear1 = torch.nn.Linear(in_features, out_features, bias=False)

    def forward(self, x):
        return self.linear0(x), self.linear1(x)

if __name__ == "__main__":
    with torch.no_grad():
        input = torch.randn(batch_size, in_features, dtype=dtype)
        m = M(bias=bias).to(dtype=dtype).eval()
        cm = torch.compile(m)
        act_res = cm(input)
```

Generated Code:  https://gist.github.com/leslie-fang-intel/ed2e8d23aeb3586eb504feeace692e16#file-grouped-gemm-generated-code-py

**Next Step**

- Support Epilogue fusion

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143796
Approved by: https://github.com/jgong5, https://github.com/jansel
2025-01-14 05:59:07 +00:00
35b46a75f1 [mps/inductor] Add support for round() (#144731)
With this change, inductor/test_view_on_aliased passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144731
Approved by: https://github.com/malfet
2025-01-14 05:56:13 +00:00
17e05cde0c ROCm: Skip tests in elastic/utils/distributed_test (#144692)
The tests fail on ROCm machines with the following error: the client socket has timed out after 1000ms while trying to connect to (gpu4f67.jax.cs.cpe.ice.amd.com, 0).
Disable the tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144692
Approved by: https://github.com/jeffdaily
2025-01-14 03:49:06 +00:00
e58c823ab8 Implement increment and add_to_set for CompileEventLogger (#143427)
This diff implements `increment` and `add_to_set`, which are features of MetricsContext but not of ChromiumEventLogger. This allows us to convert a number of other MetricsContext callsites to use CompileEventLogger instead.
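To illustrate the two operations, here is a toy container with the semantics those names suggest. The class `ToyMetrics` and its method signatures are hypothetical and are not the `CompileEventLogger` or `MetricsContext` API; the sketch only assumes that `increment` accumulates a counter and `add_to_set` collects unique values.

```python
class ToyMetrics:
    """Toy stand-in for a metrics store; not the real PyTorch class."""

    def __init__(self):
        self._data = {}

    def increment(self, key, value=1):
        # Missing counters start at zero.
        self._data[key] = self._data.get(key, 0) + value

    def add_to_set(self, key, value):
        # Missing sets start empty; duplicates are ignored.
        self._data.setdefault(key, set()).add(value)

    def snapshot(self):
        return dict(self._data)
```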

Differential Revision: [D67354867](https://our.internmc.facebook.com/intern/diff/D67354867/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143427
Approved by: https://github.com/masnesral
2025-01-14 02:42:49 +00:00
6053242890 [CD] Enable python3.13t builds for aarch64 (#144698)
But make sure that the right numpy version is picked (2.0.2 does not support 3.13)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144698
Approved by: https://github.com/atalman
ghstack dependencies: #144696, #144697, #144716
2025-01-14 02:29:01 +00:00
b221f88fc1 Leave SCCACHE_S3_KEY_PREFIX empty to share the cache among all build jobs (#144704)
This is a follow-up of https://github.com/pytorch/pytorch/pull/144112#pullrequestreview-2528451214.  After leaving https://github.com/pytorch/pytorch/pull/144112 running for more than a week, all build jobs were fine, but I failed to see any improvement in build time.

So, let's try @malfet's suggestion of removing the prefix altogether to keep it simple.  After this lands, I will circle back to see if there are any improvements.  Otherwise, it's still a simple BE change I guess.

Here is the query I'm using to gather build time data for reference:

```
with jobs as (
    select
        id,
        name,
        DATE_DIFF('minute', created_at, completed_at) as duration,
        DATE_TRUNC('week', created_at) as bucket
    from
        workflow_job
    where
        name like '%/ build'
        and html_url like concat('%', {repo: String }, '%')
        and conclusion = 'success'
        and created_at >= (CURRENT_TIMESTAMP() - INTERVAL 6 MONTHS)
),
aggregated_jobs_in_bucket as (
    select
        --groupArray(duration) as durations,
        --quantiles(0.9)(duration),
        avg(duration),
        bucket
    from
        jobs
    group by
        bucket
)
select
    *
from
    aggregated_jobs_in_bucket
order by
    bucket desc
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144704
Approved by: https://github.com/clee2000
2025-01-14 02:19:38 +00:00
6d56277682 [export] Fix torchbind constant folding (#144684)
Summary: `CallTorchBind` should not be folded during constant folding

Test Plan:
```
buck2 run mode/dev-nosan sigmoid/inference/test:test_passes -- -r test_const_folding_torchbind
```

Reviewed By: henryoier

Differential Revision: D67721272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144684
Approved by: https://github.com/zhxchen17
2025-01-14 01:58:44 +00:00
eaa8a97b39 [RelEng] Add --ami option to build_aarch64 (#144685)
Which should be mutually-exclusive with OS

For example, one can use the following to alloc one-off instance
```
./build_aarch64_wheel.py --alloc-instance  --instance-type g5.4xlarge --key-name nshulga-key --ami ami-0f51103893c02957c --ebs-size 200
```

TODO:
 - Figure out EBS volume name depending on the AMI (for `ami-05576a079321f21f8`(al2023) it's `/dev/xvda`, but for `ami-0f51103893c02957c`(deep learning container) it's `/dev/sda1`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144685
Approved by: https://github.com/atalman
2025-01-14 01:48:27 +00:00
de9d6a25d7 [mps/inductor] Add support for ceil (#144715)
inductor/test_index_dynamic_shapes passes after this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144715
Approved by: https://github.com/malfet
2025-01-14 01:16:47 +00:00
64bcf39180 Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit 388b75edec09182131be0dfe1abeafc5c3b91adf.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2588517060))
2025-01-14 00:48:28 +00:00
dfe06e555d Revert "Stop ignoring mypy errors in torch/testing/_internal/common_utils.py (#144483)"
This reverts commit dcc04e9237292de10e9cedd8213253e253b1e91c.

Reverted https://github.com/pytorch/pytorch/pull/144483 on behalf of https://github.com/kit1980 due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/144441 ([comment](https://github.com/pytorch/pytorch/pull/144483#issuecomment-2588515018))
2025-01-14 00:46:48 +00:00
58302c4eaa [BE] [CD] Remove pygit2 dep for aarch64_wheel build (#144716)
As it's incompatible with 3.13t and only used to fetch the branch name, which could be done by running
```
git rev-parse --abbrev-ref HEAD
```

Also, remove yet another reference to long gone `master` branch.

Test plan:
  Download `manywheel-py3_11-cpu-aarch64.zip` produced by this PR, install it inside a docker container and check its version
```
# pip install torch-2.7.0.dev20250113+cpu-cp311-cp311-manylinux_2_28_aarch64.whl
...
Installing collected packages: mpmath, typing-extensions, sympy, networkx, MarkupSafe, fsspec, filelock, jinja2, torch
Successfully installed MarkupSafe-3.0.2 filelock-3.16.1 fsspec-2024.12.0 jinja2-3.1.5 mpmath-1.3.0 networkx-3.4.2 sympy-1.13.1 torch-2.7.0.dev20250113+cpu typing-extensions-4.12.2
root@434f2540345e:/# python
Python 3.11.9 (main, Aug  1 2024, 23:33:10) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.7.0.dev20250113+cpu'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144716
Approved by: https://github.com/atalman
ghstack dependencies: #144696, #144697
2025-01-14 00:43:46 +00:00
dcc04e9237 Stop ignoring mypy errors in torch/testing/_internal/common_utils.py (#144483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144483
Approved by: https://github.com/Skylion007
2025-01-13 23:19:44 +00:00
c15d6508bd Binary builds Docker images - remove cuda 12.1 (#144575)
Remove cuda 12.1 from manylinux, libtorch and almalinux builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144575
Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/malfet, https://github.com/Skylion007
2025-01-13 22:44:59 +00:00
4f74864c94 Revert "[AOTI] Add a boxed_run API (#142213)"
This reverts commit 868984c3e324dedeac04cf10e2bbfbf912dac3b1.

Reverted https://github.com/pytorch/pytorch/pull/142213 on behalf of https://github.com/kit1980 due to breaking lots of internal builds, see D68036023 ([comment](https://github.com/pytorch/pytorch/pull/142213#issuecomment-2588378262))
2025-01-13 22:43:47 +00:00
a54a784b82 [dynamo][dicts] Consolidate dict(..) construction (#144342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144342
Approved by: https://github.com/StrongerXi
2025-01-13 22:24:56 +00:00
0373cd9950 remove allow-untyped-defs from torch/distributed/checkpoint/api.py (#144653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144653
Approved by: https://github.com/Skylion007
2025-01-13 21:57:19 +00:00
1dab79470d c10::string_view -> std::string_view in pytorch (#143591)
Test Plan: Sandcastle

Differential Revision: D67312322

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143591
Approved by: https://github.com/malfet
2025-01-13 21:44:05 +00:00
5129d6ef51 Fix inductor periodic smoke test wrong artifact (#144694)
I'm not entirely sure why this failure starts to show up in periodic since Friday https://github.com/pytorch/pytorch/actions/runs/12716967189/job/35463656803.  The artifact was uploaded to S3, but `use-gha: anything-non-empty-to-use-gh` was set and it was working.  Maybe this is related to https://github.com/pytorch/pytorch/issues/144479

I also clean up the GCP/AWS A100 selection logic as the GCP cluster doesn't exist anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144694
Approved by: https://github.com/clee2000
2025-01-13 21:42:39 +00:00
e15f91337b [inductor] Add unbacked symints binding in ShapeProp (#144605)
Summary: ShapeProp doesn't know how to propagate unbacked symints. Patch it up to propagate them like PropagateUnbackedSymInts does.

Test Plan:
```
buck run mode/dev-nosan  fbcode//caffe2/test:fx -- -r test_shape_prop_unbacked_sym
```

Differential Revision: D68050073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144605
Approved by: https://github.com/guowentian, https://github.com/pianpwk
2025-01-13 21:30:20 +00:00
3c55669b88 Enable grep_linter to use -a (#144589)
Lintrunner can only apply changes (-a) if only one suggestion is made per file.  The grep_linter makes a suggestion for every line it finds incorrect, so it creates multiple suggestions per file if there are multiple lines that it wants to change

This sets the `line` parameter of the LintMessage to None for all of grep_linter, but I'm not sure if that entry did anything

I'm not sure if enabling -a is the best idea, since it's currently used for tabs and tab width might differ each time?  I had one instance where running with -a caused the spacing to change.  On the other hand, -a would have already worked if only one line was bad
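A rough sketch of the approach, assuming the fix amounts to emitting one whole-file suggestion instead of one per offending line. The `Suggestion` dataclass and `per_file_suggestion` function are hypothetical names for illustration and do not mirror the real `LintMessage` fields.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Suggestion:
    path: str
    line: Optional[int]  # None: the suggestion covers the whole file
    original: str
    replacement: str


def per_file_suggestion(path, text, pattern, repl):
    """Apply `pattern -> repl` to every line and return one suggestion per file."""
    new_text = "".join(
        re.sub(pattern, repl, line) for line in text.splitlines(keepends=True)
    )
    if new_text == text:
        return None  # nothing to fix, so no suggestion at all
    return Suggestion(path=path, line=None, original=text, replacement=new_text)
```

With a tab-to-spaces pattern, a file with two offending lines produces a single `Suggestion` whose `replacement` fixes both, so an apply step only ever sees one patch per file.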
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144589
Approved by: https://github.com/huydhn
2025-01-13 21:18:24 +00:00
91dbd7b75c [BE]: Improve typing inference with TypeIs (#144682)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144682
Approved by: https://github.com/albanD

Co-authored-by: Aaron Orenstein <aorenste@meta.com>
2025-01-13 21:14:31 +00:00
4ceca4d60f [dynamo] Avoid graph break on updates to obj.__dict__ (#144419)
`obj.__dict__` is handled specially in Dynamo, and prior to this patch
we only supported reads and membership checks on that dictionary object.

This patch adds support for writes and some documentation.
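For reference, the three kinds of access look like this in plain Python. The example deliberately avoids `torch` so it runs anywhere; the class `Box` and function `touch` are made up for illustration, and the assumption is that the write on the marked line is the pattern that previously caused a graph break when such a function was compiled.

```python
class Box:
    def __init__(self):
        self.x = 1


def touch(obj):
    d = obj.__dict__
    seen = "x" in d    # membership check (already supported)
    old = d["x"]       # read (already supported)
    d["y"] = old + 1   # write (the case this patch adds)
    return seen, old


b = Box()
seen, old = touch(b)
```

After the call, `b.y` is visible as a normal attribute because writes to `__dict__` and attribute assignment share the same storage.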

Fixes #143756.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144419
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-01-13 21:04:10 +00:00
684d015c2f [AOTI] Support _int_mm (#144571)
Summary: Add _int_mm to the C shim, to resolve a torchao issue, https://github.com/pytorch/ao/pull/1531#issue-2776827015

Differential Revision: [D68030385](https://our.internmc.facebook.com/intern/diff/D68030385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144571
Approved by: https://github.com/yushangdi
2025-01-13 20:32:29 +00:00
b7f95df65b [Feat]: Add Multithreading support for kleidiai groupwise GEMM kernels (#144074)
The KleidiAI groupwise GEMM kernel was not 2D blocked. This change adds support for 2D blocking of the GEMM kernel to split the workload efficiently and speed up the kernel across multiple threads.
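As a generic illustration of 2D blocking (not the KleidiAI kernel itself), the output matrix of shape M x N can be cut into tiles along both dimensions and the tiles handed out to threads. The function names, tile sizes, and the round-robin assignment below are assumptions chosen for the sketch.

```python
def split_2d(m, n, tile_m, tile_n):
    """Return (row_start, row_end, col_start, col_end) tiles covering an m x n output."""
    return [
        (i, min(i + tile_m, m), j, min(j + tile_n, n))
        for i in range(0, m, tile_m)
        for j in range(0, n, tile_n)
    ]


def assign_round_robin(tiles, num_threads):
    """Give thread t every num_threads-th tile, starting at index t."""
    return [tiles[t::num_threads] for t in range(num_threads)]
```

Blocking along both dimensions keeps threads busy even when M is small, which is typical for the pre-fill and decode shapes where splitting by rows alone would leave most threads idle.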

Performance improvements:
7B model Pre-fill  speedup from 145 t/s to 175 t/s

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144074
Approved by: https://github.com/digantdesai
2025-01-13 20:32:23 +00:00
5a2e8fce9d Fix block pointer test module for triton CPU and add to CI (#144474)
- Fix for BlockPointerTestBase._discontiguous_tensor. It defaults to constructing CUDA tensors, causing a failure if CUDA is not available.
- Add test module to CI to prevent errors like the above from occurring.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144474
Approved by: https://github.com/jansel
2025-01-13 20:25:05 +00:00
80c286cbec remove allow-untyped-defs from torch/_C/_dynamo/eval_frame.pyi (#144655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144655
Approved by: https://github.com/StrongerXi
2025-01-13 20:03:25 +00:00
18deff0262 remove allow-untyped-defs from torch/ao/nn/intrinsic/__init__.py (#144652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144652
Approved by: https://github.com/Skylion007
2025-01-13 19:36:08 +00:00
d44c3906b8 [EZ] [CD] Add 3.13 to FULL_PYTHON_VERSIONS (#144697)
Separation was necessary for Conda codegen, but now it's gone
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144697
Approved by: https://github.com/atalman, https://github.com/izaitsevfb
ghstack dependencies: #144696
2025-01-13 19:12:12 +00:00
d2f905760d [EZ] [CD] Eliminate stale TODO (#144696)
3.13 has been enabled across the board, which one can verify by running `./github/regenerate.sh` and observing that none of the configs have changed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144696
Approved by: https://github.com/izaitsevfb, https://github.com/atalman
2025-01-13 19:12:12 +00:00
cd477cdd1d remove allow-untyped-defs from torch/ao/nn/quantized/reference/modules/linear.py (#144656)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144656
Approved by: https://github.com/Skylion007
2025-01-13 19:03:05 +00:00
f93d786f73 remove allow-untyped-defs from torch/nn/parameter.pyi (#144654)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144654
Approved by: https://github.com/Skylion007
2025-01-13 19:02:31 +00:00
983bf604e5 ReshapeTransform: added missing argument in docstring (#144401)
See https://github.com/pytorch/pytorch/pull/144197#discussion_r1907336339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144401
Approved by: https://github.com/janeyx99, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-01-13 17:59:59 +00:00
fe8c5c7a2d Update the Triton DeviceInterface in test/inductor/extension_backends/triton/device_interface.py (#144399)
Following the changes to how `DeviceInterface` is used in this [PR](https://github.com/pytorch/pytorch/pull/142033), the `DeviceInterface` in `extension_backend/triton/device_interface.py` should be updated to return the `DeviceProperties` instead of raising a NotImplementedError.

This PR mirrors the [changes](https://github.com/pytorch/pytorch/pull/142033/files#diff-06553e25e48e1d60f3030458bc46d52067d3d0c3eef2d5fcea29f7e8126bd7c9L112-R114) made in Dynamo when the PR landed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144399
Approved by: https://github.com/jansel
2025-01-13 17:19:58 +00:00
bee84e88f8 [BE][Easy] improve submodule discovery for torch.ao type annotations (#144680)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144680
Approved by: https://github.com/Skylion007
2025-01-13 17:16:19 +00:00
c40d917182 [MPSInductor] Fix maximum/minimum for int types (#144665)
`metal::isnan` is only defined for floats, so provide a generic wrapper
that is false for integral types

TODO: Figure out why type propagation is not working (or should it?)

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144665
Approved by: https://github.com/dcci
2025-01-13 15:14:01 +00:00
8633845090 Support nanj in inductor (#144064)
Fixes https://github.com/pytorch/pytorch/issues/144029
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144064
Approved by: https://github.com/amjames, https://github.com/eellison
2025-01-13 14:29:38 +00:00
417354d953 [mps/inductor] Add support for truncdiv(). (#144666)
Two other inductor tests pass after this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144666
Approved by: https://github.com/malfet
2025-01-13 13:39:38 +00:00
7e2239f1f0 [MPSInductor] Better error when kernel fails to compile (#144649)
Now the error message looks as follows:
```
% python ../test/inductor/test_torchinductor.py -v -k test_cat_unbacked_2d_mps
test_cat_unbacked_2d_mps (__main__.GPUTests) ... inline_call []
stats [('calls_captured', 6)]
inductor [('extern_calls', 2), ('fxgraph_cache_miss', 1)]
aot_autograd [('total', 1), ('autograd_cache_bypass', 1), ('not_ok', 1)]
ERROR

======================================================================
ERROR: test_cat_unbacked_2d_mps (__main__.GPUTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/malfet/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3126, in wrapper
    method(*args, **kwargs)
  File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 12254, in new_test
    return value(self)
  File "/Users/malfet/miniconda3/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 5885, in test_cat_unbacked_2d
    self.common(
  File "/Users/malfet/miniconda3/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 620, in check_model_gpu
    check_model(
  File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 461, in check_model
    actual = run(*example_inputs, **kwargs)
  File "/Users/malfet/git/pytorch/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 704, in _compile_fx_inner
    raise InductorError(e, currentframe()).with_traceback(
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 689, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1149, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1064, in codegen_and_compile
    compiled_fn = graph.compile_to_module().call
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/graph.py", line 1977, in compile_to_module
    return self._compile_to_module()
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/graph.py", line 2018, in _compile_to_module
    mod = PyCodeCache.load_by_key_path(
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/codecache.py", line 2768, in load_by_key_path
    mod = _reload_python_module(key, path)
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/runtime/compile_tasks.py", line 51, in _reload_python_module
    exec(code, mod.__dict__, mod.__dict__)
  File "/var/folders/sc/2thx6_x95h7_h9qs8s48yh140000gn/T/tmpmyfz2ju8/lt/cltm34ognlgcc6oxoe6bexvtbwcdtdfgnkjj5miz7vhkemitacp7.py", line 40, in <module>
  File "/var/folders/sc/2thx6_x95h7_h9qs8s48yh140000gn/T/tmpmyfz2ju8/lt/cltm34ognlgcc6oxoe6bexvtbwcdtdfgnkjj5miz7vhkemitacp7.py", line 32, in _compile_mps_shader
torch._inductor.exc.InductorError: SyntaxError: failed to compile
    kernel void generated_kernel(
        device float* out_ptr0,
        constant float* in_ptr0,
        uint xindex [[thread_position_in_grid]]
    ) {
        long x1 = (xindex) / (3);
        auto tmp0 = x1;
        auto tmp1 = static_cast<long>(tmp0);
        auto tmp2 = 0;
        auto tmp3 = tmp1 >= tmp2;
        auto tmp4 = 2;
        auto tmp5 = tmp1 < tmp4;
        long x0 = (xindex) % (3);
        auto tmp6 = in_ptr0[x0 + 3*(x1)];
        auto tmp7 = tmp5 ? tmp6 : 0.0;
        auto tmp8 = tmp1 >= tmp4;
        auto tmp9 = 2 + ks0;
        auto tmp10 = static_cast<long>(tmp9);
        auto tmp11 = tmp1 < tmp10;
        auto tmp12 = 1.0;
        auto tmp13 = tmp8 ? tmp12 : 0.0;
        auto tmp14 = tmp5 ? tmp7 : tmp13;
        long x2 = xindex;
        out_ptr0[x2] = static_cast<float>(tmp14);
    }
 with program_source:18:25: error: use of undeclared identifier 'ks0'
        auto tmp9 = 2 + ks0;
                        ^

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

To execute this test, run the following from the base repo dir:
    python test/inductor/test_torchinductor.py GPUTests.test_cat_unbacked_2d_mps

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.472s

FAILED (errors=1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144649
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #144647, #144648
2025-01-13 13:38:03 +00:00
a85d1ee106 Update slow tests (#144670)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144670
Approved by: https://github.com/pytorchbot
2025-01-13 12:06:22 +00:00
6e77d7cac5 Add AOTAutogradCache support for cache hot loading APIs (#144499)
This diff adds AOTAutogradCache support to the mega cache.

Differential Revision: [D67991059](https://our.internmc.facebook.com/intern/diff/D67991059/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D67991059/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144499
Approved by: https://github.com/oulgen
2025-01-13 07:07:18 +00:00
a08bd8154e [MPSInductor] Add support for sizevars (#144662)
Just pass them as kernel arguments

After this change, `pytest test/inductor/test_torchinductor.py -v -k _mps` reports 330 failed, 429 passed; before it reported 335 failed, 424 passed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144662
Approved by: https://github.com/jansel
2025-01-13 06:22:38 +00:00
87843ee9ab [export] Unify single and multiple return for hops (#143227)
Summary: Introduce `is_hop_single_tensor_return` field to the `Node` class in serialization so that during deserialization when there is a single return, we know whether it is a tuple of a single element or a single element.
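
The ambiguity the flag resolves can be illustrated with a small round-trip sketch. Only the field name `is_hop_single_tensor_return` comes from the summary above; `NodeSketch`, `serialize_result`, and `deserialize_result` are hypothetical stand-ins, not the real serialization schema.

```python
from dataclasses import dataclass


@dataclass
class NodeSketch:
    # Hypothetical stand-in for the serialized `Node`; only the flag name
    # is taken from the commit summary.
    outputs: list
    is_hop_single_tensor_return: bool = True


def serialize_result(result):
    """Flatten a hop result into a list, remembering whether it was a tuple."""
    if isinstance(result, tuple):
        return NodeSketch(outputs=list(result), is_hop_single_tensor_return=False)
    return NodeSketch(outputs=[result], is_hop_single_tensor_return=True)


def deserialize_result(node):
    """Rebuild the original shape: a bare element or a tuple."""
    if len(node.outputs) == 1 and node.is_hop_single_tensor_return:
        return node.outputs[0]
    return tuple(node.outputs)
```

Without the flag, both `"t"` and `("t",)` would flatten to the same one-element list and could not be told apart on load.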

Test Plan:
```
buck2 run @mode/dev-nosan sigmoid/inference/test:e2e_test_cpu -- -r E2ETestCPUCond
buck2 run @mode/dev-nosan sigmoid/inference/test:test_passes -- -r test_const_folding2
```

Differential Revision: D66991624

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143227
Approved by: https://github.com/zhxchen17
2025-01-13 03:31:14 +00:00
0aa34e9591 Revert "Collect packages with importlib in collect_env (#144616)"
This reverts commit 3541d2a2aaacc4f15ea865c815ce8882577a439c.

Reverted https://github.com/pytorch/pytorch/pull/144616 on behalf of https://github.com/malfet due to Somehow this change causes test_bottleneck_cuda to fail ([comment](https://github.com/pytorch/pytorch/pull/144616#issuecomment-2586095595))
2025-01-13 03:11:04 +00:00
46eeef9130 [MPS][BE] Surface syntax errors shader compilation (#144648)
Before this change
```python
>>> import torch
>>> torch.mps._compile_shader('What')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/malfet/miniconda3/envs/py311/lib/python3.11/site-packages/torch/mps/__init__.py", line 157, in _compile_shader
    return torch._C._mps_compileShader(source)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Failed to create metal library, error: Error Domain=MTLLibraryErrorDomain Code=3 "program_source:1:1: error: unknown type name 'What'
What
^
program_source:1:5: error: expected unqualified-id
What
    ^
" UserInfo={NSLocalizedDescription=program_source:1:1: error: unknown type name 'What'
What
^
program_source:1:5: error: expected unqualified-id
What
    ^
}
```
After this change
```python
>>> import torch
>>> torch.mps._compile_shader('What')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/malfet/git/pytorch/pytorch/torch/mps/__init__.py", line 157, in _compile_shader
    return torch._C._mps_compileShader(source)
SyntaxError: program_source:1:1: error: unknown type name 'What'
What
^
program_source:1:5: error: expected unqualified-id
What
    ^
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144648
Approved by: https://github.com/Skylion007
ghstack dependencies: #144647
2025-01-13 02:03:19 +00:00
9ae35b8bb1 [BE] Introduce c10::SyntaxError (#144647)
Which will be translated into Python's SyntaxError
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144647
Approved by: https://github.com/Skylion007
2025-01-12 23:23:54 +00:00
3541d2a2aa Collect packages with importlib in collect_env (#144616)
If PyTorch is installed system-wide (via an OS package manager) or by an alternative package manager such as `uv`, pip is not available, causing an error in `collect_env`.
However it is still possible to collect exactly the same list using `importlib` API, which is always available.
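
A minimal sketch of the idea, using only the standard-library `importlib.metadata` module. The function name `list_installed_packages` and the `prefixes` filter are assumptions for illustration; the actual `collect_env` logic is not reproduced here.

```python
from importlib import metadata


def list_installed_packages(prefixes=("torch", "numpy")):
    """Return sorted 'name==version' strings without invoking pip.

    `prefixes` is a hypothetical filter chosen for this sketch.
    """
    found = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Broken installs may lack a Name field; skip them.
        if name and name.lower().startswith(prefixes):
            found.add(f"{name}=={dist.version}")
    return sorted(found)
```

Calling `list_installed_packages(("",))` lists every distribution visible to the current interpreter, since every name starts with the empty prefix.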

Fixes #144615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144616
Approved by: https://github.com/malfet
2025-01-12 23:21:08 +00:00
1376116ab1 Config fuzzer (#139736)
This tool makes it easy to search through the config state-space with a minimal reproduction or test. It presents a similar interface to the config bisector by taking a test_function that should either raise an Exception or return False upon failure.

It has two entry points: `fuzz_n_tuple`, which tries every combination of n configs, and `bisect`, which randomly flips configs and tries to find the minimal reproduction upon failure. `bisect` is a much more efficient way to search the space, but `fuzz_n_tuple` can give you peace of mind that a new config will compose with every other config.
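
The exhaustive mode can be sketched as follows. This is an illustration of "try every combination of n configs", not the fuzzer's real code: `fuzz_n_tuple_sketch`, the `config_values` mapping, and the way the assignment is handed to `test_function` are all assumptions.

```python
from itertools import combinations, product


def fuzz_n_tuple_sketch(test_function, config_values, n=2):
    """Try every value assignment for every n-subset of config names.

    `config_values` maps a config name to its candidate values.
    Returns the assignments for which `test_function` raised or returned False.
    """
    failures = []
    for names in combinations(sorted(config_values), n):
        for values in product(*(config_values[name] for name in names)):
            assignment = dict(zip(names, values))
            try:
                ok = test_function(assignment)
            except Exception:
                ok = False
            if ok is False:
                failures.append(assignment)
    return failures
```

The cost grows combinatorially with `n`, which is why a randomized bisect is the cheaper way to search and the n-tuple sweep is mainly useful as a composition check.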

It's been used to find three bugs so far in the inductor config:
https://github.com/pytorch/pytorch/issues/140220 https://github.com/pytorch/pytorch/issues/140219
https://github.com/pytorch/pytorch/issues/143524

This PR also adds a bunch of missing types to the inductor config to get them to play nice with the fuzzer, so it can be a good forcing function for adding types to config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139736
Approved by: https://github.com/eellison
2025-01-12 22:59:02 +00:00
334ee8ba40 Fix a bug for conj_physical (#144391)
Fixes #141426

Fixes a bug in the previous [PR](https://github.com/pytorch/pytorch/pull/141427): the data type should not be converted for conj.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144391
Approved by: https://github.com/jansel
2025-01-12 21:18:17 +00:00
cb66146f2b [BE]: Update literal typing for torch/fx/graph nodelist (#144650)
Mentioned in discussion for #144631

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144650
Approved by: https://github.com/jansel
2025-01-12 21:02:13 +00:00
91a65cbd31 [MPSInductor] Implement check_bounds (#144635)
Although at the moment it returns rather than raises an assert due to https://github.com/pytorch/pytorch/pull/144632

`pytest test/inductor/test_torchinductor.py -v -k _mps` score is `368 failed, 391 passed, 32 skipped`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144635
Approved by: https://github.com/jansel
2025-01-12 21:01:20 +00:00
fd382f1269 Micro-optimization in Graph.nodes.__iter__ (#144631)
This generates slightly better code (removing a generator frame) and
drops a redundant assert.

```py
>>> import timeit
>>> def a():
...   yield from range(3)
...
>>> def b():
...   return range(3)
...
>>> timeit.timeit(lambda: [*a()])
0.2714634328149259
>>> timeit.timeit(lambda: [*b()])
0.12076826114207506
>>>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144631
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2025-01-12 17:46:46 +00:00
de04acaca9 Disable scuba logging for autotuning (#144568)
Summary: the compile IDs are currently null, which is confusing. Turn it off until we have a solution.

Test Plan: https://fburl.com/scuba/dynamo_compile/sandbox/g2d2g5xs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144568
Approved by: https://github.com/jamesjwu
2025-01-12 15:47:14 +00:00
1664033e13 [Functorch] Refactor vmapify autograd function: remove cell mutation (#143811)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143811
Approved by: https://github.com/zou3519
2025-01-12 10:31:23 +00:00
cec245806e [MPSInductor] Implement bitcasts (#144638)
That will be used to compile something like `torch.rand(32, device='mps').view(dtype=torch.int32)`
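
For readers unfamiliar with the term, a bitcast reinterprets the same bits as another type rather than converting the value. A scalar sketch with the standard library (the helper name is made up for this example and has nothing to do with the generated Metal code):

```python
import struct


def bitcast_float32_to_int32(value):
    """Reinterpret the 32 bits of a float32 as a signed int32."""
    return struct.unpack("<i", struct.pack("<f", value))[0]


# 1.0 is 0x3F800000 in IEEE 754 single precision.
print(hex(bitcast_float32_to_int32(1.0)))  # 0x3f800000
```

A value conversion would turn `1.0` into `1`; the bitcast instead yields `1065353216`, which is what `.view(dtype=torch.int32)` is expected to produce element-wise.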

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144638
Approved by: https://github.com/dcci
2025-01-12 06:11:28 +00:00
32a91dedc5 [MPSInductor] Properly generate index expressions (#144632)
Now test_slice_scatter4_mps passes

Before this change test_torchinductor.py reported 422 failed and 337 passed, after this change 412 failed 347 passed.

Fixes https://github.com/pytorch/pytorch/issues/144630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144632
Approved by: https://github.com/dcci
2025-01-12 06:10:05 +00:00
3355103233 [Dynamo] Supports autograd.Function forward returns constant (#144597)
Fixes #144142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144597
Approved by: https://github.com/jansel
2025-01-12 03:53:10 +00:00
e0f67405a1 [mps/inductor] Add support for exp(). (#144606)
inductor/test_silu now passes after this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144606
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-12 00:38:11 +00:00
10887fc139 [BE] Enable test_public_bindings on MacOS (#144591)
I've tried it locally and it works. (One more reason to xfail rather than skip.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144591
Approved by: https://github.com/Skylion007
2025-01-12 00:34:47 +00:00
5e858254d2 [mps/inductor] Add support for trunc(). (#144629)
inductor/test_div1 passes after this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144629
Approved by: https://github.com/malfet, https://github.com/jansel
2025-01-12 00:11:03 +00:00
f6688ac81d remove allow-untyped-defs from torch/distributed/_shard/sharded_tensor/shard.py (#144623)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144623
Approved by: https://github.com/Skylion007
2025-01-12 00:10:42 +00:00
b8aae2773f remove allow-untyped-defs from torch/distributed/_checkpointable.py (#144627)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144627
Approved by: https://github.com/Skylion007
2025-01-12 00:07:26 +00:00
b5485c9f41 remove allow-untyped-defs from torch/_functorch/utils.py (#144626)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144626
Approved by: https://github.com/Skylion007
2025-01-12 00:07:16 +00:00
ad221269b0 remove allow-untyped-defs from torch/distributions/pareto.py (#144624)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144624
Approved by: https://github.com/Skylion007
2025-01-12 00:06:56 +00:00
80b756ed91 remove allow-untyped-defs from torch/jit/_pickle.py (#144625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144625
Approved by: https://github.com/Skylion007
2025-01-12 00:06:25 +00:00
4f406d22a2 Revert "[mps/inductor] Add support for exp(). (#144606)"
This reverts commit 2ccbacfa24cae724ec1ea3bc7de189e5bf948d46.

Reverted https://github.com/pytorch/pytorch/pull/144606 on behalf of https://github.com/malfet due to It now passes MPS-not-supported test ([comment](https://github.com/pytorch/pytorch/pull/144606#issuecomment-2585482477))
2025-01-11 23:51:35 +00:00
eaa24821f2 Revert "[ez] add lint commits to .git-blame-ignore-revs (#144576)"
This reverts commit 49c1f81be84466d015705b1882320919eecffa82.

Reverted https://github.com/pytorch/pytorch/pull/144576 on behalf of https://github.com/janeyx99 due to need to redo with better testing ([comment](https://github.com/pytorch/pytorch/pull/144576#issuecomment-2585456893))
2025-01-11 21:53:00 +00:00
2ccbacfa24 [mps/inductor] Add support for exp(). (#144606)
inductor/test_silu now passes after this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144606
Approved by: https://github.com/malfet
2025-01-11 18:09:33 +00:00
eqy
63569d9745 [CUDA][TF32] Add some missing TF32 decorators to test_nn.py (#144592)
Originally authored by @bilal2vec

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144592
Approved by: https://github.com/Skylion007
2025-01-11 16:20:59 +00:00
eqy
388b75edec [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee
2025-01-11 15:30:38 +00:00
2e3b051154 [XPU] Fix TRITON_XPU_BUILD_FROM_SOURCE (#142850)
Fixes #142849

The idea is to remove the redundant 'git' in the TRITON_XPU_BUILD_FROM_SOURCE=1 case (L29) while keeping it in the pre-built whl installation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142850
Approved by: https://github.com/chuanqi129, https://github.com/benjaminglass1, https://github.com/EikanWang, https://github.com/atalman
2025-01-11 13:11:55 +00:00
b7bef1ca84 [aarch64] fix TORCH_CUDA_ARCH_LIST for cuda arm build (#144436)
Fixes #144037

The root cause is that the CUDA ARM build does not call `.ci/manywheel/build_cuda.sh`, but calls `.ci/aarch64_linux/aarch64_ci_build.sh` instead. Therefore, https://github.com/pytorch/pytorch/blob/main/.ci/manywheel/build_cuda.sh#L56 was not called for the CUDA ARM build.

Adding the equivalent of that code to `.ci/aarch64_linux/aarch64_ci_build.sh` as a workaround (WAR).

In the future, we should aim to integrate the logic in `.ci/aarch64_linux/aarch64_ci_build.sh` back into `.ci/manywheel/build_cuda.sh`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144436
Approved by: https://github.com/atalman
2025-01-11 09:00:46 +00:00
e1d0a2ff30 [Inductor] Restrict ND tiling analysis to MemoryDeps (#144497)
# Issue
https://github.com/pytorch/pytorch/pull/137243 introduced a feature where the ND tiling algorithm analyzes memory dependencies. It iterates over all `Dep`'s of the kernel.  However, the analysis is only applicable to `MemoryDep` instances, which are a subclass of `Dep`. In particular, it doesn't work for `StarDep`'s, for the reasons described here: https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/simd.py#L1653

# Fix
This PR changes the algorithm to only iterate over `MemoryDep` instances.
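
The shape of the fix can be sketched with stand-in classes. `Dep`, `MemoryDep`, and `StarDep` below are hypothetical minimal versions that only mirror the subclass relationship described above; they are not Inductor's real definitions.

```python
class Dep:
    """Stand-in base class for a kernel dependency."""

    def __init__(self, name):
        self.name = name


class MemoryDep(Dep):
    """Stand-in: a dependency that carries an index expression."""

    def __init__(self, name, index):
        super().__init__(name)
        self.index = index


class StarDep(Dep):
    """Stand-in: a whole-buffer dependency with no index to analyze."""


def deps_for_tiling_analysis(deps):
    # Only MemoryDep instances have an index the tiling analysis can inspect.
    return [dep for dep in deps if isinstance(dep, MemoryDep)]
```

Filtering up front keeps the analysis itself unchanged and simply never hands it a dependency type it cannot handle.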

# Testing
Parameterized an existing test for `torch.bucketize` to also run with ND tiling. This test emits a node with `StarDep`'s. Without this PR, the compiler would crash on this test case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144497
Approved by: https://github.com/eellison
2025-01-11 05:16:47 +00:00
e4b2e90e54 Fix broken YAML template after #144574 (#144604)
The YAML syntax is wrong and GitHub complains about it https://github.com/pytorch/pytorch/blob/main/.github/ISSUE_TEMPLATE/pt2-bug-report.yml
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144604
Approved by: https://github.com/wdvr
2025-01-11 05:09:06 +00:00
11082aead3 [Pipelining] Fix FSDP+PP stream sync bug (#144535)
This bug could cause gradient corruption as a race condition exists
between FSDP's reduce-scatter and any operations reading .grad on the
main stream.  The root cause is that pipelining stage .backward implementation
got modified to support zero-bubble and in doing so, invoked .grad()
instead of .backward(), and performed manual gradient accumulation and
manually called into hooks for FSDP.  But one key hook was missed for
FSDP, the '_root_post_backward_final_callback' hook, which is
responsible for syncing the grad reduction ops after the last layer's
backward completes.

Note: this fix applies to both zero-bubble and non-zero-bubble schedules.  This caused some confusion initially, as non-zero-bubble schedules do use torch.autograd.backward() which would have called into fsdp's hooks and synced, unlike zero-bubble which uses .grad() which does not invoke hooks.  However, this difference was already taken into consideration as FSDP's hooks are manually disabled before invoking either type of backward, and then the hooks are manually triggered.

A better fix as a follow up PR would be to invoke .backward() for the
weight grad, so that we never have to disable or manually invoke hooks.

Modified test_pp_dp to intentionally race against FSDP's reduce by
modifying the parameters inplace in a mathematically identical way, and
confirmed it fails intermittently when the FSDP sync is not applied and
passes with the FSDP sync added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144535
Approved by: https://github.com/awgu
ghstack dependencies: #144534
2025-01-11 03:42:15 +00:00
1d3cd7bd09 [Pipelining] Improve test_pp_dp (#144534)
Some refactoring, but important changes include
- initializing the weights properly so there are more nonzero gradients
  flowing, which helped catch the DDP+PP+ZB bug
- skip the test that hits the DDP+ZB+PP bug for now and file an issue
- tighten the tolerances to defaults
- use separate targets instead of same inputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144534
Approved by: https://github.com/H-Huang
2025-01-11 03:27:16 +00:00
8fa47c9455 [dynamo] log compiler collective duration to tlparse chromium trace (#144372)
To show wall time in tlparse for the synchronous compiler collective. Can eliminate the leading hypothesis from https://fb.workplace.com/groups/1075192433118967/permalink/1578670289437843.

<img width="1296" alt="image" src="https://github.com/user-attachments/assets/b17d4efb-8573-43e5-af58-c51af05acb54" />

sample: https://gist.github.com/xmfan/19eeaa80d55a4e7c168e150355ec7392
rank 0: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpr5WNMt/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10
rank 1: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpr5WNMt/rank_1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144372
Approved by: https://github.com/ezyang
2025-01-11 03:10:39 +00:00
0cd9320c7f easy: dynamo_config: sort keys and set values (#143317)
This will create consistent ordering of keys when writing, as well as
sorting sets before serializing
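
A minimal sketch of deterministic serialization along these lines. JSON is used purely for illustration and `serialize_config` is a hypothetical helper; the format the real code writes is not stated above.

```python
import json


def _normalize(value):
    """Replace sets with sorted lists so the output order is stable."""
    if isinstance(value, (set, frozenset)):
        return sorted(value)  # assumes the set elements are mutually comparable
    if isinstance(value, dict):
        return {key: _normalize(item) for key, item in value.items()}
    return value


def serialize_config(config):
    # sort_keys gives a stable key order regardless of insertion order.
    return json.dumps(_normalize(config), sort_keys=True)
```

With both steps in place, two configs holding the same values serialize to the same string, which makes the output diffable and usable as a cache key.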

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143317
Approved by: https://github.com/masnesral
ghstack dependencies: #143307
2025-01-11 03:08:04 +00:00
074aca3ed2 [user triton] add support for @triton.heuristics after @triton.autotune (#142208)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142208
Approved by: https://github.com/zou3519
2025-01-11 02:18:26 +00:00
3753d30273 Revert "Stop ignoring mypy errors in torch/testing/_internal/common_utils.py (#144483)"
This reverts commit 9f09b719d33c61224ebb85baa369a8364063aa6f.

Reverted https://github.com/pytorch/pytorch/pull/144483 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it somehow breaks memory leak checks ([comment](https://github.com/pytorch/pytorch/pull/144483#issuecomment-2585004792))
2025-01-11 02:10:16 +00:00
49c1f81be8 [ez] add lint commits to .git-blame-ignore-revs (#144576)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144576
Approved by: https://github.com/janeyx99
2025-01-11 02:09:46 +00:00
92ddb3d3d3 [MPS] Expose MPSProfiler::start/stopCapture to Python (#144561)
I.e. when the `MTL_CAPTURE_ENABLED` environment variable is set to 1, one should be able to wrap the code with `torch.mps.profiler.metal_capture` to generate a gputrace for shaders invoked inside the context manager.

For example, code below:
```python
import torch
import os

def foo(x):
   return x[:,::2].sin() + x[:, 1::2].cos()

if __name__ == "__main__":
    os.environ["MTL_CAPTURE_ENABLED"] = "1"
    x = torch.rand(32, 1024, device="mps")

    with torch.mps.profiler.metal_capture("compiled_shader"):
        torch.compile(foo)(x)
```
should capture the execution of a `torch.compile` generated shader
<img width="734" alt="image" src="https://github.com/user-attachments/assets/718ff64e-103b-4b11-b66c-c89cfc770b5d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144561
Approved by: https://github.com/manuelcandales
ghstack dependencies: #144559, #144560
2025-01-11 02:05:36 +00:00
c7dbee5106 [reland][export] don't decompose custom triton op when exporting (#144284)
Summary:
A reland of https://github.com/pytorch/pytorch/pull/142426.

Copying the description over here:

For torch.export (strict and non-strict), we don't do functional decomposition. Instead, we preserve the custom triton ops as custom ops. This is because we want the exported program to be high-level and serializable.

The alternative:
If we decompose the custom op to a functional hop and make it a node in the exported program, we need to figure out ways of serializing the hop and its arguments, which can be `triton.jit`-decorated Python functions and triton dtypes. This is undesirable because:

- it can be tedious to maintain a layer that serializes the jitted function (e.g. with a string) and dtypes.
- changes to triton or the serialization logic for triton arguments can be BC breaking.
- the exported program will expose the implementation detail (i.e. triton source code) for a specific backend (GPU) to users, which mixes levels of abstraction.

Future plans:
After this PR, in the short term, we expect users to have a separate aot_compile stage that compiles the exported program into a Cubin file on the same machine where they call export. That stage does the autotuning and removes the triton dependency, so the model can be served with the Cubin. This guarantees that triton changes won't break BC.

In the long term, we may export multiple cubins for the triton op directly.

Test Plan: see new tests.

Differential Revision: D67879685

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144284
Approved by: https://github.com/zou3519
2025-01-11 01:34:35 +00:00
95d333f52e [distributed] Fix _ReaderView.read() and readinto() to stop reading at the end of the slice (#143357)
_ReaderView doesn't work correctly if the slice ends past the view.

read(-1) would call read(-1) on the base_stream, which would consume the entire underlying stream, even if the view ended before that.
read(n) would read n bytes, even if the view ended before that.

The new implementation clamps the size read to the size of the view.

readinto(b) would read len(b) bytes, even if the view ended before that.

Since the interface depends on the size of b, we use a (potentially) shortened view into b to avoid a copy.  If the view doesn't contain enough data to fill the view, then this will appear as end of stream to the caller, which is the desired behavior.

This fix should not be user facing, since the bug is in an internal helper, and is only visible with new code down the stack.
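
The clamping behavior described above can be sketched with a simplified view class. `ReaderViewSketch` is an illustrative stand-in written for this note, not the actual `_ReaderView` implementation, and it assumes a seekable byte stream and a writable byte buffer.

```python
import io


class ReaderViewSketch:
    """Read-only window of `length` bytes starting at `offset` in a stream."""

    def __init__(self, base_stream, offset, length):
        self.base_stream = base_stream
        self.remaining = length
        base_stream.seek(offset)

    def read(self, size=-1):
        # Clamp both read(-1) and oversized reads to what is left in the view.
        if size < 0 or size > self.remaining:
            size = self.remaining
        data = self.base_stream.read(size)
        self.remaining -= len(data)
        return data

    def readinto(self, b):
        # A shortened memoryview limits the read without copying.
        view = memoryview(b)[: self.remaining]
        count = self.base_stream.readinto(view) or 0
        self.remaining -= count
        return count
```

Once the window is exhausted, `read` returns `b""` and `readinto` returns `0`, so the caller sees an ordinary end of stream even though the base stream has more data.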

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143357
Approved by: https://github.com/saumishr
2025-01-11 00:22:10 +00:00
c9afa00a85 update sleef for disable libm on Windows [submodule Sleef] (#142245)
This PR implements the RFC: https://github.com/pytorch/pytorch/issues/141946
Changes:
1. Update `Sleef` to include its PR: https://github.com/shibatch/sleef/pull/603
2. Set `SLEEF_BUILD_WITH_LIBM` to `OFF`, which turns off `Sleef`'s CMake find_library(libm).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142245
Approved by: https://github.com/EikanWang, https://github.com/atalman

Co-authored-by: Eikan Wang <eikan.wang@intel.com>
2025-01-11 00:11:55 +00:00
cyy
6cfc081675 Increase C10_COMPILE_TIME_MAX_GPUS to 128 (#144138)
To facilitate further possible changes of DeviceIndex to int16_t.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144138
Approved by: https://github.com/albanD
2025-01-10 23:53:19 +00:00
b80ecc4457 Revert "Fix poision child process issue when call getAccelerator() (#144368)"
This reverts commit 2583d831d40d6fa64f0b637d5bc7598e484a3283.

Reverted https://github.com/pytorch/pytorch/pull/144368 on behalf of https://github.com/clee2000 due to broke internal tests D68023262, probably the same problem as noted in the issue this PR is mentioned above ([comment](https://github.com/pytorch/pytorch/pull/144368#issuecomment-2584848568))
2025-01-10 23:36:43 +00:00
db2a30932a Revert "Generalize at::manual_seed for all accelerators (#144370)"
This reverts commit eeb57394f93d720bca498c3fa9d167fc7b9cca46.

Reverted https://github.com/pytorch/pytorch/pull/144370 on behalf of https://github.com/clee2000 due to broke internal tests D68023262, probably the same problem as noted in the issue this PR is mentioned above ([comment](https://github.com/pytorch/pytorch/pull/144368#issuecomment-2584848568))
2025-01-10 23:36:43 +00:00
9ec8ecea71 Update documentation.yml 2025-01-10 15:27:28 -08:00
1ff8a1c4eb Update documentation.yml to request english 2025-01-10 15:26:43 -08:00
c7f12a4a7b [MPSInductor] Speedup maximum/minumum ops (#144581)
By relying on the fact that if either `a` or `b` is NaN (or both), then `a + b` is also NaN.

I.e. it replaces
```metal
auto tmp2 = metal::any(metal::isnan(static_cast<decltype(tmp0+tmp1)>(tmp0))) | metal::any(metal::isnan(static_cast<decltype(tmp0+tmp1)>(tmp1))) ? static_cast<decltype(tmp0+tmp1)>(NAN) : metal::max(static_cast<decltype(tmp0+tmp1)>(tmp0), static_cast<decltype(tmp0+tmp1)>(tmp1));
```
with
```metal
auto tmp2 = metal::isnan(tmp0 + tmp1) ? tmp0 + tmp1 : metal::max(static_cast<decltype(tmp0+tmp1)>(tmp0), static_cast<decltype(tmp0+tmp1)>(tmp1));
```

which according to MetalProfiler takes fewer instructions:
<img width="520" alt="image" src="https://github.com/user-attachments/assets/54659392-012b-453e-9c02-c3c5f332074a" />
vs
<img width="1031" alt="image" src="https://github.com/user-attachments/assets/55fcfa78-1ea5-4b0a-8154-d79b3e3cc400" />
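
The NaN-propagation argument can be checked with a scalar sketch in plain Python. `maximum_nan_propagating` is a name made up for this example; it mirrors the structure of the second Metal expression above rather than the generated kernel itself.

```python
import math


def maximum_nan_propagating(a, b):
    """Return NaN if either input is NaN, otherwise the larger value."""
    total = a + b
    # NaN is contagious under addition, so one isnan check covers both inputs.
    if math.isnan(total):
        return total
    return max(a, b)
```

One property of IEEE 754 arithmetic is worth keeping in mind when reading this form: the sum of opposite-signed infinities is also NaN, so the sketch returns NaN for `(inf, -inf)` as well. Whether that input pair matters for the Metal kernel is not stated in the description above.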

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144581
Approved by: https://github.com/dcci, https://github.com/jhavukainen
2025-01-10 22:58:00 +00:00
a94ec0a9a5 [aoti] Remove example inputs from aoti_compile_and_package (#144520)
Summary: The args were removed in https://github.com/pytorch/pytorch/pull/140991

Test Plan: CI

Differential Revision: D67998954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144520
Approved by: https://github.com/yushangdi
2025-01-10 21:56:23 +00:00
6b902e6e1a Update bug-report.yml to make it not look weird
Seems like https://github.com/pytorch/pytorch/pull/144574 did not format as expected.
2025-01-10 13:53:27 -08:00
4daf007b64 Request English for Issues (#144574)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144574
Approved by: https://github.com/albanD
2025-01-10 21:51:15 +00:00
68dad26b95 torch/nn/modules/linear.py: docs: improvements (#138484)
torch/nn/modules/linear.py: docs: improvements
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138484
Approved by: https://github.com/mikaylagawarecki
2025-01-10 20:03:43 +00:00
7a81ba18b9 [export] Add support for serializing symint inputs (#142284)
Fixes https://github.com/pytorch/pytorch/issues/142167
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142284
Approved by: https://github.com/avikchaudhuri
2025-01-10 20:03:26 +00:00
18c1dcb8f3 docs: get rid of copyright year (#144562)
Fixes https://github.com/pytorch/pytorch/pull/144153#pullrequestreview-2540418083
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144562
Approved by: https://github.com/albanD
2025-01-10 19:57:25 +00:00
be5afe16a6 Fix deepcopy hooks (#144531)
Summary: As title, fix bug when a GraphModule doesn't have _deepcopy_hooks attribute

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//torchmultimodal/tests:tests -- --exact 'torchmultimodal/tests:tests - test_albef.py::test_dequeue_and_enqueue'
```

Differential Revision: D68002767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144531
Approved by: https://github.com/BoyuanFeng
2025-01-10 19:55:22 +00:00
10ff6b8894 [export] Add pickle protocol (#142253)
Fixes https://github.com/pytorch/pytorch/issues/142004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142253
Approved by: https://github.com/avikchaudhuri
2025-01-10 19:49:07 +00:00
396630ed78 Update the accuracy results for moco and llama (#144523)
This has been failing in trunk for some time, so let's just update the accuracy results first.  The command I ran was `python benchmarks/dynamo/ci_expected_accuracy/update_expected.py 127f836881e75e0c688619b54a35b018a69d7ee7`.  I also fixed the update script a bit to make it work after https://github.com/pytorch/pytorch/pull/139337

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144523
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2025-01-10 19:40:49 +00:00
99600789c3 [ROCm][Inductor][CK] hackfix for segfault in addmm op (#144519)
This snippet used to cause segfault on GPU due to incorrect input order when invoking the kernel

```python
import os
import torch
import torch.nn as nn

from torch._inductor import config as inductor_config
from torch._inductor.utils import fresh_inductor_cache

M, N, K = 128, 128, 4096
dtype = torch.float16

X = torch.randn(M, N, dtype=dtype).cuda()
A = torch.randn(M, K, dtype=dtype).cuda()
B = torch.randn(K, N, dtype=dtype).cuda()

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, b, x, y):
        return torch.addmm(b, x, y)

import ck4inductor
ck_dir = os.path.dirname(ck4inductor.__file__)

with fresh_inductor_cache():
    with inductor_config.patch(
        {
            "max_autotune_gemm_backends": "CK",
            "autotune_fallback_to_aten": False,
            "compile_threads": 144,
            "rocm.ck_dir": ck_dir,
        }
    ):
        compiled_model = torch.compile(SimpleModel(), mode="max-autotune")
        res = compiled_model(X, A, B)
        res_eager = torch.addmm(X, A, B)
        torch.testing.assert_close(res, res_eager)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144519
Approved by: https://github.com/chenyang78
2025-01-10 19:29:14 +00:00
a37db5ae39 operator benchmark change parsing from regex based to manual (#144297)
The regex-based parser would erroneously split on commas in nested brackets, for example, it would do the following parse which is wrong:
'M: [(32, 16), (64, 32)], ZPB: 2' -> ['M: [(32, 16)', ' (64, 32)]', 'ZPB: 2']

The new manual parser handles this situation the right way:
'M: [(32, 16), (64, 32)], ZPB: 2' -> ['M: [(32, 16), (64, 32)]', 'ZPB: 2']
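
A bracket-aware split of this kind can be sketched as below. `split_top_level` is a hypothetical helper written to reproduce the parse shown above; it is not the code that landed in the benchmark.

```python
def split_top_level(text, sep=","):
    """Split on `sep` only where it is not nested inside (), [] or {}."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    parts.append(text[start:].strip())
    return parts


print(split_top_level("M: [(32, 16), (64, 32)], ZPB: 2"))
# ['M: [(32, 16), (64, 32)]', 'ZPB: 2']
```

Tracking nesting depth is the piece a single regex split cannot express, which is why the commas inside the tuple list are left alone here.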

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144297
Approved by: https://github.com/XuehaiPan, https://github.com/jeffdaily
2025-01-10 19:15:36 +00:00
4f04078aec [CI] Ensure ACL is obtained from GitHub (#141804)
- GitHub tagged releases are the preferred way to obtain ACL.

Please merge this before https://github.com/pytorch/pytorch/pull/138889 so that PyTorch can take GitHub releases going forward instead of mlplatform.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141804
Approved by: https://github.com/snadampal, https://github.com/ng-05, https://github.com/digantdesai
2025-01-10 19:05:02 +00:00
cyy
4abf554882 Use structure binding (#144524)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144524
Approved by: https://github.com/Skylion007
2025-01-10 18:47:35 +00:00
1ce3524277 use collective_comm activity for hccl traces (#144490)
Summary: Use existing collective_comm (currently used for nccl traces) for hccl traces as well. Only init the nccl profiler when KINETO_HAS_NCCL_PROFILER is defined so as to not init it when the build is for MTIA/HCCL

Test Plan: CIs

Differential Revision: D67285333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144490
Approved by: https://github.com/sraikund16
2025-01-10 18:39:35 +00:00
868984c3e3 [AOTI] Add a boxed_run API (#142213)
Summary: Fixes https://github.com/pytorch/pytorch/issues/141696. Add a new C++ runner API (boxed_run) following dynamo's boxed calling convention, which steals tensors' ownership from the input tensor list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142213
Approved by: https://github.com/ezyang
2025-01-10 18:27:00 +00:00
b46d00c1b7 Shard RegisterDispatchKey (#144364)
Should fix https://github.com/pytorch/pytorch/issues/143952 .

Testing: built PyTorch on Raspberry Pi 5; this seemed to alleviate high peak memory requirement. (I did increase shard counts for other generated files along the way, but I need to go back and figure out how much of that was strictly necessary vs. needing to use -j1 or -j2.)

Differential Revision: [D67925496](https://our.internmc.facebook.com/intern/diff/D67925496/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144364
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
ghstack dependencies: #144363
2025-01-10 18:21:19 +00:00
4143312e67 S390x ci periodic tests (#125401)
Periodically run testsuite for s390x

**Dependencies update**
Package z3-solver is updated from version 4.12.2.0 to version 4.12.6.0. This is a minor version update, so no functional change is expected.
The update is needed to build on s390x. PyPI doesn't provide a binary build of z3-solver for s390x for either 4.12.2.0 or 4.12.6.0. Version 4.12.2.0 fails to build with the newer gcc used on s390x builders, but those errors are fixed in version 4.12.6.0, so this minor version bump fixes the build on s390x.

```
# pip3 install z3-solver==4.12.2.0
...
      In file included from /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:53:
      /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp: In member function ‘void* region::allocate(size_t)’:
      /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/tptr.h:29:62: error: ‘uintptr_t’ does not name a type
         29 | #define ALIGN(T, PTR) reinterpret_cast<T>(((reinterpret_cast<uintptr_t>(PTR) >> PTR_ALIGNMENT) + \
            |                                                              ^~~~~~~~~
      /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:82:22: note: in expansion of macro ‘ALIGN’
         82 |         m_curr_ptr = ALIGN(char *, new_curr_ptr);
            |                      ^~~~~
      /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:57:1: note: ‘uintptr_t’ is defined in header ‘<cstdint>’; did you forget to ‘#include <cstdint>’?
         56 | #include "util/page.h"
        +++ |+#include <cstdint>
         57 |
```

**Python paths update**
On AlmaLinux 8 s390x, old paths:
```
python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())'
/usr/lib/python3.12/site-packages
```

Total result is `/usr/lib/python3.12/site-packages/torch;/usr/lib/python3.12/site-packages`

New paths:
```
python -c 'import site; print(";".join([x for x in site.getsitepackages()] + [x + "/torch" for x in site.getsitepackages()]))'
/usr/local/lib64/python3.12/site-packages;/usr/local/lib/python3.12/site-packages;/usr/lib64/python3.12/site-packages;/usr/lib/python3.12/site-packages;/usr/local/lib64/python3.12/site-packages/torch;/usr/local/lib/python3.12/site-packages/torch;/usr/lib64/python3.12/site-packages/torch;/usr/lib/python3.12/site-packages/torch
```

```
# python -c 'import torch ; print(torch)'
<module 'torch' from '/usr/local/lib64/python3.12/site-packages/torch/__init__.py'>
```

`pip3 install dist/*.whl` installs torch into `/usr/local/lib64/python3.12/site-packages`, and later it's not found by cmake with old paths:

```
CMake Error at CMakeLists.txt:9 (find_package):
  By not providing "FindTorch.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "Torch", but
  CMake did not find one.
```

https://github.com/pytorch/pytorch/actions/runs/10994060107/job/30521868178?pr=125401

**Builders availability**
Build took 60 minutes
Tests took: 150, 110, 65, 55, 115, 85, 50, 70, 105, 110 minutes (split into 10 shards)

60 + 150 + 110 + 65 + 55 + 115 + 85 + 50 + 70 + 105 + 110 = 975 minutes used. Let's double it. It would be 1950 minutes.

We have 20 machines * 24 hours = 20 * 24 * 60 = 20 * 1440 = 28800 minutes

We currently run 5 nightly binaries builds, each on average 90 minutes build, 15 minutes test, 5 minutes upload, 110 minutes total for each, 550 minutes total. Doubling would be 1100 minutes.

That leaves 28800 - 1100 = 27700 minutes total. Periodic tests would use 1950 of those, which leaves 27700 - 1950 = 25750 minutes.

Nightly binaries build + nightly tests = 3050 minutes.

25750 / 3050 = 8.44. So we could do both 8 more times for additional CI runs for any reason. And that is with pretty good safety margin.
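
The capacity estimate above can be re-derived with a short script. This is just a sanity check of the arithmetic; the variable names are made up for illustration, and all numbers are taken from the text.

```python
# All durations are in minutes.
build = 60
shards = [150, 110, 65, 55, 115, 85, 50, 70, 105, 110]

periodic = 2 * (build + sum(shards))   # one periodic run, doubled for safety
capacity = 20 * 24 * 60                # 20 machines for a full day
nightly = 2 * 5 * (90 + 15 + 5)        # 5 nightly binary builds, doubled

remaining = capacity - nightly - periodic
extra_runs = remaining / (nightly + periodic)

print(periodic, capacity, nightly, remaining, round(extra_runs, 2))
```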

**Skip test_tensorexpr**
On s390x, PyTorch is built without LLVM.
Even if it were built with LLVM, LLVM currently doesn't support the required features on s390x, and the test fails with errors like:
```
JIT session error: Unsupported target machine architecture in ELF object pytorch-jitted-objectbuffer
unknown file: Failure
C++ exception with description "valOrErr INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_jit.h":34, please report a bug to PyTorch. Unexpected failure in LLVM JIT: Failed to materialize symbols: { (main, { func }) }
```
**Disable cpp/static_runtime_test on s390x**

Quantization is not fully supported on s390x in pytorch yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125401
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-10 18:21:07 +00:00
603e1c0b02 torchgen: move dispatch_helpers out of RegisterDispatchDefinitions.ini (#144363)
The dispatch_helpers should be generated once, not once per kernel namespace.

Differential Revision: [D67925497](https://our.internmc.facebook.com/intern/diff/D67925497/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144363
Approved by: https://github.com/bdhirsh
2025-01-10 18:13:06 +00:00
7a93a58b3c fix typo: "assumbed" (#144543)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144543
Approved by: https://github.com/Skylion007
2025-01-10 17:16:01 +00:00
fdc4f9dde2 Avoid running helper functions as test (#144544)
Pytest considers all symbols starting with `test_` to be test cases/functions and runs them.
`test_compiled_fsdp` is a decorator, but because it is imported into test modules, pytest discovers it as a test.
Rename it to avoid the accidental collection.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144544
Approved by: https://github.com/Skylion007
2025-01-10 17:15:50 +00:00
8dba1ce73b [MPS] Make MPSProfiler usable from C++ (#144560)
By moving `buildTensorString` implementation away from the header
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144560
Approved by: https://github.com/Skylion007
ghstack dependencies: #144559
2025-01-10 17:13:34 +00:00
f604338e31 [MPS] Make sure that MPSStream is usable from C++ (#144559)
It's intended to be, but this was never tested.
This change introduces no new functionality, just properly isolates ObjC implementation details from the potential C++ caller
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144559
Approved by: https://github.com/Skylion007
2025-01-10 17:13:34 +00:00
473b745cb9 Revert "[dynamo] Avoid graph break on updates to obj.__dict__ (#144419)"
This reverts commit c8595ba7d02fea9a5642ebbb60a810d18dc60666.

Reverted https://github.com/pytorch/pytorch/pull/144419 on behalf of https://github.com/clee2000 due to newly added test fails internally D68004708 ([comment](https://github.com/pytorch/pytorch/pull/144419#issuecomment-2583265412))
2025-01-10 16:59:38 +00:00
e6b9e67465 [BE][Opinfo] Delete redundant dtypesIfCUDA (#144512)
If they are the same as CPU, no need to have that extra line

Discovered while reviewing https://github.com/pytorch/pytorch/pull/143833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144512
Approved by: https://github.com/Skylion007
2025-01-10 15:15:38 +00:00
a222029f4e retracing in strict doesn't like dataclass registration (#144487)
Retracing in strict doesn't seem to like dataclass registration. Just refactoring some tests to make this explicit (whereas other export testing variants work fine).

Differential Revision: [D67985149](https://our.internmc.facebook.com/intern/diff/D67985149/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144487
Approved by: https://github.com/angelayi
2025-01-10 12:31:53 +00:00
b2fde28283 [Profiler] Fix device setting error of other backends in torch.profiler (#144237)
In the earlier implementation, if `self.use_device != "cuda"` and `device is None`, we would get `device = "cpu"` from line 401, which is not expected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144237
Approved by: https://github.com/sraikund16
2025-01-10 10:41:11 +00:00
eeb57394f9 Generalize at::manual_seed for all accelerators (#144370)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144370
Approved by: https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
ghstack dependencies: #144368
2025-01-10 09:28:28 +00:00
2583d831d4 Fix poision child process issue when call getAccelerator() (#144368)
# Motivation
fix https://github.com/pytorch/pytorch/issues/144152

# Solution

- Align `at::globalContext()::hasXXX` to determine if accelerator XXX is built with PyTorch or an extension already registered to PyTorch.
- Define `at::hasXXX` to determine if accelerator XXX is available at runtime.
- Use `at::globalContext()::hasXXX` in `getAccelerator` rather than `at::hasXXX` to avoid initializing the XXX runtime (which can poison child processes) while detecting the current accelerator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144368
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/gujinghui
2025-01-10 09:28:27 +00:00
08be9ec312 Migrate from Tuple -> tuple in torch/distributed (#144258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258
Approved by: https://github.com/aorenste
2025-01-10 08:34:54 +00:00
184549b2d7 Fix torch.normal ignores default_device (#144070)
Fixes #122886

1. Make `torch.normal` work with `DeviceContext` so that it picks up the default device set via `set_default_device`.
2. Add a hint to the `set_default_device` doc suggesting the `torch.Tensor.to` method to move a tensor to the desired device explicitly.

**Test Result**
1. **Doc Preview**
![image](https://github.com/user-attachments/assets/eb69c334-be2b-4dc5-bdce-567da21e1635)

2. **Local Test**
```python
>>> import torch
>>> torch.normal(0.,1., (10,10)).device
device(type='cpu')
>>> torch.set_default_device('cuda')
>>> torch.normal(0.,1., (10,10)).device
device(type='cuda', index=0)
```

```bash
pytest test/test_tensor_creation_ops.py
```

![image](https://github.com/user-attachments/assets/8b466b55-f162-4b83-8b20-71de2c1d0914)

```bash
lintrunner
```
![image](https://github.com/user-attachments/assets/5b269c50-da57-47ed-8500-4edf2c2295e4)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144070
Approved by: https://github.com/ezyang
2025-01-10 08:19:55 +00:00
1fe3af2c68 Migrate from Tuple -> tuple in torch/_dynamo (#144261)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144261
Approved by: https://github.com/aorenste, https://github.com/zou3519
2025-01-10 07:45:57 +00:00
f295eff512 [Profiler] Hide Kineto Step Tracker Behind Env Var (#144494)
Summary:
To support iteration-based on-demand we have step tracker hooks for both the scheduler and the optimizer to control Kineto's backend FSM. We already hide the optimizer step tracker behind an ENV_VAR to prevent any extra overhead from the frontend profiler down to the Kineto backend, but we don't do any such thing for the profiler step tracker. It also seems to occasionally cause errors in the FSM when auto-trace and on-demand occur at the same time.

To remedy this issue, let's put in a patch to guard the step incrementer for the frontend step function. This will bypass all of the on-demand logic, which shouldn't occur in auto-trace.

Test Plan:
Ran
`buck run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_resnet_integration_test -- --enable_profiling --trace_handler=auto_trace --with_stack` and added prints in on-demand functions (performLoopStep and collectTrace) and saw that neither were called even though they were called on main.

Also got following healthy traces:

Auto-Trace (schedule-based):
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Jan_09_12_43_37.1122140.pt.trace.json.gz&bucket=gpu_traces

Timing Based On-demand:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/0/1736456722/localhost/libkineto_activities_1286261.json.gz&bucket=gpu_traces

Iteration Based On-demand:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/0/1736456889/localhost/libkineto_activities_1304781.json.gz&bucket=gpu_traces

Differential Revision: D67990080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144494
Approved by: https://github.com/ngimel
2025-01-10 07:00:56 +00:00
8cc8989b26 [Inductor UT] Generalize newly introduced device-bias hard code in (#144456)
Re-land #143975. Fix "cuda" hard code in test_pattern_matcher.py introduced by #139321
Fix #143974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144456
Approved by: https://github.com/EikanWang, https://github.com/malfet, https://github.com/jansel
ghstack dependencies: #144457
2025-01-10 06:55:44 +00:00
e5111d0430 [Inductor UT] Add expected failure for newly added case on XPU, align CUDA. (#144457)
The newly added case `test_randint_distribution` from #143787 was marked as an expected failure for CUDA but not for XPU.
We add the expected failure here because it fails for the same reason as on CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144457
Approved by: https://github.com/EikanWang, https://github.com/malfet, https://github.com/jansel, https://github.com/liangan1
2025-01-10 06:55:44 +00:00
eddf83559e [Intel GPU][Inductor] Convert Conv1D to 2D in inductor (#144140)
Layout optimization in inductor does not apply to Conv1D. We convert Conv1D to channel last Conv2D for better performance on Intel GPU. For example, demucs fp16 inference in torchbench can improve from 149ms to 91ms on Max 1100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144140
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
2025-01-10 06:50:46 +00:00
fbad833538 Migrate from Tuple -> tuple in test/distributed/_composable (#144254)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144254
Approved by: https://github.com/aorenste
2025-01-10 06:38:05 +00:00
3b6b306b71 Migrate from Tuple -> tuple in torch/testing (#144256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144256
Approved by: https://github.com/aorenste
2025-01-10 06:37:55 +00:00
493a52cb72 Refine torch.xpu.get_device_properties API error message (#144379)
# Motivation
Remove the redundant error message.

Without this PR:
```python
>>> import torch
>>> torch.xpu.get_device_name(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/guangyey/repos/stock-pytorch/torch/xpu/__init__.py", line 215, in get_device_name
    return get_device_properties(device).name
  File "/home/guangyey/repos/stock-pytorch/torch/xpu/__init__.py", line 258, in get_device_properties
    raise AssertionError("Invalid device index")
AssertionError: Invalid device index
```

With this PR:
```python
>>> import torch
>>> torch.xpu.get_device_name(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/guangyey/repos/stock-pytorch/torch/xpu/__init__.py", line 215, in get_device_name
    return get_device_properties(device).name
  File "/home/guangyey/repos/stock-pytorch/torch/xpu/__init__.py", line 257, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]  # noqa: F821
RuntimeError: The device index is out of range. It must be in [0, 1), but got 1.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144379
Approved by: https://github.com/EikanWang
2025-01-10 06:27:51 +00:00
4375c2c534 Cleanup gpt_fast benchmark (#144517)
This is an exact copy of https://github.com/pytorch/pytorch/pull/144484, I bricked the last PR running ghstack land :(

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144517
Approved by: https://github.com/davidberard98, https://github.com/huydhn
2025-01-10 05:22:13 +00:00
c8595ba7d0 [dynamo] Avoid graph break on updates to obj.__dict__ (#144419)
`obj.__dict__` is handled specially in Dynamo, and prior to this patch
we only supported reads and membership checks on that dictionary object.

This patch adds support for writes and some documentation.

Fixes #143756.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144419
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-01-10 05:22:04 +00:00
d100a92d33 [CPU][Brgemm] add support for int8 brgemm (#143384)
For INT8 SDPA kernel usage, we add support for INT8 Brgemm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143384
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/ezyang
2025-01-10 04:20:26 +00:00
0529908f13 Remove is_reduced_floating_point from namespace std (#144502)
Partial fix for #144495. Avoiding BC-break using existing practice of removing only if FBCODE_CAFFE2 and C10_NODEPRECATED are not defined.

Differential Revision: [D67992342](https://our.internmc.facebook.com/intern/diff/D67992342/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144502
Approved by: https://github.com/malfet
2025-01-10 03:24:10 +00:00
cyy
9a841f9321 Enable bugprone-unchecked-optional-access (#144226)
We can actually enable bugprone-unchecked-optional-access without the risk of hang.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144226
Approved by: https://github.com/albanD
2025-01-10 03:16:56 +00:00
9f09b719d3 Stop ignoring mypy errors in torch/testing/_internal/common_utils.py (#144483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144483
Approved by: https://github.com/Skylion007
2025-01-10 02:31:43 +00:00
898fcb4590 Simplify vec128 bfloat16/half fmadds (#144486)
I was being silly when I wrote these; it doesn't make sense to do four conversions and two FMAs when we could do a multiply and an add.

Differential Revision: [D67985074](https://our.internmc.facebook.com/intern/diff/D67985074/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144486
Approved by: https://github.com/malfet
2025-01-10 02:25:57 +00:00
d1b64ec326 [export] Fix sym_bool serialization (#144295)
Summary:
When there is a `torch._check()` that checks if a sym_int is equal to some constant, it will generate 3 nodes in the graph with targets `operator.ge`, `operator.le` and `operator.eq`. These operators belong to `_SYM_BOOL_OPS`, but the `meta_val` of these nodes is `bool` instead of `torch.SymBool`.

Similar things can happen to `torch.SymInt`, where a `node.target` belongs to `_SYM_INT_OPS` but `node.meta["val"]` is an `int` instead of `torch.SymInt`.

Therefore, we need to check both `meta_val` type and `node.target` type during serialization.

Test Plan:
```
buck2 run @mode/dev-nosan caffe2/test:test_export -- -r test_sym_bool_torch_check_equal
buck2 run @mode/dev-nosan caffe2/test:test_export -- -r test_sym_int_torch_check_equal
```

Differential Revision: D67883754

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144295
Approved by: https://github.com/avikchaudhuri, https://github.com/angelayi
2025-01-10 02:07:54 +00:00
6de110b862 Support with statement on torch.Stream (#140138)
# Motivation
We propose to support Python with statement on `torch.Stream`. This is a benefit for all accelerators when writing device-agnostic code. The device-specific stream will also be supported because they are generally derived from `torch.Stream`.

With this PR, we can write:
```python
s1 = torch.Stream()
# Set s1 to the current stream
torch.accelerator.set_stream(s1)
with torch.Stream() as s2:
    # Inside with statement, we set s2 to the current stream
    assert torch.accelerator.current_stream() == s2
# Here the current stream should be s1
assert torch.accelerator.current_stream() == s1
```
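
The save/restore behaviour shown above follows the usual context-manager protocol. The toy class below is only a sketch of that pattern in plain Python; `ToyStream` and its `_current` attribute are stand-ins invented for illustration and say nothing about how `torch.Stream` is actually implemented.

```python
class ToyStream:
    """Toy stand-in for a stream object; not torch.Stream."""

    _current = None  # class-level record of the "current stream"

    def __enter__(self):
        # Remember whichever stream was current, then make this one current.
        self._prev = ToyStream._current
        ToyStream._current = self
        return self

    def __exit__(self, exc_type, exc, tb):
        # Restore the previous stream even if the body raised.
        ToyStream._current = self._prev
        return False  # do not swallow exceptions


s1 = ToyStream()
ToyStream._current = s1
with ToyStream() as s2:
    inside = ToyStream._current
after = ToyStream._current
print(inside is s2, after is s1)
```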

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140138
Approved by: https://github.com/albanD
2025-01-10 02:05:19 +00:00
04cb19d225 Add instantiation level to CutlassArgs (#144506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144506
Approved by: https://github.com/huydhn
2025-01-10 02:01:40 +00:00
87c1f76e63 Revert "Migrate from Tuple -> tuple in torch/_decomp (#144260)"
This reverts commit 8db67e03193dd1dbf7ca80cf0eb2f904e18e25ec.

Reverted https://github.com/pytorch/pytorch/pull/144260 on behalf of https://github.com/kit1980 due to Lots of inductor failures ([comment](https://github.com/pytorch/pytorch/pull/144260#issuecomment-2581572235))
2025-01-10 01:47:29 +00:00
bf6dd955cd Fix max(map(...)) (#142443)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142443
Approved by: https://github.com/zou3519
2025-01-10 01:44:37 +00:00
1dd1d532ba [BE] Fix extra-semi warnings in int4mm_kernel.cpp (#144510)
Fixes
```
In file included from /Users/nshulga/git/pytorch/pytorch/build/aten/src/ATen/native/cpu/int4mm_kernel.cpp.DEFAULT.cpp:1:
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/cpu/int4mm_kernel.cpp:998:2: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
};
 ^

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144510
Approved by: https://github.com/kit1980
2025-01-10 01:17:31 +00:00
bd1f5d1c32 update xnnpack for disable libm on Windows [submodule XNNPACK] (#141943)
This PR implements the RFC: https://github.com/pytorch/pytorch/issues/141946
Changes:
1. Update `XNNPACK` to include its PRs https://github.com/google/XNNPACK/pull/7456 and https://github.com/google/XNNPACK/pull/7535, plus other build-fixing PRs.
2. Set `XNNPACK_BUILD_WITH_LIBM` to `OFF`, which turns off `XNNPACK`'s CMake `find_library(libm)` lookup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141943
Approved by: https://github.com/atalman
2025-01-10 00:47:41 +00:00
8db67e0319 Migrate from Tuple -> tuple in torch/_decomp (#144260)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144260
Approved by: https://github.com/aorenste
2025-01-10 00:13:15 +00:00
3607ff2c1d Migrate from Tuple -> tuple in benchmarks/instruction_counts/core (#144253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144253
Approved by: https://github.com/aorenste
2025-01-10 00:12:23 +00:00
a55977f763 Migrate from Tuple -> tuple in torch/ao (#144265)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144265
Approved by: https://github.com/aorenste
2025-01-10 00:12:06 +00:00
08eaaa61ea Inductor dashboard benchmarks: swap unused freeze_autotune_cudagraphs workflow for cppwrapper workflow (#144427)
GitHub limits us to 10 inputs per workflow_dispatch job, so this PR swaps out an input that is no longer used for the cppwrapper input. See [the HUD](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2002%20Jan%202025%2016%3A30%3A07%20GMT&stopTime=Thu%2C%2009%20Jan%202025%2016%3A30%3A07%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/53/orig&lCommit=4c3d3ad3c7886cbda9705b41c6db5fa7da0d6fe9&rBranch=main&rCommit=00df63f09f07546bacec734f37132edc58ccf574) for an example showing that it works and displays sane output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144427
Approved by: https://github.com/desertfire, https://github.com/huydhn
2025-01-09 23:56:00 +00:00
66ce13b497 Revert D67299312: Multisect successfully blamed "D67299312: [AoTI Minifier] UX Improvement" for one test failure (#144475)
Summary:
This diff partially reverts D67299312
D67299312: [AoTI Minifier] UX Improvement by yushangdi causes the following test failure:

Differential Revision: D67963019

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144475
Approved by: https://github.com/zhxchen17, https://github.com/angelayi
2025-01-09 23:27:55 +00:00
91cbeb7db9 [MPSInductor] Fix masked/where for inf values (#144500)
Move the constant-to-value logic into a `value_to_metal` function (similar to `value_to_cpp`).

Call it from the `constant` op as well as the `where` op (which is in turn called from the `masked` op).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144500
Approved by: https://github.com/dcci
2025-01-09 23:11:06 +00:00
b1c2c3967a [dtensor] deprecate _shard_tensor to use src_data_rank=None (#144171)
As titled: we can achieve no-comm sharding for the inference case with
`src_data_rank=None`, so deprecate the private API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144171
Approved by: https://github.com/awgu
2025-01-09 22:26:45 +00:00
379b54603a [Inductor] [bc-breaking] Node Level provenance tracking (#144277)
Summary:

- use GraphTransformObserver + replace_node hooks to track node sources when they are replaced
- add pre_grad_graph tracking to tlparse
- add the node provenance information to post_grad_graph tlparse. This is for the frontend to create a mapping between pre_grad and post_grad graph. See an example frontend (this is just a prototype) here:  https://drive.google.com/file/d/1cMHH_0y4FJUSS9tATwGQvA72O0Lth8eh/view?usp=sharing
- change "action" of NodeSource from a single action to a list of actions.

- It's BC-Breaking because we removed `GraphTransformObserver`'s class methods `on_node_erase` and `on_node_erase` .

https://docs.google.com/document/d/1dGh9myqNhywmbfP0Quzx_f04bghDFlj8cawj8MopiO8/edit?tab=t.0

The front-end code that takes in the tlparse result is in https://github.com/yushangdi/compiler_explorer.
ghstack-source-id: 260390519

Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test:fx -- -r test_graph_transform_observer
buck run mode/dev-nosan  fbcode//caffe2/test:fx -- -r node_source
buck run mode/dev-nosan  fbcode//caffe2/test:fx -- -r graph_provenance
```

Front-end example screenshots on a real model, 93% coverage rate between pre_grad_graph and post_grad_graph

 {F1973584210}{F1973584209}

```
buck2 build --show-output mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true -c fbcode.nvcc_arch=a100,h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark

MODEL_ENTITY_ID=644688112
SNAPSHOT_ID=32
MODULE=merge

TORCH_COMPILE_DEBUG=1 CUDA_VISIBLE_DEVICES=7 TORCH_LOGS="+inductor,+schedule,output_code,graph_code" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 ../buck-out/v2/gen/fbcode/ec86b05dd59e84db/caffe2/torch/fb/model_transform/experimental/benchmark/__mts_gpu_benchmark__/mts_gpu_benchmark.par --local-model /home/bahuang/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR_EP --gpu-trace --aot-inductor-config="{'max_autotune':
True}"

buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:auto_functionalize
```

Differential Revision: D65006709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144277
Approved by: https://github.com/desertfire
2025-01-09 22:06:51 +00:00
28b1960d49 [CUDA] parse arch-conditional compute-capability when building extensions (#144446)
don't choke on arch-conditional compute capabilities e.g., `sm_90a`: #144037
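
As an illustration of what accepting an arch-conditional suffix involves, here is a small standalone sketch. It is not the code from this PR: the helper `parse_arch` and its regex are hypothetical, and it only handles names of the form `sm_<digits>[letter]`.

```python
import re

# Hypothetical pattern: major version, one minor digit, optional letter suffix.
_ARCH_RE = re.compile(r"^sm_(\d+)(\d)([a-z]?)$")


def parse_arch(arch: str) -> tuple[int, int, str]:
    """Return (major, minor, suffix) for names such as 'sm_90a'."""
    match = _ARCH_RE.match(arch)
    if match is None:
        raise ValueError(f"unrecognised arch name: {arch!r}")
    major, minor, suffix = match.groups()
    return int(major), int(minor), suffix


print(parse_arch("sm_90a"))
print(parse_arch("sm_86"))
```

A parser that assumes the part after `sm_` is purely numeric would fail on the trailing `a`; making the suffix an optional group keeps both forms working.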

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144446
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2025-01-09 22:05:18 +00:00
206a932f23 [Submodule] Upgrade to Cutlass 3.6 (#144180)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144180
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-01-09 21:56:53 +00:00
3e7e435bb1 [codemod] Remove unused-variable in caffe2/aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp +2 (#144371)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144371
Approved by: https://github.com/Skylion007
2025-01-09 21:49:17 +00:00
f71688f30d Revert "[Submodule] Upgrade to Cutlass 3.6 (#144180)"
This reverts commit f2c103317814eecf2b622e322e4d0877c16af943.

Reverted https://github.com/pytorch/pytorch/pull/144180 on behalf of https://github.com/huydhn due to Ops, this fails some slow tests.  Please help fix and reland this ([comment](https://github.com/pytorch/pytorch/pull/144180#issuecomment-2581302233))
2025-01-09 21:45:39 +00:00
127f836881 S390x cancelled jobs cleanup (#144149)
Sometimes a job is cancelled while a nested docker container is being created.
This leaves the nested docker container running and the worker hanging forever in the job.
Improve cleanup of nested docker containers for these cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144149
Approved by: https://github.com/seemethere
2025-01-09 20:45:19 +00:00
40305dd37e [onnx] Fix bug for exporting torch.cdist into onnx and support 'compute_mode' (#144213)
### Fix bug for exporting torch.cdist and support 'compute_mode'
In [cdist](https://github.com/pytorch/pytorch/blob/main/torch/onnx/symbolic_opset9.py#L6181), 'compute_mode' was ignored, which makes the computation flow of the exported ONNX file differ substantially from that of the original torch.cdist when computing Euclidean distance (p=2). For Euclidean distance, running the exported ONNX model is about 10x slower than running torch.cdist directly, and it is also very likely to cause unnecessary CUDA OOM for larger matrices.

This change exports the same ONNX computation flow as the forward of torch.cdist, defined at [forward implementation](9225f149eb/aten/src/ATen/native/Distance.cpp (L66-L149).), under every compute_mode.
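
For context, the two computation flows for p=2 differ roughly as sketched below. This is a pure-Python illustration of the underlying math, assuming the matmul-based modes rely on the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y; the function names are made up, and this is neither the ATen kernel nor the ONNX symbolic.

```python
import math


def cdist_direct(xs, ys):
    # Pairwise differences: simple, but materialises every (x, y) difference.
    return [[math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y))) for y in ys]
            for x in xs]


def cdist_via_matmul(xs, ys):
    # Squared norms plus one matrix product of dot products.
    x_sq = [sum(a * a for a in x) for x in xs]
    y_sq = [sum(b * b for b in y) for y in ys]
    out = []
    for i, x in enumerate(xs):
        row = []
        for j, y in enumerate(ys):
            dot = sum(a * b for a, b in zip(x, y))
            # Clamp at zero to guard against small negative rounding errors.
            row.append(math.sqrt(max(x_sq[i] + y_sq[j] - 2.0 * dot, 0.0)))
        out.append(row)
    return out


xs = [[0.0, 0.0], [3.0, 4.0]]
ys = [[0.0, 0.0], [6.0, 8.0]]
print(cdist_direct(xs, ys))
print(cdist_via_matmul(xs, ys))
```

The matmul form avoids building the full tensor of pairwise differences, which is where the memory pressure in the direct form comes from.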

Fixes #144212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144213
Approved by: https://github.com/justinchuby
2025-01-09 20:07:20 +00:00
2b241a8206 Amazon Linux 2023: Preload cusparseLt.so (#144477)
Fixes https://github.com/pytorch/pytorch/issues/144433

Test with some debug statements added:

```
>>> import torch
trying to load libcublas.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cublas/lib/libcublas.so.12']
trying to load libcublas.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cublas/lib/libcublas.so.12
trying to load libcudnn.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn.so.9']
trying to load libcudnn.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn.so.9
trying to load libnvrtc.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cuda_nvrtc/lib/libnvrtc.so.12']
trying to load libnvrtc.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cuda_nvrtc/lib/libnvrtc.so.12
trying to load libcudart.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cuda_runtime/lib/libcudart.so.12']
trying to load libcudart.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cuda_runtime/lib/libcudart.so.12
trying to load libcupti.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cuda_cupti/lib/libcupti.so.12']
trying to load libcupti.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cuda_cupti/lib/libcupti.so.12
trying to load libcufft.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cufft/lib/libcufft.so.11']
trying to load libcufft.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cufft/lib/libcufft.so.11
trying to load libcurand.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/curand/lib/libcurand.so.10']
trying to load libcurand.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/curand/lib/libcurand.so.10
trying to load libnvJitLink.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/nvjitlink/lib/libnvJitLink.so.12']
trying to load libnvJitLink.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/nvjitlink/lib/libnvJitLink.so.12
trying to load libcusparse.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cusparse/lib/libcusparse.so.12']
trying to load libcusparse.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cusparse/lib/libcusparse.so.12
trying to load libcusparseLt.so.*[0-9] from []
trying to load libcusparseLt.so.*[0-9] from /usr/local/lib/python3.9/site-packages/cusparselt/lib/libcusparseLt.so.0
trying to load libcusolver.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cusolver/lib/libcusolver.so.11']
trying to load libcusolver.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cusolver/lib/libcusolver.so.11
trying to load libnccl.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2']
trying to load libnccl.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2
trying to load libnvToolsExt.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/nvtx/lib/libnvToolsExt.so.1']
trying to load libnvToolsExt.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/nvtx/lib/libnvToolsExt.so.1
/usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:275: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
>>> exit()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144477
Approved by: https://github.com/Skylion007, https://github.com/nWEIdia
2025-01-09 20:04:11 +00:00
6bc17b0725 Update #graph breaks for moco benchmark (#144266)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144266
Approved by: https://github.com/zou3519
2025-01-09 18:51:13 +00:00
0e02e6f95f [BE]: Remove redundant contiguous copy in torch/_decomp/decompositions (#144472)
Removes a redundant extra copy caused by calling contiguous. Instead, just add a memory_format flag to the dtype cast.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144472
Approved by: https://github.com/awgu, https://github.com/cyyever, https://github.com/malfet
2025-01-09 18:50:00 +00:00
307ca094c9 [BE]: Remove redundant contiguous copy in flex attention (#144467)
Removes a potentially redundant copy; instead, uses the memory_format kwarg to fuse both operations into a single copy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144467
Approved by: https://github.com/awgu
2025-01-09 18:30:09 +00:00
bbec35f028 [BE]: Replace clone detach with detach clone to be more efficient (#144469)
Follow up to #144270 and fix some vulkan code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144469
Approved by: https://github.com/awgu
2025-01-09 18:28:39 +00:00
73278e6a5d easy: sort dictionary keys for inductor config when publishing (#143307)
This means we should get consistent logging strings for the same
config on different ranks
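
A minimal sketch of the idea, assuming the config is serialized as JSON (the helper name `config_log_string` and the two config keys are illustrative only, not taken from the PR):

```python
import json


def config_log_string(config):
    """Serialize a config dict with its keys in sorted order."""
    return json.dumps(config, sort_keys=True)


# Two ranks that built the same config in a different insertion order.
rank0 = {"max_autotune": True, "cpp_wrapper": False}
rank1 = {"cpp_wrapper": False, "max_autotune": True}
same = config_log_string(rank0) == config_log_string(rank1)
```

Without `sort_keys=True`, the two strings would follow insertion order and could differ even though the configs are equal.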

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143307
Approved by: https://github.com/xmfan
2025-01-09 18:01:20 +00:00
84443bd61a feature_use: Remove JK from naming for feature use. (#143529)
See discussion in https://github.com/pytorch/pytorch/pull/142819 but
TL;DR, since we're logging use but not direct JK reads, it's less
confusing to use the logging

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143529
Approved by: https://github.com/ezyang
2025-01-09 17:58:22 +00:00
b8f383107e Link to transformer tutorial in transformer docs (#144425)
<img width="1045" alt="Screenshot 2025-01-08 at 4 50 20 PM" src="https://github.com/user-attachments/assets/05adfecb-8a23-4c48-9a2c-50c5b3f886b0" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144425
Approved by: https://github.com/albanD
2025-01-09 17:42:09 +00:00
f2c1033178 [Submodule] Upgrade to Cutlass 3.6 (#144180)
Differential Revision: [D67866269](https://our.internmc.facebook.com/intern/diff/D67866269)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144180
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-01-09 17:29:58 +00:00
1365ae859c [ROCm][CI] upgrade CI to ROCm 6.3 (#142152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142152
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-09 17:14:16 +00:00
cyy
b0be30dd79 [19/N] Fix extra warnings brought by clang-tidy-17 (#144448)
Apply more clang-tidy fixes. There was a bug introduced by #144014 due to incorrect namespace concatenation which is reverted here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144448
Approved by: https://github.com/albanD
2025-01-09 15:58:05 +00:00
1353f3beb4 [mps/inductor] Add support for fmod(). (#144449)
397 -> 395 tests failing. The `static_cast<>` is needed because there are several overloads of `fmod()` that are otherwise ambiguous. I wonder if we should take NaN propagation into account (maybe it's not tested).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144449
Approved by: https://github.com/malfet
2025-01-09 15:47:41 +00:00
9631d1a021 [pipelining] throw error with ZB and compile (#143599)
Zero bubble will SIGSEGV when operating on a `torch.compile`'d model, so this raises an error while I am still investigating the cause / design for a fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143599
Approved by: https://github.com/wconstab
2025-01-09 06:53:25 +00:00
3797143e06 Revert "[Quant][Inductor][X86] Separate binary post op fusion and lowering for qlinear (#144224)"
This reverts commit fabf2ea12e18bad3297e2810b77417d71c2a360b.

Reverted https://github.com/pytorch/pytorch/pull/144224 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems that some ARM tests are failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/144224#issuecomment-2579260377))
2025-01-09 06:20:31 +00:00
6f28e466f3 [mps/inductor] Add support for tanh(). (#144443)
Fixes test_tanh() in the inductor testsuite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144443
Approved by: https://github.com/malfet
2025-01-09 06:14:03 +00:00
7f1946aa9b [aot] don't dce aten rng nodes (#144319)
FIXES https://github.com/pytorch/pytorch/issues/143431

For aot_eager backend, we dce twice in aot. The first dce errs on the side of caution and provides a restrictive dce function: 2e1ea8598f/torch/fx/experimental/proxy_tensor.py (L1173)

The second one is more aggressive: 2e1ea8598f/torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py (L185)
But this deviates from eager accuracy when rand ops are dce'd

The repro doesn't work for inductor, but that's a separate issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144319
Approved by: https://github.com/jansel
2025-01-09 05:27:49 +00:00
d4871750d9 [ROCm] Enable post-merge trunk workflow on MI300 runners; skip and fix MI300 related failed tests (#143673)
This PR
* makes changes to the workflow files and scripts so we can run CI workflows on the MI300 runners
* skips or fixes several tests that failed on MI300, as observed in https://github.com/pytorch/pytorch/pull/140989

Skipped due to unsupported Float8_e4m3fn data type on MI300 (need to update test code to use datatypes supported by MI300):
- distributed.tensor.parallel.test_micro_pipeline_tp.py::MicroPipelineTPTest::test_fuse_all_gather_scaled_matmul_A_dims_\*_gather_dim_\* (24 tests across inductor/distributed configs)
- distributed.tensor.parallel.test_micro_pipeline_tp.py::test_fuse_scaled_matmul_reduce_scatter_A_dims_\*_scatter_dim_\* (12 tests across inductor/distributed configs)
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_cast_and_t
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_pattern_2

Skipped due to AssertionError on MI300:
- inductor.test_mkldnn_pattern_matcher.py::test_qconv2d_int8_mixed_bf16
- distributed._tools.test_sac_ilp::TestSACILP::test_sac_ilp_case1

Skipped:
- test_cuda.py::TestCudaMallocAsync::test_clock_speed
- test_cuda.py::TestCudaMallocAsync::test_power_draw
- test_torch.py::TestTorchDeviceTypeCUDA::test_deterministic_cumsum_cuda

Skipped flaky tests on MI300:
- distributed.test_c10d_gloo.py::ProcessGroupGlooTest::test_gather_stress_cuda
- inductor.test_cpu_repro::CPUReproTests::test_lstm_packed_unbatched_False* (256 tests)

Fixed:
- test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_float8_basics_cuda

Features:
- inductor/test_fp8.py - declares a new function that converts FP8 datatypes to ROCm-supported FP8 datatypes. It keeps the same test names for CUDA and ROCm and allows Inductor FP8 tests to be enabled on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143673
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/pruthvistony

Co-authored-by: saienduri <saimanas.enduri@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-09 05:18:57 +00:00
0d08084f1a [Inductor] Add convolution output size checking to the meta function (#144225)
Fixes #144013

Adding a size check to the meta function, similar to the one in the CUDA/CPU aten op.
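
As a rough illustration of what such a check involves, the sketch below applies the standard convolution output-size formula along one spatial dimension and rejects non-positive results. The function name `conv_output_size` and the error text are hypothetical; this is not the meta function from the PR.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension of a convolution."""
    # A dilated kernel spans dilation * (kernel_size - 1) + 1 input positions.
    effective_kernel = dilation * (kernel_size - 1) + 1
    out = (input_size + 2 * padding - effective_kernel) // stride + 1
    if out <= 0:
        raise ValueError(
            f"calculated output size {out} is too small "
            f"(input {input_size}, padding {padding}, effective kernel {effective_kernel})"
        )
    return out


same_size = conv_output_size(32, 3, padding=1)
strided = conv_output_size(7, 3, stride=2)

# A kernel larger than the padded input must be rejected, not silently accepted.
try:
    conv_output_size(2, 5)
    too_small_error = None
except ValueError as exc:
    too_small_error = exc
```

Running the same check during meta (shape-only) evaluation means an invalid configuration fails at trace time instead of producing a tensor with a nonsensical shape.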

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144225
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-01-09 04:20:06 +00:00
fabf2ea12e [Quant][Inductor][X86] Separate binary post op fusion and lowering for qlinear (#144224)
**Summary**
The current implementation fuses quantized ops and their post ops and lowers the fused op to cpp backend in the same pass. It is better to separate post op fusion and lowering because
- it looks better in terms of design
- we need the post op fusion pass for PT2E quantization eager mode

This PR is one of a series of PRs which separate post op fusion and lowering for quantized linear and convolution. It moves binary post op fusion of qlinear out of the lowering pass.
This PR moves the fusion pass from the lowering pass to after the weight-prepack pass. The workflow is
1. Weight prepack for qlinear so that `dq - linear` patterns are replaced by `onednn.qlinear_pointwise`
2. Fuse `onednn.qlinear_pointwise` and post ops
3. Lower to cpp backend

This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused.

**Test plan**
It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144224
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
ghstack dependencies: #143903
2025-01-09 03:27:09 +00:00
bc576355a2 Let aotriton.cmake detect the best binary package to use, and deprecate aotriton_version.txt (#137443)
We do not need `install_aotriton.sh` and `aotriton_version.txt` any more since `aotriton.cmake` now installs the best binary release package as the default option when building pytorch.

This should resolve the issue of needing a pre-installed aotriton package when building PyTorch for ROCm from source, which is not feasible if building PyTorch *outside* a CI docker image. With this change, a user can have a pre-installed AOTriton in their environment, if desired, and have the build pick it up by specifying the `AOTRITON_INSTALLED_PREFIX` env var, or have the build automatically detect and install the compatible version. As a third option, the user can also force AOTriton to build from source instead, using the `AOTRITON_INSTALL_FROM_SOURCE` env var.

Also, with the changes in this PR, the cmake build process handles the tasks of copying aotriton .so and images directory from `torch/lib` to the installation path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137443
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily

Co-authored-by: Jithun Nair <jithun.nair@amd.com>
2025-01-09 00:00:02 +00:00
8ac005ddb8 [DTensor] Add aten.view.dtype op support (#144404)
Fixes https://github.com/pytorch/pytorch/issues/144286

Viewing a tensor to a different dtype does not require any redistribution and can use the default strategy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144404
Approved by: https://github.com/wanchaol
2025-01-08 23:11:22 +00:00
dcc3cf7066 [BE] fix ruff rule E226: add missing whitespace around operator in f-strings (#144415)
The fixes are generated by:

```bash
ruff check --fix --preview --unsafe-fixes --select=E226 .
lintrunner -a --take "RUFF,PYFMT" --all-files
```
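
For illustration, this is the kind of change rule E226 (missing whitespace around an arithmetic operator) asks for inside an f-string; the variable names are made up for the example:

```python
width, height = 3, 4

# Flagged by E226: no whitespace around the arithmetic operator.
before = f"area={width*height}"

# After the fix: same value, operator surrounded by spaces.
after = f"area={width * height}"
```

The fix is purely cosmetic: both strings evaluate to the same text.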

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144415
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2025-01-08 21:55:00 +00:00
a742859fc2 [ONNX] Update images and APIs to onnx_dynamo.rst (#144358)
Update the image of the export result, and delete the functions/classes that belong to `torch.onnx.dynamo_export`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144358
Approved by: https://github.com/justinchuby, https://github.com/malfet
2025-01-08 21:44:43 +00:00
a5164a2b18 [BE] Clean up ExecuTorch Export Docstring (#141490)
Summary: I noticed when looking at the docs for [`torch.export.load`](https://pytorch.org/docs/stable/_modules/torch/export.html#load) that it looked like there was a copy and paste error from the save command docstring since ep is not an actual parameter for load and it says "The exported program to save." This diff removes it from the docstring.

Test Plan: Automated Testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141490
Approved by: https://github.com/JacobSzwejbka
2025-01-08 21:28:58 +00:00
8c5d992772 [Pipelining] Refactor pp composability test to use faster MPCT (#144345)
* Using the MultiProcessContinuousTest (MPCT) base class is faster (60s vs 279s for
  the full run of `test_manual_with_data_parallel` and all its
  parametrizations)
* Have to move to a new file to use MPCT since it requires a different
  launcher style in `__main__`
* Propose to reorganize the composability tests anyway, since
  `test/_composable/test_composability/test_pp_composability` is an
  annoyingly long path
* rename `test_manual_with_data_parallel` to `test_pp_dp` for
  simplicity/consistency with newer test names.  (manual refers to not
  using tracer frontend, but that's not so important to call out in the
  test name)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144345
Approved by: https://github.com/H-Huang, https://github.com/mori360
2025-01-08 20:50:12 +00:00
c194e5c986 Remove extra copy torch/_prims (#144407)
updated _reshape_aten

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144407
Approved by: https://github.com/awgu
2025-01-08 20:14:48 +00:00
628acc4ace Dirichlet.mode: use dim= instead of axis= (#144402)
`axis=` is undocumented and will raise typing errors when #144197 is merged.

See: https://github.com/pytorch/pytorch/pull/144197#pullrequestreview-2537398866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144402
Approved by: https://github.com/Skylion007
2025-01-08 20:14:01 +00:00
ab1f627aa4 fix randint distribution for large max (#143787)
Fixes #ISSUE_NUMBER
Similar to #143682, for large maximum values we were sampling integers via %, which doesn't provide a uniform distribution. Here we limit the max skew to approx 1% (random32 is used for max values `<= 2**32 / 128`)
This comes with a significant perf penalty, especially for cuda, but it's a pretty bad bug, so we'll have to figure out what can be done to improve it.
`torch.compile` has always been producing correct results for this, and its performance is also significantly better than current eager (eager is ~660 GB/s on H100, torch.compile 1200 GB/s), so we have to figure out why torch.compile is better.
`__launch_bounds__` slightly regresses perf, so perhaps we can figure out how to specify them better, but it's only 20-30 GB/s, so the big difference is still unexplained.
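
The skew from sampling with `%` can be worked out exactly. The sketch below is a plain-Python calculation, not the kernel code, and `modulo_skew` is a hypothetical helper: it compares how often the most and least likely outputs of `r % n` occur when `r` is uniform over a fixed number of raw values.

```python
def modulo_skew(num_random_values, n):
    """Relative excess of the most likely output of `r % n` over the least likely.

    `r` is uniform over range(num_random_values); requires n <= num_random_values.
    """
    low = num_random_values // n                       # hits for the least likely outputs
    high = low + (1 if num_random_values % n else 0)   # hits for the most likely outputs
    return (high - low) / low


tiny = modulo_skew(8, 3)                     # outputs 0 and 1 occur 3 times, output 2 twice
at_limit = modulo_skew(2**32, 2**25 - 1)     # just below 2**32 / 128
far_above = modulo_skew(2**32, 3 * 2**30)    # large max with only 32 random bits
```

For any `n <= 2**32 / 128`, each output is hit at least 128 times, so the skew is at most 1/128 (about 0.8%), which matches the "approx 1%" bound above. In the `far_above` case some outputs are twice as likely as others.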

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143787
Approved by: https://github.com/eqy
2025-01-08 18:51:48 +00:00
0e1675a89b Relax aten.to restriction (#142420)
Summary: if we have a.to(b), and b has a different dtype from a, then the result must be a copy. In this case, we do not need to freeze the tensor. Instead, we use torch.ops.aten._assert_tensor_metadata.default to ensure that a does not have the same dtype as b.

Fixes https://github.com/pytorch/pytorch/issues/139718

Update executorch pin to include https://github.com/pytorch/executorch/pull/7277.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  test_float_conversion
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  test_device_to_mutation_float
```

Differential Revision: D66988295

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142420
Approved by: https://github.com/bdhirsh
2025-01-08 18:11:31 +00:00
768d73f692 use torch.special.xlogy to implement x_log_x (#144220)
Fixes #144279

Using `x * x.log()` does not produce the correct value when `x=0`.
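
The failure comes from IEEE 754 arithmetic: `log(0)` is `-inf`, and `0 * -inf` is NaN. `torch.special.xlogy` is documented to return 0 when its first argument is 0. The sketch below reproduces both behaviors for scalars in plain Python; the two function names are hypothetical and this is not the PyTorch implementation.

```python
import math


def naive_x_log_x(x):
    """x * log(x), taking log(0) = -inf as floating-point tensor libraries do."""
    log_x = math.log(x) if x > 0 else -math.inf
    return x * log_x


def xlogy_style_x_log_x(x):
    """x * log(x) with the xlogy convention that the result is 0 when x == 0."""
    return 0.0 if x == 0 else x * math.log(x)


naive_at_zero = naive_x_log_x(0.0)        # 0.0 * -inf is NaN
fixed_at_zero = xlogy_style_x_log_x(0.0)  # defined as 0, the limit of x * log(x)
```

Away from zero the two functions agree, so only the `x=0` case changes.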

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144220
Approved by: https://github.com/Skylion007
2025-01-08 17:41:55 +00:00
cyy
d0070ca07e [18/N] Fix extra warnings brought by clang-tidy-17 (#144014)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144014
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-01-08 17:21:55 +00:00
373541fbf4 [BE]: Remove unnecessary copy of gradients in util (#144329)
No need to copy gradients to CPU too

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144329
Approved by: https://github.com/awgu, https://github.com/cyyever
2025-01-08 16:52:15 +00:00
e14c36d3f4 Set maximum supported version of Python as 3.13 (#144396)
Same as https://github.com/pytorch/pytorch/pull/119743 Required for Release 2.6.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144396
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/malfet
2025-01-08 16:16:10 +00:00
3068ce0337 ROCm SDPA: Ensure attn_mask has the same dtype with q (#143242)
This is required by current AOTriton's backend.

Fixes NaN when calling SDPA ME backend with `q.dtype() != attn_mask.dtype()` when training llama2 using transformers+deepspeed+pytorch

Corresponding CUDA check seems to be here:
708ce3c008/aten/src/ATen/native/transformers/cuda/attention.cu (L1331-L1336)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143242
Approved by: https://github.com/jeffdaily
2025-01-08 15:20:26 +00:00
708ce3c008 Add is_dtype_supported predicate to DeviceInterface (#144355)
By default, it returns true unless the dtype is bf16

For MPS device it will return false if dtype is double

Check that it works by refactoring `test_inf`, which should expect a TypeError to be raised if invoked with an unsupported dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144355
Approved by: https://github.com/jansel, https://github.com/dcci
2025-01-08 13:59:46 +00:00
8fc0ffe54b [mps/inductor] Add support for rsqrt(). (#144374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144374
Approved by: https://github.com/malfet
2025-01-08 13:58:05 +00:00
f700035090 [3.13t] use sysconfig to check for Python nogil builds (#144361)
`sys._is_gil_enabled()` wasn't working in certain cases, according to @atalman
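
A minimal sketch of a build-time check, assuming the `Py_GIL_DISABLED` config variable described in CPython's free-threading documentation is the one consulted (the PR text does not name the variable, and `is_free_threaded_build` is a hypothetical helper):

```python
import sysconfig


def is_free_threaded_build():
    """True when the interpreter was built with the GIL disabled (e.g. 3.13t)."""
    # The variable is missing (None) or 0 on regular builds and 1 on free-threaded ones.
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


nogil_build = is_free_threaded_build()
```

Unlike `sys._is_gil_enabled()`, which reports whether the GIL is active at runtime, this reflects how the interpreter was compiled.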

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144361
Approved by: https://github.com/atalman
2025-01-08 13:00:32 +00:00
a5051a9521 Update torch.masked.mean to upcast dtype for bool tensors (#139999)
When calling `torch.masked.mean(...)` with a boolean tensor, the dtype is inferred to be bool. When the mean is being computed, the sum operator is used. When the sum operator is used with dtype=torch.bool, the result is clamped to True (1) leading to an incorrect mean being calculated.

The below example shows how the incorrect result occurs:
```
import torch

a = torch.tensor([True, True])
count = torch.sum(torch.ones(a.shape, dtype=torch.int64)) # 2
total = torch.sum(a, dtype=torch.bool) # True (1)
mean = total / count # 0.5
```

This PR upcasts the dtype used for the summation to int32 in the case of bool tensors, allowing the correct result to be computed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139999
Approved by: https://github.com/cpuhrsch
2025-01-08 10:35:19 +00:00
60a505022f [AMD] SDPA internal changes (#144320)
Summary: All the internal changes needed to enable flash attention w/ SDPA in fbcode.

Test Plan:
```
TORCH_ROCM_FA_PREFER_CK=1  buck run -m rocm621  mode/opt-amd-gpu scripts/xdwang/example:sdpa

+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|   Batch Size |   Sequence Length |   Heads |   Head Dim |   Flash Time (µs) |   Math Time (µs) |   xformers Time (µs) |   Flash TFlops |   Math TFlops |   xformers TFlops |   Speedup (Flash/Math) |   Speedup (xformers/Math) | xformers trace_url   | Flash trace_url   |
+==============+===================+=========+============+===================+==================+======================+================+===============+===================+========================+===========================+======================+===================+
|            1 |              4096 |      32 |         64 |           455.552 |          7748.76 |              513.449 |        301.698 |       17.7369 |           267.678 |                17.0096 |                   15.0916 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |              4096 |      16 |        128 |           329.971 |          4741.11 |              386.049 |        416.519 |       28.9888 |           356.014 |                14.3683 |                   12.2811 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      32 |         64 |          1455.76  |         31869.6  |             1665.49  |        377.642 |       17.2501 |           330.087 |                21.8921 |                   19.1353 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      16 |        128 |          1265.77  |         18972.8  |             1479.48  |        434.325 |       28.976  |           371.588 |                14.9891 |                   12.824  |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      32 |         64 |          5732.99  |        121861    |             6816.77  |        383.573 |       18.0453 |           322.59  |                21.2562 |                   17.8767 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      16 |        128 |          4749.69  |         73776.4  |             5404.03  |        462.982 |       29.8066 |           406.923 |                15.5329 |                   13.6521 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|   Batch Size |   Sequence Length |   Heads |   Head Dim |   Flash Time (µs) |   Math Time (µs) |   xformers Time (µs) |   Flash TFlops |   Math TFlops |   xformers TFlops |   Speedup (Flash/Math) |   Speedup (xformers/Math) | xformers trace_url   | Flash trace_url   |
+==============+===================+=========+============+===================+==================+======================+================+===============+===================+========================+===========================+======================+===================+
|            1 |              4096 |      32 |         64 |           1615.41 |          8342.67 |              1822.72 |        212.7   |       41.1855 |           188.508 |                5.16443 |                   4.57705 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |              4096 |      16 |        128 |           1357.97 |          5943.53 |              1432.34 |        253.022 |       57.8104 |           239.886 |                4.37676 |                   4.14953 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      32 |         64 |           5556.5  |         31726.7  |              6502.17 |        247.348 |       43.3197 |           211.374 |                5.70984 |                   4.8794  |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      16 |        128 |           5186    |         22529.4  |              5590.36 |        265.019 |       61.0044 |           245.85  |                4.34427 |                   4.03004 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      32 |         64 |          22527.7  |        130413    |             26527.6  |        244.035 |       42.155  |           207.239 |                5.789   |                   4.91613 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      16 |        128 |          18347.9  |         87553.2  |             20358    |        299.628 |       62.791  |           270.044 |                4.77184 |                   4.30068 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+

```

Reviewed By: leitian, feikou, yoyoyocmu, sijiac

Differential Revision: D67262726

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144320
Approved by: https://github.com/jianyuh, https://github.com/eqy, https://github.com/leitian
2025-01-08 09:29:28 +00:00
7d9f26de05 Revert "Unskipped multiple inductor tests for ROCm (#143581)"
This reverts commit e05d67790ee4a53c310322829631c000f0ac2985.

Reverted https://github.com/pytorch/pytorch/pull/143581 on behalf of https://github.com/huydhn due to There are some tests failing on ROCm jobs in trunk ([comment](https://github.com/pytorch/pytorch/pull/143581#issuecomment-2577163274))
2025-01-08 09:15:14 +00:00
aaf56152ea [cpu/sorting] Throw an error when trying to sort complex numbers. (#144113)
It doesn't really make sense to sort complex numbers as they are not comparable.
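
Python's built-in `sorted` takes the same stance, which makes for a quick analogy (this is plain Python, not the PyTorch CPU sort kernel): ordering complex values raises an error unless the caller supplies an explicit real-valued key.

```python
values = [3 + 4j, 1 + 0j, 0 + 2j]

# Complex numbers define no "<", so a plain sort fails.
try:
    sorted(values)
    sort_error = None
except TypeError as exc:
    sort_error = exc

# Sorting works once an explicit real-valued key is chosen, e.g. the magnitude.
by_magnitude = sorted(values, key=abs)
```

Choosing the key explicitly makes the intended ordering (magnitude, real part, and so on) the caller's decision.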

Fixes #129296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144113
Approved by: https://github.com/malfet
2025-01-08 05:15:36 +00:00
78eded8e00 [ONNX] Use torch.export.Dim.AUTO in dynamo_export (#144356)
Align to the changes in https://github.com/pytorch/pytorch/pull/143158
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144356
Approved by: https://github.com/justinchuby
2025-01-08 05:00:16 +00:00
90e81a157a Migrate from Tuple -> tuple in torch/utils/data (#144255)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144255
Approved by: https://github.com/andrewkho
2025-01-08 04:09:45 +00:00
8ccf3f6f3f [dynamo][easy] Move dict tests to test_dicts.py (#144165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144165
Approved by: https://github.com/jansel
ghstack dependencies: #143997
2025-01-08 03:56:33 +00:00
2ac41404a8 [dynamo][dicts] Guarding lazily on dict keys (#143997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143997
Approved by: https://github.com/jansel
2025-01-08 03:56:33 +00:00
e05d67790e Unskipped multiple inductor tests for ROCm (#143581)
All of them should be fine to run now after the triton fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143581
Approved by: https://github.com/jataylo, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-08 03:55:33 +00:00
28b4992e7a Set prop_kind to forward_inference when grad is not needed for mkldnn_convolution_pointwise (#142855)
`prop_kind` of MKLDNN convolution is always `dnnl_forward`, i.e., `dnnl_forward_training` , regardless of whether grad is needed. Setting `prop_kind` to `dnnl_forward_inference` for mkldnn_convolution_pointwise could have better performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142855
Approved by: https://github.com/jgong5
2025-01-08 02:22:06 +00:00
f8fcb9e7d3 [Quant][Inductor][X86] Separate unary post op fusion and lowering for qlinear (#143903)
**Summary**
The current implementation fuses quantized ops and their post ops and lowers the fused op to the cpp backend in the same pass. It is better to separate post op fusion and lowering because
- it looks better in terms of design
- we need the post op fusion pass for PT2E quantization eager mode

This PR is the first of a series of PRs which separate post op fusion and lowering for quantized linear and convolution. It moves unary post op fusion of qlinear out of the lowering pass.
This PR moves the fusion pass from the lowering pass to after the weight-prepack pass. The workflow is
1. Weight prepack for qlinear so that `dq - linear` patterns are replaced by `onednn.qlinear_pointwise`
2. Fuse `onednn.qlinear_pointwise` and post ops
3. Lower to cpp backend

This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused.

**Test plan**
It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143903
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2025-01-08 01:55:53 +00:00
094ca3154d Fix torch._refs.tensor error with empty list (#143461)
Fixes #143216

**Test Result**

**Before**

```python
>>> import torch
>>> torch._refs.tensor([])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/_refs/__init__.py", line 6614, in tensor
    new_tensor = _internal_new_from_data(
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_refs/__init__.py", line 6596, in _internal_new_from_data
    tensor = _recursive_build(inferred_scalar_type, data)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_refs/__init__.py", line 6545, in _recursive_build
    return torch.stack([_recursive_build(scalarType, item) for item in seq])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: stack expects a non-empty TensorList

```

**After**

```python
>>> torch._refs.tensor([])
tensor([])
>>> torch._refs.tensor([], device='cuda')
tensor([], device='cuda:0')
```

```bash
$ pytest test/test_tensor_creation_ops.py -k test_refs_tensor
```

![image](https://github.com/user-attachments/assets/5be4c17a-bea6-4b7b-bec1-b4fcb417a8cd)

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/e8f88f41-78ac-4337-b53f-2e524de2bec0)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143461
Approved by: https://github.com/ezyang, https://github.com/soulitzer
2025-01-08 01:29:00 +00:00
9e6a6389ce [functorch] clean up asserts in test_dims.py (#144276)
For better debuggability of issues encountered in e.g., #141730 when trying to migrate to python 3.12/3.13

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144276
Approved by: https://github.com/Skylion007
2025-01-08 01:21:40 +00:00
013c796b1e Eliminate c10::optional usage in PyTorch (#144346)
Differential Revision: D67907427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144346
Approved by: https://github.com/hl475
2025-01-08 01:14:04 +00:00
f002825e1e added __add__ and __mul__ hints to torch.Size (#144322)
Fixes #144218

`Size` returns `Size`, whereas `tuple` returns `tuple`: 9f28171658/stdlib/builtins.pyi (L985-L988)

- Use `SupportsIndex` instead of `int` in `__getitem__` (supported at runtime)
- `Size.__add__` overrides `tuple.__add__`; the latter supports adding tuples of non-integral types.
- Added typing unit tests.
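
A rough sketch of the runtime behaviour the hints describe, using a hypothetical `ShapeLike` tuple subclass as a stand-in for `torch.Size` (this is not the actual stub or implementation):

```python
from typing import SupportsIndex


class ShapeLike(tuple):
    """Hypothetical stand-in for torch.Size: a tuple subclass holding ints."""

    def __add__(self, other: tuple) -> "ShapeLike":
        # tuple.__add__ returns a plain tuple, so re-wrap to keep the subclass.
        return ShapeLike(tuple.__add__(self, other))

    def __mul__(self, n: SupportsIndex) -> "ShapeLike":
        # Same idea for repetition: keep the subclass type.
        return ShapeLike(tuple.__mul__(self, n))


shape = ShapeLike((2, 3))
added = shape + (4,)
repeated = shape * 2
plain = (2, 3) + (4,)
```

Here `added` and `repeated` stay `ShapeLike`, while `plain` is an ordinary `tuple`, which is the distinction the type hints need to express.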

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144322
Approved by: https://github.com/Skylion007
2025-01-08 01:02:11 +00:00
06ea81336f [Inductor UT] Remove expected failure for aoti test_fft_c2c (#144238)
Since #143223 enabled runtime dispatch for fft_c2c in AOTI mode, on XPU we can fall back to CPU for fft_c2c, which has no XPU implementation, so the case now passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144238
Approved by: https://github.com/jansel
2025-01-08 00:49:32 +00:00
96f4abba17 [dtensor] move all tests to distribute/tensor folder (#144166)
as titled, mainly moving files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144166
Approved by: https://github.com/Skylion007
2025-01-08 00:32:33 +00:00
7c9cf287c2 [ONNX] Handle list values as 0d inputs (#144343)
Handle list values as 0d inputs instead of 1d, as the `SymInt`s are expected to be 0d tensors in ONNX.

This PR reshapes int64 values into 1D tensors in a list, assuming they are 0D tensors initially.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144343
Approved by: https://github.com/gramalingam, https://github.com/titaiwangms
2025-01-08 00:15:50 +00:00
9ee242213b [RFC] Introduce cache hot loading APIs (a.k.a. "Mega-cache") (#143341)
This PR essentially introduces two new APIs
* torch.compiler.save_cache_artifacts
* torch.compiler.load_cache_artifacts

which aim to create a mega cache experience where the user can start collecting cache artifacts, and later call the save API to fetch them. In the next attempt, the user can "hot load" the cache artifacts via the load function.

This bundling approach reduces the need to rely on porting individual files one by one, or relying on many network requests.

Note that these APIs CANNOT log to structured logging as these functions will be called before and after compilation, as opposed to during compilation. Due to this limitation, the API returns a struct that the user can log with.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143341
Approved by: https://github.com/jansel
2025-01-07 23:13:24 +00:00
c2c50d5f00 Fixed doc where more than one device specified since only one device is used (#17553) (#144043)
Fixes #17553

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144043
Approved by: https://github.com/soulitzer
2025-01-07 23:06:52 +00:00
430d54ee20 [Dynamo] Add functorch C++ bindings as in graph functions (#144309)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144309
Approved by: https://github.com/williamwen42
ghstack dependencies: #144306, #144307, #144308
2025-01-07 22:25:01 +00:00
d146763f6f [Dynamo] Inline functions in torch._ops (#144308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144308
Approved by: https://github.com/williamwen42
ghstack dependencies: #144306, #144307
2025-01-07 22:25:01 +00:00
242a4a3f83 [Dynamo] Inline functions in torch._functorch.pyfunctorch (#144307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144307
Approved by: https://github.com/williamwen42
ghstack dependencies: #144306
2025-01-07 22:24:53 +00:00
4417be65e5 [Dynamo] Inline functions in torch._functorch.autograd_function (#144306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144306
Approved by: https://github.com/williamwen42
2025-01-07 22:24:46 +00:00
3beb7006dd c10::optional -> std::optional in a few places (#144340)
Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144340
Approved by: https://github.com/malfet
2025-01-07 21:09:39 +00:00
f4969c8235 fix torch.compile + ddp + non-reentrant AC pack hook firing count (#144271)
FIXES https://github.com/pytorch/pytorch/issues/144035

In order to preserve hook firing semantics, we disabled pack/unpack hooks for torch.compile: https://github.com/pytorch/pytorch/pull/123196. In DDP under torch.compile, there's this other callsite that we need to disable hooks for

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144271
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
2025-01-07 21:08:52 +00:00
861b65fe74 [Easy] Fix linalg.norm hint message typo (#144323)
Fixes #136454

**Test Result**

**Before**

```python
>>> import torch
>>> from torch import linalg
>>>
>>> my_tensor = torch.tensor([[[8., -3., 0., 1.]]])
>>>                            # ↓ ↓ ↓ ↓ ↓
>>> linalg.norm(input=my_tensor, ord='fro', dim=(0, 1, 2)) # Error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: linalg.norm: If dim is specified, it mut be of length 1 or 2. Got [0, 1, 2]
>>>                            # ↓ ↓ ↓ ↓ ↓
>>> linalg.norm(input=my_tensor, ord='nuc', dim=(0, 1, 2)) # Error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: linalg.norm: If dim is specified, it mut be of length 1 or 2. Got [0, 1, 2]

```

**After**

```python
>>> import torch
>>> from torch import linalg
>>>
>>> my_tensor = torch.tensor([[[8., -3., 0., 1.]]])
>>>                            # ↓ ↓ ↓ ↓ ↓
>>> linalg.norm(input=my_tensor, ord='fro', dim=(0, 1, 2)) # Error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: linalg.norm: If dim is specified, it must be of length 1 or 2. Got [0, 1, 2]
>>>                            # ↓ ↓ ↓ ↓ ↓
>>> linalg.norm(input=my_tensor, ord='nuc', dim=(0, 1, 2)) # Error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: linalg.norm: If dim is specified, it must be of length 1 or 2. Got [0, 1, 2]

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144323
Approved by: https://github.com/Skylion007, https://github.com/soulitzer
2025-01-07 20:34:16 +00:00
d38af6e8bc [ca] dedup node names when AOT bwd graph is reused multiple times (#144202)
This error started popping up in HUD CA benchmarks:
```python
 File "/data/users/xmfan/core/b/pytorch/torch/_dynamo/compiled_autograd.py", line 371, in dce
    self.fx_tracer.graph.eliminate_dead_code(is_impure)
  File "/data/users/xmfan/core/b/pytorch/torch/fx/graph.py", line 1862, in eliminate_dead_code
    self.lint()
  File "/data/users/xmfan/core/b/pytorch/torch/fx/graph.py", line 1753, in lint
    raise RuntimeError(f"Node redefined name {node.name}!")
RuntimeError: Node redefined name aot0_expand!
```

We added CA initial capture's renaming (https://github.com/pytorch/pytorch/pull/133148) to help debug issues with AOT backward, but it errors out when we have multiple instances of the same AOT backward. This likely only showed up now because of increased hierarchical graph reuse. I fix it by adding a postfix counter to the node name
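
A minimal sketch of the postfix-counter idea, assuming a hypothetical `make_name_deduper` helper (the real change lives in compiled autograd and may differ in detail):

```python
from collections import defaultdict


def make_name_deduper():
    """Return a function that makes repeated node names unique."""
    seen = defaultdict(int)

    def unique(name):
        count = seen[name]
        seen[name] += 1
        # First use keeps the name; later uses get a numeric postfix.
        # Note: this sketch does not guard against a pre-existing "name_1".
        return name if count == 0 else f"{name}_{count}"

    return unique


unique = make_name_deduper()
names = [unique("aot0_expand"), unique("aot0_expand"), unique("aot0_sum")]
```

With the same AOT backward captured twice, the second `aot0_expand` is renamed instead of tripping the `Node redefined name` lint.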

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144202
Approved by: https://github.com/bdhirsh, https://github.com/jansel
2025-01-07 20:23:09 +00:00
72e8f34715 [AoTI Minifier] UX Improvement (#143330)
Summary:
- When a user specifies the `TORCHINDUCTOR_MAX_AUTOTUNE=1` env variable, we add `config.max_autotune=True` to the generated minifier_launcher
- We should do this to other inductor configs as well in a followup Diff

Currently in the dynamo and aoti minifiers, if a config is overwritten by an env variable, the config will not show up in the config list in the minifier_launcher.py file. As a result, when running the minifier_launcher, users need to re-apply the same env variable.
This is:
1) not convenient for the users
2) if they copy-paste the minifier_launcher.py to us without including the env variable, we could be confused and not able to reproduce the error.

Underlying implementation change:

- Add `env_default` parameter to `codegen_config()`. If set, configs overridden by the env are not considered default.
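
To illustrate the intent, here is a simplified sketch, assuming a hypothetical `codegen_config_sketch` function and plain dicts rather than the real config module: a value that came from the environment is compared against the true default, so it is emitted into the launcher.

```python
def codegen_config_sketch(defaults, env_overrides, user_overrides):
    """Emit `config.x = value` lines for every entry that differs from its default."""
    lines = []
    for key in sorted(defaults):
        # Precedence assumed here: explicit user setting, then env, then default.
        value = user_overrides.get(key, env_overrides.get(key, defaults[key]))
        if value != defaults[key]:
            # An env-provided value is treated as non-default, so it is written out.
            lines.append(f"config.{key} = {value!r}")
    return "\n".join(lines)


generated = codegen_config_sketch(
    defaults={"max_autotune": False, "debug": False},
    env_overrides={"max_autotune": True},  # e.g. from TORCHINDUCTOR_MAX_AUTOTUNE=1
    user_overrides={},
)
```

The generated text then carries `config.max_autotune = True`, so the launcher reproduces the run without the env variable.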

Test Plan:
```
 buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:utils -- -r test_codegen_config
```

Differential Revision: D67299312

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143330
Approved by: https://github.com/jansel, https://github.com/eellison
2025-01-07 20:04:19 +00:00
096cb874d3 remove allow-untyped-defs from torch/_prims/executor.py (#144233)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144233
Approved by: https://github.com/Skylion007
2025-01-07 19:40:40 +00:00
0aa74d0ab9 Skip L1 cache for single-use buffers (#143115)
### 1. Synopsis

Adds `cache_modifier='.cg'` optional argument into `tl.load` instructions in the inductor-generated triton code for selected buffers.

It makes the `tl.load` instruction skip the L1 cache for short-lived / non-reused data.

### 2. Using the feature

This feature is experimental and disabled by default. It can be enabled by setting the environment variable `TORCHINDUCTOR_SKIP_L1` to `1`.

### 3. Results

For a simple pointwise addition kernel:
```python
@torch.compile
def add_dummy(x: torch.Tensor, y: torch.Tensor):
    return x+y
```
we get (bandwidth performance is in GB/s):

(a) feature DISABLED:
![image](https://github.com/user-attachments/assets/6caaf775-f083-4943-a61f-8a1bcb154387)

(b) feature ENABLED:
![image](https://github.com/user-attachments/assets/9286be7d-c6ff-4a33-a023-77cb5cc87ff6)

### 4. Caveats

The feature boost is only available when using
```python
torch._dynamo.config.cache_size_limit = 64 # or any other sufficiently big number..
torch._dynamo.config.automatic_dynamic_shapes = False   # use static shapes
```
When using (the default) dynamic shapes, only 1-2 triton kernels are generated with non-optimal block-sizes for
*all* the cases (vector sizes), hiding any perf benefit from skipping the L1 cache.

In the static case, as an optimal block size is generated for each vector size, the perf benefit of skipping the L1 cache becomes visible.

This block-size optimization issue is a larger problem in pytorch inductor and is outside the scope of this feature.

### 5. References

- [tl.load](https://triton-lang.org/main/python-api/generated/triton.language.load.html#triton.language.load)
- [cache operators](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cache-operators)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143115
Approved by: https://github.com/jansel
2025-01-07 19:35:40 +00:00
355b0bc7e3 [typing] Add type hints to @property and @lazy_property in torch.distributions. (#144110)
Fixes #76772, #144196
Extends #144106

- added type annotations to `lazy_property`.
- added type annotation to all `@property` and `@lazy_property` inside `torch.distributions` module.
- added a simple type-check unit test to ensure type inference is working.
- replaced deprecated annotations like `typing.List` with the corresponding counterpart.
- simplified `torch.Tensor` hints with plain `Tensor`, otherwise signatures can become very verbose.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144110
Approved by: https://github.com/Skylion007
2025-01-07 19:27:36 +00:00
aa69d73e6b [ROCm] fix torch.layer_norm invalid configuration problem when input is large tensor (#144007)
Fixes #136291

This PR fixes the `invalid configuration argument` error that occurs on ROCm when `torch.layer_norm` is called with a large input tensor.

```
 File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/functional.py", line 2573, in layer_norm
    return torch.layer_norm
RuntimeError: HIP error: invalid configuration argument
```

After investigation, I found the cause: the AMD compute language runtime checks whether `gridDim.x * blockDim.x` is greater than `std::numeric_limits<uint32_t>::max()`. If it is, it errors out with the "invalid configuration argument" message.

The fix is to split the whole task into several chunks so that no chunk triggers the failure condition. This ensures correctness and completeness given the current kernel implementation logic of `vectorized_layer_norm_kernel`.
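
The chunking arithmetic can be sketched in plain Python. `split_grid` and the block size of 128 threads are assumptions for illustration, not the values used by the actual kernel launch:

```python
UINT32_MAX = 2**32 - 1


def split_grid(total_blocks, threads_per_block, limit=UINT32_MAX):
    """Split a launch into (start_block, num_blocks) chunks that each satisfy
    num_blocks * threads_per_block <= limit."""
    max_blocks = limit // threads_per_block
    chunks = []
    start = 0
    while start < total_blocks:
        n = min(max_blocks, total_blocks - start)
        chunks.append((start, n))
        start += n
    return chunks


# Assumption: one block per normalized row of a [16, 3000, 3000, 16] input.
rows = 16 * 3000 * 3000
chunks = split_grid(rows, threads_per_block=128)
```

Each chunk stays under the runtime limit, and the chunks together cover every row exactly once.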

Also added a largeTensor layer_norm unit test `test_layer_norm_large_tensor` with the same shape `[16, 3000, 3000, 16]` as the one used by the pytorch issue #136291 so that the unit test can check the expected output value to ensure correctness.

Future work may include performance optimization of layer_norm and CK layer_norm integration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144007
Approved by: https://github.com/eqy
2025-01-07 19:17:02 +00:00
6c54963f75 Revert "[dtensor] move all tests to distribute/tensor folder (#144166)"
This reverts commit 2e1ea8598f477322965c28fb52e6e5f53876d8dd.

Reverted https://github.com/pytorch/pytorch/pull/144166 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but inductor/test_compiled_autograd needs to be updated ([comment](https://github.com/pytorch/pytorch/pull/144166#issuecomment-2575969871))
2025-01-07 18:31:36 +00:00
e4a05dec0f [BE][Ez]: Fix docs recommending inefficient tensor op order (#144270)
`detach().clone()` is faster than `.clone().detach()` since the gradients are not cloned. Let's update all the documentation and tests so that users do not use the inefficient op ordering.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144270
Approved by: https://github.com/awgu, https://github.com/XuehaiPan
2025-01-07 17:31:32 +00:00
8d35333498 [CD] Aarch64 builds should not override OVERRIDE_PACKAGE_VERSION envvar (#144285)
Currently our nightly aarch64 binaries have the correct suffixes +cpu or +cu126, but release binaries are missing these suffixes. To correct this and make sure our nightly and release binaries are consistent, I propose this change.

I see that override is already set correctly in release workflow:
https://github.com/pytorch/pytorch/actions/runs/12383179841/job/34565381200

For CPU:
```
OVERRIDE_PACKAGE_VERSION="2.6.0+cpu"
```

For CUDA:
```
OVERRIDE_PACKAGE_VERSION="2.6.0+cu126"
```

The removed code would set `OVERRIDE_PACKAGE_VERSION="2.6.0"` for both CUDA and CPU builds of release binaries.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144285
Approved by: https://github.com/malfet, https://github.com/tinglvv
2025-01-07 12:50:54 +00:00
12fdb93ebd fix non-strict placeholder naming with kwargs (#144278)
Fixes https://github.com/pytorch/pytorch/issues/143732

Differential Revision: [D67872055](https://our.internmc.facebook.com/intern/diff/D67872055/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144278
Approved by: https://github.com/yushangdi, https://github.com/pianpwk
2025-01-07 11:22:09 +00:00
c3b28491c8 [caffe2] Add AVX512 support for box_cox operator (#143627)
Summary:
Reuse the templatized implementation of the box_cox caffe2 operator.
* Duplicate .cc file of AVX2
* change intrinsics functions to use AVX512 instructions
* override templates
* extend the caller to use new methods
* guard AVX512 with a gflag to allow smooth transition

Differential Revision: D67433457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143627
Approved by: https://github.com/hl475
2025-01-07 09:54:39 +00:00
bf7747e935 Tests Generalization for multiple accelerator devices (#139184)
Motivation: Generalize unit tests so that they can be executed on CUDA and non-CUDA devices.
Dependency: #133209, merged now.
There was a previous PR, #135242, for these changes; it was closed due to incorrect commits. I have incorporated the changes as suggested in the comments.
@kwen2501  @zeshengzong Please review the changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139184
Approved by: https://github.com/kwen2501

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
2025-01-07 09:04:38 +00:00
2e1ea8598f [dtensor] move all tests to distribute/tensor folder (#144166)
as titled, mainly moving files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144166
Approved by: https://github.com/Skylion007
2025-01-07 06:45:14 +00:00
d0f5df83a5 [ca] add test_dtensor_compile.py to compiled autograd tests (#144107)
more than half the tests use autograd, pass rate 19/26

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144107
Approved by: https://github.com/zou3519, https://github.com/bdhirsh, https://github.com/jansel
2025-01-07 05:16:14 +00:00
fcf9dc3b11 Migrate from Tuple -> tuple in benchmarks (#144259)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144259
Approved by: https://github.com/yanboliang
2025-01-07 04:09:52 +00:00
2e42be0595 Use random64 in Fisher-Yates algorithm for large N (#143682)
Fixes bug in randperm https://nbsanity.com/static/a4774194938414dedcec7d6e99727d31/Shuffling_20in_20torch_20vs_20numpy-public.html
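
A rough Python sketch of a Fisher-Yates shuffle that switches to 64-bit draws for large N. The `wide_threshold` cutoff and the function name are assumptions, not the values used in the actual kernel, and the modulo reduction here keeps a small bias that a production implementation may handle differently:

```python
import random


def randperm_sketch(n, seed=0, wide_threshold=2**32):
    """Return a permutation of range(n) using Fisher-Yates."""
    rng = random.Random(seed)
    # Assumption: narrow draws are fine for small n; use 64 bits once n is large
    # enough that a 32-bit draw could not cover (or would badly bias) the range.
    bits = 64 if n >= wide_threshold else 32
    perm = list(range(n))
    for i in range(n - 1):
        # Pick j uniformly-ish from [i, n) and swap it into position i.
        j = i + rng.getrandbits(bits) % (n - i)
        perm[i], perm[j] = perm[j], perm[i]
    return perm


perm = randperm_sketch(10)
wide_perm = randperm_sketch(10, wide_threshold=1)  # force the 64-bit path
```

Both paths return a valid permutation; only the width of the random draw differs.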

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143682
Approved by: https://github.com/eqy, https://github.com/albanD, https://github.com/malfet
2025-01-07 03:48:56 +00:00
551f104153 [mps/inductor] Add support for sign(). (#144298)
Drive-by fix of a test name while I was at it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144298
Approved by: https://github.com/malfet
2025-01-07 03:33:26 +00:00
a3ab27b8e0 Migrate from Tuple -> tuple in torch/_inductor (#144264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144264
Approved by: https://github.com/eellison
2025-01-07 03:27:27 +00:00
778d953951 Revert "[AsyncMM] re-enable and prepare for cutlass 3.5.1 update (#144011)"
This reverts commit 24ac87392bc4e0060a90483643f7df5611988ae5.

Reverted https://github.com/pytorch/pytorch/pull/144011 on behalf of https://github.com/malfet due to Not sure what is going on, but lots of builds are failing ([comment](https://github.com/pytorch/pytorch/pull/144011#issuecomment-2574317669))
2025-01-07 03:24:01 +00:00
f4e9aebbcc Revert "Update torch.masked.mean to upcast dtype for bool tensors (#139999)"
This reverts commit 0742b2366e7ba65e0437a17b09a3bb0804ae51ea.

Reverted https://github.com/pytorch/pytorch/pull/139999 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a landrace and fails a test in trunk ([comment](https://github.com/pytorch/pytorch/pull/139999#issuecomment-2574283986))
2025-01-07 02:42:55 +00:00
168c2cb3f3 remove allow-untyped-defs from torch/nn/utils/_deprecation_utils.py (#144231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144231
Approved by: https://github.com/albanD
2025-01-07 02:22:22 +00:00
24ac87392b [AsyncMM] re-enable and prepare for cutlass 3.5.1 update (#144011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144011
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-01-07 02:15:42 +00:00
73a6a40346 [Inductor][CPP] Fix outer loop fusion buffer removed (#144243)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/144186. For the test case reported in the issue, we saw some nodes with `LoopNest`

-  `LoopNest(loops=[LoopLevel(var=x0, size=8, offset=0, tiled_size=0, steps=1, parallel=0, simd_omp=False, simd_vec=False, collapsed=False, is_reduction=False), LoopLevel(var=x1, size=8, offset=0, tiled_size=0, steps=1, parallel=0, simd_omp=False, simd_vec=False, collapsed=False, is_reduction=True)], kernel=<torch._inductor.codegen.cpp.CppKernelProxy object at 0x7fc724426680>)`

- `LoopNest(loops=[LoopLevel(var=x0, size=8, offset=0, tiled_size=0, steps=16, parallel=0, simd_omp=False, simd_vec=True, collapsed=False, is_reduction=False), LoopLevel(var=x1, size=8, offset=0, tiled_size=0, steps=16, parallel=0, simd_omp=False, simd_vec=True, collapsed=False, is_reduction=True)], kernel=<torch._inductor.codegen.cpp.CppKernelProxy object at 0x7fc75c2cae60>)`

Although these 2 `LoopNest`s have the same `range` and `var`, they have different `steps` (1 and 16), so they fail to be merged with outer loops. Since we have already removed the global buffers when localizing the buffer, we need to restore the status of `V.graph.removed_buffers` before falling back to codegen without outer loop fusion.
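
The restore-before-fallback pattern can be sketched as follows. All names are hypothetical, and signalling the failed fusion with an exception is an assumption made for brevity; the real code operates on `V.graph.removed_buffers`:

```python
def codegen_with_fallback(removed_buffers, fused_codegen, plain_codegen):
    """Try the fused path; on failure, undo its bookkeeping and fall back."""
    snapshot = set(removed_buffers)  # copy, so later mutation cannot leak in
    try:
        return fused_codegen(removed_buffers)
    except RuntimeError:
        # Restore the pre-attempt state before taking the fallback path.
        removed_buffers.clear()
        removed_buffers.update(snapshot)
        return plain_codegen(removed_buffers)


def failing_fused(removed):
    removed.add("buf0")  # localizing the buffer marks the global one as removed
    raise RuntimeError("loop nests with different steps cannot be merged")


def plain(removed):
    return sorted(removed)  # report which buffers are still marked as removed


result = codegen_with_fallback(set(), failing_fused, plain)
```

Without the restore step, the fallback path would still see `buf0` as removed and skip a buffer it actually needs.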

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_outer_loop_fusion_buffer_remove
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144243
Approved by: https://github.com/jgong5
2025-01-07 01:17:46 +00:00
2f6f13562f [BE] Actually suppress vmap warning from gradcheck (#144287)
This is the much safer change compared to https://github.com/pytorch/pytorch/pull/144283

Before:
```
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/test_optim.py -k TestDifferentiableOptimizer.test_sgd
/data/users/janeyx/pytorch/torch/autograd/gradcheck.py:1156: FutureWarning: Please use torch.vmap instead of torch._vmap_internals.vmap.
  result = vmap(vjp)(torch.stack(grad_outputs))
/data/users/janeyx/pytorch/torch/autograd/gradcheck.py:1156: FutureWarning: Please use torch.vmap instead of torch._vmap_internals.vmap.
  result = vmap(vjp)(torch.stack(grad_outputs))
.
----------------------------------------------------------------------
Ran 1 test in 0.028s
```

(the env vars aren't necessary)

After:
```
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/test_optim.py -k TestDifferentiableOptimizer.test_sgd
.
----------------------------------------------------------------------
Ran 1 test in 0.028s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144287
Approved by: https://github.com/cyyever, https://github.com/soulitzer
2025-01-07 01:11:41 +00:00
61c0a3d1cb Fix lint in test_provenance_tracing.py (#144296)
Regression introduced by https://github.com/pytorch/pytorch/pull/143684/ that somehow did not surface on PR CI

IMO this also makes the two branches of the test (compile vs aoti) more readable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144296
Approved by: https://github.com/xw285cornell, https://github.com/huydhn
2025-01-07 01:11:38 +00:00
48153c72b2 [Intel XPU] enable kineto for XPU Windows. (#144034)
This PR will turn on `kineto` for the Windows XPU wheel build.

For `kineto` on Windows XPU, the build-time dependencies are:
1. Intel PTI, which is included in oneAPI 2025+.
2. Level zero SDK: https://github.com/oneapi-src/level-zero/releases/download/v1.14.0/level-zero-sdk_1.14.0.zip

**Note:**
We need to manually set up the level zero SDK at build time, so the kineto build on Windows XPU is turned off by default, to avoid build issues for developers.
After adding the level zero SDK include path to the `INCLUDE` env_var, we can set the env_var `XPU_ENABLE_KINETO` to turn it on.

For runtime dependency:
1. Intel-pti PyPI package. @chuanqi129 will follow up in a further PR.

Locally tested the nightly binary:

<img width="1909" alt="image" src="https://github.com/user-attachments/assets/7dfaa7bc-e8ed-40b8-bc71-f91a3df3b95f" />

TODO: @chuanqi129 will submit a follow-up PR to add `intel-pti` as a dependency and turn on the env_var `XPU_ENABLE_KINETO` for the nightly build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144034
Approved by: https://github.com/chuanqi129, https://github.com/zejun-chen, https://github.com/EikanWang, https://github.com/sraikund16
2025-01-07 01:11:25 +00:00
0742b2366e Update torch.masked.mean to upcast dtype for bool tensors (#139999)
When calling `torch.masked.mean(...)` with a boolean tensor, the dtype is inferred to be bool. When the mean is being computed, the sum operator is used. When the sum operator is used with dtype=torch.bool, the result is clamped to True (1) leading to an incorrect mean being calculated.

The below example shows how the incorrect result occurs:
```
a = torch.tensor([True, True])
count = torch.sum(torch.ones(a.shape, dtype=torch.int64)) # 2
total = torch.sum(a, dtype=torch.bool) # True (1)
mean = total / count # 0.5
```

This PR upcasts the dtype used for the summation to int32 in the case of bool tensors, allowing the correct result to be computed.
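
The effect of the upcast can be mimicked in plain Python, as a sketch of the arithmetic only (no tensors involved; the variable names are illustrative):

```python
values = [True, True]
count = len(values)

# A bool-typed sum saturates at True (1), which is what any() models here.
clamped_total = int(any(values))
buggy_mean = clamped_total / count

# Upcasting each element to an integer first lets the sum actually count.
upcast_total = sum(int(v) for v in values)
fixed_mean = upcast_total / count
```

`buggy_mean` comes out as 0.5, while `fixed_mean` is the expected 1.0.
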
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139999
Approved by: https://github.com/cpuhrsch
2025-01-07 00:26:59 +00:00
f013cfee38 [TreeSpec] Support enum in defaultdict (#144235)
Summary: Followup from D66269157, add support for enum in defaultdict.

Test Plan: Added unit test

Differential Revision: D67832100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144235
Approved by: https://github.com/henrylhtsang, https://github.com/houseroad
2025-01-07 00:10:46 +00:00
c68c38c673 Support getattr for tensor subclasses in pre-dispatch export via patching tensor.getattr (#143946)
Previous discussion: https://github.com/pytorch/pytorch/pull/143671#issuecomment-2560112499 and https://github.com/pytorch/pytorch/pull/143671

Differential Revision: [D67693609](https://our.internmc.facebook.com/intern/diff/D67693609)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143946
Approved by: https://github.com/bdhirsh
2025-01-06 23:55:50 +00:00
66059f80d2 Migrate from Tuple -> tuple in torch/profiler (#144257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144257
Approved by: https://github.com/sraikund16
2025-01-06 23:34:14 +00:00
5ccbfffd11 update expected results (#144274)
PR f6488d85a0 made it +1.3%, which is < 1.5%.
Once we have the API from dev infra and change the test, this won't happen.

<img width="364" alt="Screenshot 2025-01-06 at 11 01 15 AM" src="https://github.com/user-attachments/assets/401b2d11-e400-49d6-b6f9-8e10ca141cb0" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144274
Approved by: https://github.com/oulgen, https://github.com/anijain2305
2025-01-06 23:18:21 +00:00
f879a6982d Enhance provenance tracing unit test to cover torch.compile() (#143684)
Summary: Follow up as title.

Test Plan:
```
buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_triton_kernel_to_post_grad_tracing
```

Differential Revision: D67543556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143684
Approved by: https://github.com/yushangdi
2025-01-06 22:58:04 +00:00
301b9c8a90 Fix PythonMod printing (#144078)
Fixes #144075
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144078
Approved by: https://github.com/anijain2305
2025-01-06 22:52:34 +00:00
edbda2fad8 remove allow-untyped-defs from torch/export/_remove_auto_functionalized_pass.py (#144230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144230
Approved by: https://github.com/Skylion007
2025-01-06 22:23:19 +00:00
d75ffccd0a Migrate from Tuple -> tuple in torch/_export (#144262)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144262
Approved by: https://github.com/avikchaudhuri
2025-01-06 22:20:26 +00:00
00c18c8882 Make all-reduce input contiguous in distributed.nn.all_reduce (#144267)
Fixes https://github.com/pytorch/pytorch/issues/144060

I confirmed that the unit test fails without the `.contiguous()` fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144267
Approved by: https://github.com/wz337, https://github.com/Skylion007, https://github.com/fduwjj
2025-01-06 22:20:04 +00:00
16c1b1048b [MPSInductor] Add nan constant generation (#144281)
If val is not equal to itself, it's a NaN (which is spelled as `NAN` in Metal)
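
The self-inequality check mentioned here can be sketched in plain Python. This is only an illustration of the idea, not the code from the PR: `metal_constant` is a hypothetical helper name, and passing other values through `repr` is an assumption made to keep the sketch short.

```python
import math


def is_nan(val: float) -> bool:
    # NaN is the only float value that compares unequal to itself.
    return val != val


def metal_constant(val: float) -> str:
    """Hypothetical helper: render a Python float as a constant string."""
    if is_nan(val):
        return "NAN"  # spelling of NaN in Metal, per the commit message
    # Assumption: other values are passed through repr() for illustration only.
    return repr(val)


# The self-inequality test agrees with math.isnan on ordinary inputs.
samples = [float("nan"), 0.0, 1.5, float("inf")]
agrees = all(is_nan(x) == math.isnan(x) for x in samples)
```

The check needs no import and works for any IEEE 754 float, which is presumably why it is convenient inside a code generator.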

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144281
Approved by: https://github.com/atalman, https://github.com/dcci
2025-01-06 22:13:23 +00:00
7d5249dbc2 [EZ][BE] Fix E226 flake8 violation (#144282)
Not sure why CI did not complain about it, but in my local runs it clearly says
```
Advice (FLAKE8) E226
    missing whitespace around arithmetic operator
    See https://www.flake8rules.com/rules/E226.html

        268  |            with code.indent():
        269  |                if len(idx_var_names) > 1:
        270  |                    for idx, name in enumerate(idx_var_names):
    >>> 271  |                        code.writeline(f"auto {name} = thread_pos.{chr(120+idx)};")
```
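
For reference, a small sketch of what the flagged expression computes, written with the spacing E226 asks for. The variable names in `names` are made up for the example; only the `chr(120 + idx)` expression comes from the lint output above.

```python
def thread_pos_component(idx: int) -> str:
    # 120 is ord("x"), so indices 0, 1, 2 map to "x", "y", "z".
    return chr(120 + idx)


# Hypothetical index variable names, used only for this illustration.
names = ["i0", "i1", "i2"]
lines = [
    f"auto {name} = thread_pos.{thread_pos_component(idx)};"
    for idx, name in enumerate(names)
]
```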

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144282
Approved by: https://github.com/Skylion007
2025-01-06 22:12:21 +00:00
5d88002af6 [inductor] Avoid specializing over symbolic value during constant folding (#144176)
Fixes #143667. See more context in the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144176
Approved by: https://github.com/jansel, https://github.com/eellison
2025-01-06 21:50:17 +00:00
729b7c0a84 [TGIF][Easy] Slightly improve the logging for tgif split pass (#143771)
Summary:
1. Added more details for some of the assert statements.
2. Moved assert statements to use tgif_assert

Test Plan: all unit tests should pass

Reviewed By: jingsh

Differential Revision: D67608251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143771
Approved by: https://github.com/jingsh
2025-01-06 21:00:15 +00:00
b5cf8e2460 [BE]: Remove redundant copy in torch chunk shard (#144269)
Fixes an issue noticed in recent all_gather PR. Some parts of the codebase have a double copy with `clone().contiguous()` which could be fused into a single copy op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144269
Approved by: https://github.com/awgu
2025-01-06 20:52:49 +00:00
1b8a943011 remove allow-untyped-defs from ao/nn/sparse/quantized/utils.py (#144232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144232
Approved by: https://github.com/Skylion007
2025-01-06 19:54:27 +00:00
6d445bef0c [ROCm][NFC] Fix condition for small tensor tuning (#144087)
Fix condition for small tensor tuning to not impact non-ROCm compilation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144087
Approved by: https://github.com/jeffdaily
2025-01-06 19:40:22 +00:00
c62873a09a Fix incorrect python expression (#143675)
Summary:
This expression would return True always, causing the input to be deleted
on error, even for non-write modes:

```
>>> bool("w" or "+" or "a" in "rb")
True
```
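
The expression is always true because `in` binds tighter than `or`, so it parses as `"w" or "+" or ("a" in "rb")` and short-circuits on the truthy string `"w"`. A minimal sketch of the pitfall and one way to write the intended check follows; the function names are illustrative, and the actual fix in the PR may be written differently.

```python
def buggy_is_write_mode(mode: str) -> bool:
    # Parsed as: "w" or "+" or ("a" in mode). "w" is truthy, so this is always True.
    return bool("w" or "+" or "a" in mode)


def is_write_mode(mode: str) -> bool:
    # Test each flag against the mode string individually.
    return any(flag in mode for flag in ("w", "+", "a"))
```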
Test Plan: new test in test_fsspec.py

Differential Revision: D67537234

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143675
Approved by: https://github.com/mayankgarg1990, https://github.com/huydhn
2025-01-06 19:04:26 +00:00
e3aac7f8a0 detect fake mode in proxy_tensor creation in make_fx (#144168)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/143742

A FakeTensorMode may already exist when we are setting the "val" meta of a proxy tensor. We should detect existing FakeTensorMode before creating a new one.

Otherwise, we could cause an error when using `detect_fake_mode` later, because there are now multiple FakeTensorModes existing.

Test Plan: The error in https://github.com/pytorch/pytorch/issues/143742

Differential Revision: D67813111

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144168
Approved by: https://github.com/BoyuanFeng, https://github.com/tugsbayasgalan
2025-01-06 19:02:08 +00:00
e56768f030 [MPS] Fix bitwise shifts for uint8 (#144251)
Previously all bitwise operations were aliased to the same type, but this is wrong for shift ops

Rather than building overly complex logic, let's just instantiate using the shared `scalarToMetalTypeString` helper function

Fixes https://github.com/pytorch/pytorch/issues/144190
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144251
Approved by: https://github.com/Skylion007
ghstack dependencies: #144249, #144250
2025-01-06 18:27:16 +00:00
aa14fcd96c Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit e141cb9c34e5e96ca47ea69b565bc4fd9c8f34c1.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/clee2000 due to still failing internally D67556174, see D67866123 for link to error ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2573652459))
2025-01-06 18:15:52 +00:00
ebeb433e73 [BE] Fix + parametrize test_min_max_nan_propagation (#144250)
- `dtype` was not passed as argument to `torch.rand` before
- Condition bfloat16 testing on MacOS14+
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144250
Approved by: https://github.com/Skylion007
ghstack dependencies: #144249
2025-01-06 17:49:41 +00:00
11a0663eeb [BE] Parametrize test_min_max (#144249)
It's better to have one unit test per dtype rather than a combined one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144249
Approved by: https://github.com/Skylion007
2025-01-06 17:49:41 +00:00
d65a50ef34 Fix subclass unwrapping bug (#143945)
I noticed a small bug in tensor subclass unwrapping logic. cc @IvanKobzarev
It seems easier if we just implement it recursively so that it is easier to track the inner attrs to corresponding plain tensors and both aot_autograd and fake_tensor implement subclass unwrapping recursively.

Differential Revision: [D67693610](https://our.internmc.facebook.com/intern/diff/D67693610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143945
Approved by: https://github.com/IvanKobzarev
2025-01-06 17:38:19 +00:00
5c783bf410 [BE][Ez]: Update CUDNN Frontend submodule to 1.9.0 (#144200)
* Update CUDNN Frontend to 1.9.0, which includes some API improvements, new features, and bugfixes. This is a header-only lib fix, so it should be pretty straightforward.
* Nicest feature is that it now logs / prints warnings when the CUDNN compiled version does not match the dynamically loaded one
* Fixes corrupted / truncated log lines from being printed by CUDNN Frontend
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144200
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-01-06 17:33:38 +00:00
c8713e659a fix memleak, detach instead of clone to not drag around graph (#144154)
Thanks @clee2000 for bringing the memleak to my attention: https://github.com/pytorch/pytorch/actions/runs/12549765082/job/34996244798.

This memleak in the test was caused by the differentiable flavors. Because we had param.clone() and param persisted outside the for loop, the autograd graph would continue growing for each optimizer.step instead of being deleted after the optim input was used up.

To clarify, I had still expected (and still do expect) the test to fully clean everything up once the test is over, but I didn't get the chance to look into why that's not the case. This change would preliminarily unblock this particular test from failing the memleak CI.

Use detach instead of clone, which is...cheaper anyway :D since a detach I've learned from @soulitzer is a view with requires_grad=False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144154
Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/huydhn, https://github.com/albanD
2025-01-06 17:09:00 +00:00
e222dd5d25 Rewrite _reparametrize_module to use contextmanager (#138203)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138203
Approved by: https://github.com/zou3519
ghstack dependencies: #136033, #140604
2025-01-06 16:56:22 +00:00
4c8d661348 Set enable_trace_contextlib_contextmanager flag to True (#140604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140604
Approved by: https://github.com/zou3519
ghstack dependencies: #136033
2025-01-06 16:56:22 +00:00
defbf0d339 [DTensor] Add strategy for _scaled_mm (#143760)
This is done by copying the one for a regular mm, and enforcing that the scales have the same sharding scheme as their respective operands. This works because scales are 2-d tensors that must "broadcast" to the operands. This broadcasting is trivial when scales have dimensions of 1 or N, which are the only options we currently support.

Note, however, that after this PR scales will be allowed to have the mesh's world size as a dimension (in certain cases). This works because, when mapped to the local shard, it becomes a dimension of 1, which can be handled by the operator. Note that when using row-wise _scaled_mm for tensor (sequence) parallelism, this situation arises naturally!
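
The shape argument in the note above can be checked with a little arithmetic. The sketch below is not DTensor code: `local_shape` and `scale_broadcasts` are hypothetical helpers, and even sharding along one dimension is an assumption.

```python
def local_shape(global_shape, shard_dim, world_size):
    """Shape of one rank's shard when shard_dim is split evenly (assumed)."""
    size = global_shape[shard_dim]
    if size % world_size != 0:
        raise ValueError("this sketch only handles even sharding")
    local = list(global_shape)
    local[shard_dim] = size // world_size
    return tuple(local)


def scale_broadcasts(scale_shape, operand_shape):
    # A scale dimension must be 1 or match the operand dimension.
    return all(s in (1, n) for s, n in zip(scale_shape, operand_shape))


world_size = 4
operand = (4 * 16, 32)   # global operand, sharded on dim 0
scale = (world_size, 1)  # one scale row per rank

local_operand = local_shape(operand, 0, world_size)
local_scale = local_shape(scale, 0, world_size)
```

Globally the `(4, 1)` scale does not broadcast against the `(64, 32)` operand, but each rank sees a `(1, 1)` scale next to a `(16, 32)` shard, which does.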

Because of these specificities, the test is rather complex, as it specifically tests all these behaviors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143760
Approved by: https://github.com/tianyu-l
2025-01-06 16:35:47 +00:00
d4609af1ca Propagate callable parameter types using ParamSpec (#142306) (#144047)
Fixes #142306

This PR includes typing improvements and refactoring for the following files:
- __init__.py
- decorators.py
- _ops.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144047
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
2025-01-06 16:16:18 +00:00
cyy
9225f149eb Enable clang-analyzer checks of Clang-tidy (#144222)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144222
Approved by: https://github.com/Skylion007
2025-01-06 15:44:45 +00:00
bba672e117 [docs/export] update dynamic_shapes docs (#142510)
https://pytorch.org/docs/stable/export.html dynamic_shapes section formatting is messed up, fix & update documentation to be more user-friendly.

Happy accepting nits :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142510
Approved by: https://github.com/yushangdi
2025-01-06 14:12:34 +00:00
d85ae4be73 Update slow tests (#144236)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144236
Approved by: https://github.com/pytorchbot
2025-01-06 11:19:09 +00:00
a8e97d9d4d fix torch.acos and torch.asin for torch.complex datatypes on CPU (#134838)
Fix https://github.com/pytorch/pytorch/issues/134487, https://github.com/pytorch/pytorch/issues/138327.

These two issues are caused by the lack of special handling of the case where the real or imaginary part is 0/Inf/NaN in the vectorized implementation of `asin`. For correctness, I temporarily fall back to the scalar implementation of `asin`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134838
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
2025-01-06 06:17:39 +00:00
e1622dca7a Fix duplicate pattern error (#139321)
vllm hit an error because we incorrectly reported two patterns as duplicates. See the inline comment:

For a particular generated pattern repr, store all the equivalent graphs that were used to generate it. Because we ignore certain patterns in searching, but not in matching, use the graph to distinguish whether two equivalent searches are actually different.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139321
Approved by: https://github.com/shunting314
2025-01-06 05:04:59 +00:00
cb5fa17e44 Revert "[ca] add test_dtensor_compile.py to compiled autograd tests (#144107)"
This reverts commit 67f85ccdcf56894d653b4d37cd7651eefa0ddf8c.

Reverted https://github.com/pytorch/pytorch/pull/144107 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the failure looks legit ([comment](https://github.com/pytorch/pytorch/pull/144107#issuecomment-2572209717))
2025-01-06 03:30:22 +00:00
c9ef98478a [mps/BE] Enable a test that now passes. (#144198)
After the implementation of floordiv in 464b50dbd7 landed, this now passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144198
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-01-06 03:14:21 +00:00
23e2953cd3 [mps/inductor] Add support for floor(). (#144195)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144195
Approved by: https://github.com/jansel
2025-01-06 02:07:17 +00:00
d71f111109 [Inductor][CPP] Fix Inductor integer avg pool (#144059)
Fixes #143738. Currently the scale factor for averaging is rounded to 0 if dtype is an integer, resulting in all-zero output. This fix uses `truediv` instead for integer cases.
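
The rounding problem can be reproduced with plain Python integers. The sketch below is not the Inductor code: the function names are illustrative, windows are assumed to be non-overlapping, and truncating the final result back to an integer is an assumption about the expected output dtype.

```python
def avg_pool1d_int_buggy(values, kernel_size):
    # The reciprocal scale is computed in integer arithmetic and becomes 0.
    scale = 1 // kernel_size
    return [
        sum(values[i:i + kernel_size]) * scale
        for i in range(0, len(values) - kernel_size + 1, kernel_size)
    ]


def avg_pool1d_int_fixed(values, kernel_size):
    # Divide the window sum with true division, then convert back to int.
    return [
        int(sum(values[i:i + kernel_size]) / kernel_size)
        for i in range(0, len(values) - kernel_size + 1, kernel_size)
    ]


data = [2, 4, 6, 8]
buggy = avg_pool1d_int_buggy(data, 2)
fixed = avg_pool1d_int_fixed(data, 2)
```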

## Test
```bash
pytest -vs ./test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_nn_functional_avg_pool1d_cpu_int64
pytest -vs ./test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_nn_functional_avg_pool2d_cpu_int64
pytest -vs ./test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_nn_functional_avg_pool3d_cpu_int64
pytest -vs ./test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_nn_functional_local_response_norm_cpu_int64
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144059
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2025-01-06 01:26:53 +00:00
3d3a07963f [reland][attempt2][AMD] Turn on TF32 for aten::mm (#144145)
Summary:
https://github.com/pytorch/pytorch/pull/143549 was reverted due to some
internal/oss tooling issue. Relanding.

hipblaslt supports TF32, so adding the support.
Original PR https://github.com/pytorch/pytorch/pull/139869

Test Plan: CI

Differential Revision: D67785496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144145
Approved by: https://github.com/jianyuh
2025-01-06 00:37:01 +00:00
9f94710e48 Update core.py to fix typo (#144201)
dype -> dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144201
Approved by: https://github.com/Skylion007
2025-01-05 18:20:52 +00:00
51a37a42e0 [inductor][cpu] Fix bmm b_index for dynamic expressions in inductor autotuner (#143141)
Fixes #143102

Addresses 2 problems relating to dynamic batch size in BMM autotuner:
1. With dynamic batch size, the input can be a sympy `Mul` expression such as `s0*8`, which occurs in many dynamo benchmark models. We address this by using `size_hints` to resolve any such expressions. This is safe since this section of the code is only called to generate inputs for benchmarking.
2. Some epilogue nodes may use the dynamic batch size as part of the codegen, for example when an input to the epilogue node is transposed and has dynamic batch size in the stride of other dimensions. When these epilogue nodes exist, if the sizevar is not already present in the `kernel.args`, it will create a new sizevar with a name. It is possible that subsequent calls to `def_kernel` could overwrite this variable name, so to avoid this we pass all the sizevars as `extra_sizevars` to the calls to `def_kernel` for the GEMM functions, so no variable renaming happens later in the BMM definition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143141
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel, https://github.com/jgong5
2025-01-05 18:02:37 +00:00
f6488d85a0 [dynamo][user-defined] Remove __getattribute__ checks and add getsetdescriptor (#144173)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144173
Approved by: https://github.com/jansel
2025-01-05 13:48:15 +00:00
b01556bd8a Revert "[dynamo][dicts] Guarding lazily on dict keys (#143997)"
This reverts commit f5df082fabfe81639e25b8e01dae107548389c5e.

Reverted https://github.com/pytorch/pytorch/pull/143997 on behalf of https://github.com/jeanschmidt due to Seems to have introduced internal ci redness in some tests, D67828366 ([comment](https://github.com/pytorch/pytorch/pull/143997#issuecomment-2571587599))
2025-01-05 11:09:45 +00:00
1e881ceecf Update torch-xpu-ops commit pin (#143984)
Update the torch-xpu-ops commit to [28cfac20ec662abdb0ac98faf122450013e8f520](28cfac20ec), includes:

- Disable batch_norm vectorization path to fix accuracy issues.
- Fix the LSTM/RNN implementation error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143984
Approved by: https://github.com/EikanWang, https://github.com/ruidazeng, https://github.com/desertfire, https://github.com/jansel
2025-01-05 09:01:36 +00:00
157c185afe [inductor] Add types to compile_tasks.py and runtime_utils.py (#144004)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144004
Approved by: https://github.com/yanboliang
2025-01-05 08:47:49 +00:00
67f85ccdcf [ca] add test_dtensor_compile.py to compiled autograd tests (#144107)
more than half the tests use autograd, pass rate 19/26

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144107
Approved by: https://github.com/zou3519, https://github.com/bdhirsh, https://github.com/jansel
2025-01-05 02:11:48 +00:00
f2d6cfa677 Introduce CompileEventLogger, replace usages of metrics_context and chromium_event with it (#143420)
**Problem statement**: I want to be able to centralize and simplify the process by which people add columns/data to existing spans. We have MetricsContext and ChromiumEventLogger, and there are various choices you can make to decide where and when to log different levels of observability for your events. To resolve this, I want a central API for "adding to events under dynamo_timed".

**CompileEventLogger** is intended as a frontend for MetricsContext and ChromiumEventLogger so we can use the same class for handling everything.

CompileEventLogger is intended to be used within a `dynamo_timed()` context. Its purpose is to 1. log to existing events that are in progress (i.e. within dynamo_timed), and 2. log instant events to chromium that are independent of any specific span.

CompileEventLogger has three log levels:

- CHROMIUM: Log only to chromium events, visible via tlparse.
- PT2_COMPILE: Log to chromium_events + pt2_compile_events
- COMPILATION_METRIC: Log to compilation metrics in addition to the toplevel chromium and pt2_compile_event.

In addition, we have a function CompileEventLogger.add() that automagically chooses the correct log level. For now, it is conservative, and will never automagically choose to log CompilationMetrics (though I could imagine it figuring out the metadata are all keys in CompilationMetric and therefore loggable there).

The goal here is to make one single interface to log stuff for observability reasons, and make it as easy as possible.

Not included in this diff:
- V1 of this diff will not have implementations of `increment` and `add_to_set` which MetricsContext has, so those usages are not replaced yet. But I'll add those in a followup.

- We don't handle `RuntimeMetricsContext`. It's unclear if I want that to be part of this, because under RuntimeMetricsContext there might not be a toplevel event to log to, so chromium events don't make sense in that context. So I might leave that separate for now.

Differential Revision: [D67346203](https://our.internmc.facebook.com/intern/diff/D67346203/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143420
Approved by: https://github.com/aorenste
2025-01-04 22:40:34 +00:00
68d30c6a25 Add check for unsupported sparse layout to resolve false INTERNAL ASSERT FAILED (#139198)
Fixes #131319. Implemented the check on layout as described in the original issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139198
Approved by: https://github.com/pearu, https://github.com/amjames, https://github.com/cpuhrsch

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Pearu Peterson <pearu.peterson@gmail.com>
2025-01-04 21:40:36 +00:00
b1bc880f26 [EZ][BE] Cleanup test_mps_basic (#144194)
- Sort imported tests alphabetically
- Run `add` tests with `check_lowp=False` as it is tested explicitly by parametrization
- Do not hardcode device, but rather use `self.device` property

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144194
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-01-04 21:36:40 +00:00
0dc1e6be19 [mps/BE] Fix linter warning/advice. (#144199)
Two spaces before an inline comment according to E261.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144199
Approved by: https://github.com/Skylion007, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-04 20:15:41 +00:00
e458b39fc4 c10::string_view -> std::string_view in Device.cpp (#144178)
Test Plan: Sandcastle

Differential Revision: D67817163

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144178
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-01-04 18:51:33 +00:00
811c714911 Fix nan propagation for minimum() and maximum() in MPS (#144086)
Fixes #143976

- Moves minimum and maximum operations to use the NaN propagating call into MPSGraph instead of the default one.
 - Adds test for the NaN propagating case to `test_mps.py`.
- Adjusts the inductor metal backend implementation for minimum and maximum to also respect the nan propagation.

Additions by @malfet:
 - Introduce MPSGraph+PyTorchFixups interface following [Customizing existing classes](https://developer.apple.com/library/archive/documentation/Cocoa/Conceptual/ProgrammingWithObjectiveC/CustomizingExistingClasses/CustomizingExistingClasses.html) tutorial and implement `minimumWithNaNPropagationAndIntFallbackWithPrimaryTensor:` as `minimumWithNaNPropagationWithPrimaryTensor:` segfaults when called for integral types

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144086
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2025-01-04 18:48:24 +00:00
60de73c3c7 Update nightly PyTorch version to 2.7.0
Same as https://github.com/pytorch/pytorch/pull/135916
2025-01-04 13:24:48 -05:00
f5df082fab [dynamo][dicts] Guarding lazily on dict keys (#143997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143997
Approved by: https://github.com/jansel
ghstack dependencies: #144129, #144130, #144141, #144158, #144163, #144160
2025-01-04 18:13:00 +00:00
005a4b9537 [Submodule] Bump Cutlass to 3.5.1 OSS PR (#144000)
## Summary
Follow-up PR to https://github.com/pytorch/pytorch/pull/143515. That PR added a bunch of macro switches to ensure both 3.4 and 3.5.1 built successfully. This PR actually bumps the cutlass pin to 3.5.1.

I am going to do a stack on top to add conditional gates for 3.6, hijacking the 3.4 switches. We will leapfrog our way to the top :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144000
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/malfet
2025-01-04 18:04:03 +00:00
93633d0e80 [ROCm][Windows] Fix export macros (#144098)
For correct import and export of functions when dynamic linkage is used for HIP libraries on Windows, the appropriate export/import macros need to be put in place. This Pull Request utilizes existing CUDA import/export macros by converting them to corresponding HIP macros during the hipification process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144098
Approved by: https://github.com/jeffdaily
2025-01-04 17:12:46 +00:00
45ef3309e3 [BE] typing for decorators (#144161)
Summary:
Untyped decorators strip annotations from the decorated items.

- _compile
- _inductor/fx_passes/post_grad
- _inductor/lowering
- _library/custom_ops
- _meta_registrations
- _ops
- _refs/nn/functional
- ao/quantization/quantizer/xnnpack_quantizer_utils
- distributed/_composable/contract
- fx/experimental/graph_gradual_typechecker
- fx/experimental/migrate_gradual_types/constraint_generator
- optim/optimizer
- signal/windows/windows
- testing/_internal/common_device_type
- torch/_inductor/decomposition
- utils/flop_counter

Test Plan: unit tests

Differential Revision: D62302684

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144161
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-01-04 16:40:09 +00:00
79cbda3ab0 [ROCm] Get rid of extra rpath-link that was needed for libtinfo. (#143348)
Fixes #137858

Due to the extra rpath-link line inserted by these CMake lines, it is possible to unintentionally pick up other libraries that are incompatible with the version of ROCm in ${ROCM_PATH}.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143348
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily, https://github.com/pruthvistony
2025-01-04 15:42:30 +00:00
6f2451c2e9 [MPS] Add aten::angle (#143449)
This adds an MPS backend implementation for `aten::angle` and `aten::angle_out` (mentioned in issue #77764), following the example #78408.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143449
Approved by: https://github.com/malfet
2025-01-04 15:38:40 +00:00
301c457032 [MPS] Fix nllnd_loss_backward crash with different dtypes (#144170)
Otherwise, invoking with torch.half inputs, but float weights will result in
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.divide' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %16 = "mps.divide"(%15, %arg2) : (tensor<5x5xf16>, tensor<1xf32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.divide' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %16 = "mps.divide"(%15, %arg2) : (tensor<5x5xf16>, tensor<1xf32>) -> tensor<*xf32>
2025-01-03 14:13:18.747151-0800 python[87772:4027380] /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm, line 975: error 'original module failed verification'
/AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:975: failed assertion `original module failed verification'
```

Test plan: `python -mpytest test/inductor/test_torchinductor.py -k test_nll_loss_backward_mps` should not crash
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144170
Approved by: https://github.com/kit1980, https://github.com/Skylion007
ghstack dependencies: #144167, #144162, #144083, #144084
2025-01-04 15:24:55 +00:00
99f2491af9 Revert "Use absolute path path.resolve() -> path.absolute() (#129409)"
This reverts commit 45411d1fc9a2b6d2f891b6ab0ae16409719e09fc.

Reverted https://github.com/pytorch/pytorch/pull/129409 on behalf of https://github.com/jeanschmidt due to Breaking internal CI, @albanD please help get this PR merged ([comment](https://github.com/pytorch/pytorch/pull/129409#issuecomment-2571316444))
2025-01-04 14:17:20 +00:00
cyy
df458be4e5 [4/N] Apply py39 ruff and pyupgrade fixes (#143257)
`torch/fx/passes/annotate_getitem_nodes.py` was changed to support the new type hinting annotations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143257
Approved by: https://github.com/justinchuby, https://github.com/albanD
2025-01-04 10:47:51 +00:00
a881954b0c [PTD] Dump rcclexp proxy trace in pytorch (#143678)
Summary:
Dump the active proxyOp status per rank and per communicator when WatchDog timeout or aborts.

Added
`#if defined(USE_ROCM) && defined(NCCL_COMM_DUMP)` guard in the print function, so only rcclexp users will see this dump in console.

These are the PTD-side changes.

Test Plan:
Job with A2A hang due to receiver failing to post receive operations https://fburl.com/mlhub/95vg12r3
 {F1971449692}

Reviewed By: c-p-i-o

Differential Revision: D67036093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143678
Approved by: https://github.com/c-p-i-o
2025-01-04 10:20:47 +00:00
aa7d01ea22 Use sccache 0.9.0 on ROCm build job (#144125)
TSIA, sccache 0.9.0 seems to work fine with ROCm build job

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144125
Approved by: https://github.com/jithunnair-amd, https://github.com/wdvr, https://github.com/jeffdaily
2025-01-04 08:56:48 +00:00
636a2c7e0f [Inductor][lowering] support out_dtype for dequant lowering (#143845)
In lowering, support the parameter `out_dtype` for `dequant_per_tensor` and `dequant_per_channel`.

Fix the following runtime error issue found in https://github.com/pytorch/ao/pull/1372:

```
File "/home/liaoxuan/pytorch_ao/torch/_inductor/lowering.py", line 452, in wrapped
    out = decomp_fn(*args, **kwargs)
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised:
LoweringException: TypeError: quantized_decomposed_dequantize_per_tensor_default() got an unexpected keyword argument 'out_dtype'
  target: quantized_decomposed.dequantize_per_tensor.default
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg0_1', layout=FixedLayout('cpu', torch.uint8, size=[1, 7, 7, 9], stride=[441, 63, 9, 1]))
  ))
  args[1]: 0.01
  args[2]: 100
  args[3]: 0
  args[4]: 255
  args[5]: torch.uint8
  kwargs: {'out_dtype': torch.bfloat16}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143845
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-01-04 08:48:41 +00:00
417d9c3522 [Inductor/Triton] Upcast FP16/BF16 math reductions to FP32 (#141052)
Summary:
The Triton compiler does not automatically promote fp16/bf16 reductions to fp32 accumulation. This can result in significant accuracy issues.

This diff will upcast the input to FP32 for all math reductions `["welford_reduce", "welford_combine", "prod", "sum", "xor_sum"]`
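
The accuracy loss from low-precision accumulation can be shown without Triton. The sketch below emulates fp16 rounding with the standard `struct` module's `"e"` format; it is only an illustration of the numerical effect, not the Inductor change, and the wide accumulator here is a Python float (double) standing in for fp32.

```python
import struct


def to_fp16(x: float) -> float:
    # Round a Python float to the nearest IEEE 754 half-precision value.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sum_fp16_accumulator(values):
    # Every partial sum is rounded back to fp16, as in a low-precision reduction.
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc


def sum_wide_accumulator(values):
    # Inputs are fp16, but partial sums are kept in a wider type.
    acc = 0.0
    for v in values:
        acc += to_fp16(v)
    return to_fp16(acc)  # cast back to fp16 only once, at the end


ones = [1.0] * 4096
low = sum_fp16_accumulator(ones)
wide = sum_wide_accumulator(ones)
```

With fp16 partial sums the total stalls at 2048, because 2049 is not representable in half precision and rounds back down; the wide accumulator reaches the exact value 4096.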

Test Plan:
CI
```
python test/inductor/test_torchinductor.py TritonCodeGenTests.test_low_precision_reduction
```

Differential Revision: D65965032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141052
Approved by: https://github.com/blaine-rister
2025-01-04 07:57:10 +00:00
816328fa51 [dynamo][lazy] LazyVT utils to get original value/source and is_hashable (#144160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144160
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #144129, #144130, #144141, #144158, #144163
2025-01-04 06:23:05 +00:00
b5b1e9456a [MPSInductor] Add masked implementation (#144084)
More or less borrowed from
22580f160e/torch/_inductor/codegen/halide.py (L549-L563)

`pytest test/inductor/test_torchinductor.py -k _mps` score is 408 failed, 347 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144084
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #144167, #144162, #144083
2025-01-04 04:30:07 +00:00
f15af077fb Fix get_source_partitions when weights are tied (#142446)
Summary:
Fix https://github.com/pytorch/pytorch/issues/142035 and  https://github.com/pytorch/pytorch/issues/143621

When Linear module params are tied to another parameter, like this:

```
class SimpleLinearModel(nn.Module):
    def __init__(self, input_size, output_size):
        super(SimpleLinearModel, self).__init__()
        # Define a linear layer
        self.linear = nn.Linear(input_size, output_size)
        self.tied_weight = self.linear.weight

    def forward(self, x):
        # Forward pass through the linear layer
        b = self.tied_weight + 1
        return self.linear(x), b
```

We get a graph like below:

```
graph():
    %p_tied_weight : [num_users=0] = placeholder[target=p_tied_weight]
    %p_linear_weight : [num_users=2] = placeholder[target=p_linear_weight]
    %p_linear_bias : [num_users=1] = placeholder[target=p_linear_bias]
    %x : [num_users=1] = placeholder[target=x]
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%p_linear_weight, 1), kwargs = {})
    %linear : [num_users=1] = call_function[target=torch.ops.aten.linear.default](args = (%x, %p_linear_weight, %p_linear_bias), kwargs = {})
    return (linear, add)
```

Notice that ` %p_linear_weight : [num_users=2]`.

When we get source partitions, we should exclude attribute nodes like `p_linear_weight` from the outputs.
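The rule above can be sketched with a simplified, torch-free model of a graph. This is an assumption-laden illustration, not the `torch.fx` implementation: the `Node` class and `partition_outputs` function are hypothetical, and users are tracked by name only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    op: str  # "placeholder", "get_attr", or "call_function"
    users: list = field(default_factory=list)  # names of consuming nodes


def partition_outputs(partition):
    """Return names of partition nodes that feed nodes outside the partition.

    Parameter-like nodes (placeholders / get_attr) are skipped, so a tied
    weight used elsewhere in the graph is not reported as an output.
    """
    inside = {node.name for node in partition}
    outputs = []
    for node in partition:
        if node.op in ("placeholder", "get_attr"):
            continue
        if any(user not in inside for user in node.users):
            outputs.append(node.name)
    return outputs


# Mirrors the graph above: the weight has two users, `add` and `linear`.
weight = Node("p_linear_weight", "placeholder", users=["add", "linear"])
bias = Node("p_linear_bias", "placeholder", users=["linear"])
linear = Node("linear", "call_function", users=["output"])

outputs = partition_outputs([weight, bias, linear])
```

Without the `op` check, `p_linear_weight` would be listed as an output of the linear partition because `add` lives outside it.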

A real world example where people do something like this is in https://github.com/pytorch/pytorch/issues/142035.

Test Plan:
```
 buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r test_module_partitioner_weight_tied
```

Differential Revision: D66998592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142446
Approved by: https://github.com/angelayi
2025-01-04 04:28:20 +00:00
cyy
f9bf9057ef Fix ruff warnings in caffe2 and functorch (#144182)
In preparation for upgrading ruff config to py3.9.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144182
Approved by: https://github.com/malfet
2025-01-04 04:15:01 +00:00
ec1f56fdcf [user triton] add support for prune_configs_by in @triton.autotune (#142207)
This PR adds support for prune_configs_by in the @triton.autotune decorator [docs](https://triton-lang.org/main/python-api/generated/triton.autotune.html#triton.autotune). Supporting this lets users reduce autotuning time by running user-supplied code (early_config_prune, perf_model) to prune the provided list of configs.

We implement this by realizing args/kwargs in call_triton_kernel(...), and then calling kernel.prune_configs(...).
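As a rough illustration of what a user-supplied pruning hook looks like, here is a sketch that runs without Triton installed. `FakeConfig` is a hypothetical stand-in for `triton.Config` (only its `kwargs` dict is modeled), and the `n_elements` / `BLOCK_SIZE` names are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FakeConfig:
    """Hypothetical stand-in for triton.Config; only `kwargs` is modeled."""
    kwargs: dict = field(default_factory=dict)


def early_config_prune(configs, named_args, **kwargs):
    """Drop configs whose block size exceeds the problem size.

    Follows the (configs, named_args, **kwargs) calling convention described
    in the Triton autotune docs; always keeps at least one config.
    """
    n = named_args["n_elements"]
    kept = [cfg for cfg in configs if cfg.kwargs["BLOCK_SIZE"] <= n]
    return kept or configs[:1]


configs = [FakeConfig({"BLOCK_SIZE": b}) for b in (64, 256, 1024)]
pruned = early_config_prune(configs, {"n_elements": 300})
pruned_sizes = [cfg.kwargs["BLOCK_SIZE"] for cfg in pruned]
```

With real Triton, a function like this would be passed as `prune_configs_by={"early_config_prune": early_config_prune}` to `@triton.autotune`, so only the surviving configs are benchmarked.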

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142207
Approved by: https://github.com/zou3519, https://github.com/aakhundov
2025-01-04 03:50:28 +00:00
479d6f2199 [mps/inductor] Add support for log(). (#144169)
Tested via:

```
 % pytest test/inductor/test_mps_basic.py
 ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144169
Approved by: https://github.com/jansel, https://github.com/malfet
2025-01-04 03:07:56 +00:00
087c625261 [dynamo] Trace torch.typename (#144163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144163
Approved by: https://github.com/yanboliang, https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #144129, #144130, #144141, #144158
2025-01-04 02:52:58 +00:00
3292220c43 [dynamo][easy] Move symnode helpers to utils (#144158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144158
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #144129, #144130, #144141
2025-01-04 02:52:58 +00:00
98949df7a4 Fix torch.distributed._functional_collectives.AsyncCollectiveTensor for aten.to. (#134661)
Fixes #133421

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134661
Approved by: https://github.com/bdhirsh
2025-01-04 02:33:38 +00:00
eqy
7e3cd0e488 [CUDA] Check size calculation in ilpReduce for softmax (#144009)
For #143644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144009
Approved by: https://github.com/Skylion007
2025-01-04 02:31:15 +00:00
eqy
dbdda654af [64-bit][CUDA] Upsample2D 64-bit indexing fix attempt 2 (#141923)
#141831
Block/thread math requires a cast...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141923
Approved by: https://github.com/ngimel
2025-01-04 02:30:38 +00:00
1d091e47d6 [Inductor UT] Generalize device-bias code in test_torchinductor.py introduced by #143884. (#144057)
Fix #144056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144057
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-01-04 02:24:33 +00:00
22580f160e Multinomial sampling fix on mps for non contiguous tensors (#141515)
Fixes #141457

As for tests: I looked in `test/test_mps.py`, but the `test_multinomial` function there is disabled. I'm glad to add a test if there is a place where the multinomial function is tested on Metal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141515
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-04 01:21:37 +00:00
464b50dbd7 [MPSInductor] Add floor_div and index_expr implementation (#144083)
Simply copy-n-pasted from CPPInductor

`pytest test/inductor/test_torchinductor.py -k _mps` score is 418 failed, 337 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144083
Approved by: https://github.com/jansel
ghstack dependencies: #144167, #144162
2025-01-04 01:10:01 +00:00
6d25938540 [MPSInductor] Add remainder op (#144162)
For it to return the correct result for half-precision types, the input
must be upcast to float

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144162
Approved by: https://github.com/jansel
ghstack dependencies: #144167
2025-01-04 00:47:40 +00:00
f8e1eacf2f [MPSInductor] Extend constant to bool type (#144167)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144167
Approved by: https://github.com/jansel
2025-01-04 00:47:40 +00:00
d41134f7e5 [Inductor] Fix torch.polygamma() when n == 0 (#144058)
Fixes #143648

aten:

dec1a6d0f0/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp (L436-L447)

compiled kernel code:

```
cpp_fused_polygamma_0 = async_compile.cpp_pybinding(['const float*', 'float*'], '''
#include "/tmp/torchinductor_devuser/tmpi1d9ksww/db/cdb7hyptwxpzukwd42x4ajfjlgrpum4a4htdd6lhb65apclsmno4.h"
extern "C"  void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        {
            {
                auto tmp0 = in_ptr0[static_cast<int64_t>(0L)];
                auto tmp1 = static_cast<float>(0.0);
                auto tmp2 = tmp1 == 0 ? calc_digamma(tmp0) : calc_polygamma(tmp0, tmp1);
                out_ptr0[static_cast<int64_t>(0L)] = tmp2;
            }
        }
    }
}
''')
```
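The `n == 0` branch in the kernel above matters because the general polygamma series, `(-1)^(n+1) * n! * zeta(n+1, x)`, diverges at `n = 0`, so the zeroth order has to go through digamma instead. The following is a numerical sketch under that reading, not the ATen code; `digamma_numeric`, `hurwitz_zeta_partial` and `polygamma_sketch` are hypothetical helpers with deliberately crude accuracy.

```python
import math


def digamma_numeric(x: float, h: float = 1e-5) -> float:
    """Approximate digamma(x) = d/dx ln(Gamma(x)) with a central difference."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)


def hurwitz_zeta_partial(s: int, x: float, terms: int = 100_000) -> float:
    """Truncated Hurwitz zeta sum; only converges for s > 1."""
    return sum((x + k) ** -s for k in range(terms))


def polygamma_sketch(n: int, x: float) -> float:
    """polygamma(n, x), special-casing n == 0 as digamma."""
    if n == 0:
        return digamma_numeric(x)
    sign = 1.0 if (n + 1) % 2 == 0 else -1.0  # (-1)^(n+1)
    return sign * math.factorial(n) * hurwitz_zeta_partial(n + 1, x)


psi0 = polygamma_sketch(0, 1.0)  # close to minus the Euler-Mascheroni constant
psi1 = polygamma_sketch(1, 1.0)  # close to pi^2 / 6
```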

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144058
Approved by: https://github.com/jansel
2025-01-04 00:22:10 +00:00
52742b07c5 remove allow-untyped-defs from nn/utils/_deprecation_utils.py (#144136)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144136
Approved by: https://github.com/aorenste
2025-01-03 23:44:14 +00:00
0a94bb432e [ROCm] CK Flash Attention Backend (#143695)
Replace https://github.com/pytorch/pytorch/pull/138947 for re-import.

Replaces https://github.com/ROCm/pytorch/pull/1592

This PR contains the initial implementation of SDPA with the composable_kernel (CK) backend. The CK path can be forced by calling torch.backends.cuda.preferred_rocm_fa_library("ck"). Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". Not setting this option results in aotriton being used as the backend. In the case of CK, if PyTorch deems flash attention usable, it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics that select which attention scheme to use (i.e. flash attention vs. memory-efficient attention vs. math, etc.). The CK path is only called when flash attention is both enabled (via USE_FLASH_ATTENTION) and selected at runtime by the existing heuristics.

Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention, courtesy of the hard work of @tridao, who is the co-author.

NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143695
Approved by: https://github.com/malfet

Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
2025-01-03 22:01:36 +00:00
3251171ae8 Make whl metadata public readable (#144164)
After https://github.com/pytorch/pytorch/pull/143677/files#r1902138480 landed, the new nightly wheel metadata is not publicly readable, causing pip install to fail, for example https://github.com/pytorch/pytorch/actions/runs/12603415308/job/35128414909.

FBGEMM folks also noticed this failure on their end (cc @q10)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144164
Approved by: https://github.com/clee2000
2025-01-03 21:08:49 +00:00
9bf2a9a616 [ScaledMM] Fix NaNs in test for garbage input data (#144042)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144042
Approved by: https://github.com/janeyx99
2025-01-03 21:02:25 +00:00
b75f32b848 Update TorchDynamo-based ONNX Exporter memory usage example code. (#144139)
Address related comments earlier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144139
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-01-03 20:41:36 +00:00
64bffb3124 remove allow-untyped-defs onnx/_internal/exporter/_fx_passes.py (#144134)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144134
Approved by: https://github.com/Skylion007
2025-01-03 20:18:40 +00:00
64b197b603 remove allow-untyped-defs from export/_remove_auto_functionalized_pass.py (#144135)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144135
Approved by: https://github.com/Skylion007
2025-01-03 20:08:11 +00:00
9b8a4e7141 remove allow-untyped-defs from torch/onnx/operators.py (#144133)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144133
Approved by: https://github.com/Skylion007
2025-01-03 20:07:56 +00:00
6e09d32c00 remove allow-untyped-defs from torch/jit/_passes/_property_propagation.py (#144132)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144132
Approved by: https://github.com/Skylion007
2025-01-03 20:07:37 +00:00
eb7a303d21 [dtensor] expose the __create_chunk_list__ in the doc (#144100)
As titled, this PR exposes this dunder method as a public API in the doc,
so that different checkpoint implementations can leverage this protocol,
instead of exposing a separate API

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144100
Approved by: https://github.com/awgu
ghstack dependencies: #144099
2025-01-03 20:06:23 +00:00
45411d1fc9 Use absolute path path.resolve() -> path.absolute() (#129409)
Changes:

1. Always explicit `.absolute()`: `Path(__file__)` -> `Path(__file__).absolute()`
2. Replace `path.resolve()` with `path.absolute()` if the code is resolving the PyTorch repo root directory.
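The practical difference between the two calls shows up with symlinks: `absolute()` only makes the path absolute, while `resolve()` also follows symlinks. A small sketch, assuming a POSIX-like system where creating symlinks is permitted; the directory names are invented for the example.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Resolve the temp dir itself first so only our own symlink differs.
    base = Path(tmp).resolve()
    real_dir = base / "real_repo"
    real_dir.mkdir()
    link = base / "linked_repo"
    os.symlink(real_dir, link, target_is_directory=True)

    via_absolute = link.absolute()  # keeps the symlinked name
    via_resolve = link.resolve()    # follows the symlink to the real directory
```

Here `via_absolute` still ends in `linked_repo`, whereas `via_resolve` ends in `real_repo`, which is why code that locates the repo root through a symlinked checkout can behave differently depending on which call it uses.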

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129409
Approved by: https://github.com/albanD
2025-01-03 20:03:40 +00:00
e9e18a9617 remove allow-untyped-defs from _export/db/logging.py (#144093)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144093
Approved by: https://github.com/Skylion007
2025-01-03 19:36:14 +00:00
ad09395674 [MPSInductor] Fix multi rangevar kernel invocation (#144050)
By changing `thread_position_in_grid` type to uint{n} and passing
dimensions during the kernel call

`pytest test/inductor/test_torchinductor.py -k _mps` score is 445 failed, 309 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144050
Approved by: https://github.com/jansel
ghstack dependencies: #144055, #144051, #144122, #144105, #144156
2025-01-03 19:32:43 +00:00
52e107a7ca [MPSInductor] Add constant, isinf and isnan ops (#144156)
Per Table 6.5 of [Metal Language Specification](https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf) infinity is `HUGE_VALF`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144156
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #144055, #144051, #144122, #144105
2025-01-03 19:32:43 +00:00
383ff4011c [ez] Use strip for arg sanitization in upload_metadata_file to improve readability (#144155)
Minor thing that improves readability.  I didn't realize you could specify characters for strip when I wrote this
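For reference, `str.strip` accepts a string of characters to remove from both ends. A minimal sketch; `sanitize_arg` is a hypothetical helper name, not the function in `upload_metadata_file`.

```python
def sanitize_arg(value: str) -> str:
    """Drop surrounding quotes and whitespace from a CLI-style argument."""
    return value.strip("\"' \t\n")


cleaned = sanitize_arg('  "pytorch/pytorch"  ')

# The argument is treated as a set of characters, not as a literal prefix/suffix.
set_semantics = "xyxabcyx".strip("xy")
```

Because the argument is a character set, every leading and trailing character from that set is removed, in any order, until a character outside the set is reached.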
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144155
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2025-01-03 19:25:30 +00:00
8b3479e361 remove allow-untyped-defs from torch/distributed/fsdp/_dynamo_utils.py (#144131)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144131
Approved by: https://github.com/Skylion007
2025-01-03 19:07:21 +00:00
7b69f7b449 Clarify what we mean by decoupled weight decay in the *AdamWs (#144101)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144101
Approved by: https://github.com/albanD
2025-01-03 19:06:00 +00:00
c36f94b373 [while_loop][dynamo] auto-unspecialize int input and output to unbacked symints (#143106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143106
Approved by: https://github.com/zou3519
ghstack dependencies: #143105, #143545
2025-01-03 19:01:07 +00:00
5660709856 [hop][BE] unify meta checking with check_meta_consistency (#143545)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143545
Approved by: https://github.com/zou3519
ghstack dependencies: #143105
2025-01-03 19:01:07 +00:00
6e8dca9ff3 [while_loop][aot] auto-unspecialize int input and output to unbacked symints (#143105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143105
Approved by: https://github.com/zou3519
2025-01-03 19:01:07 +00:00
56f6289f6a [mps/inductor] Add support for atanh(). (#144121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144121
Approved by: https://github.com/jansel, https://github.com/malfet
2025-01-03 18:55:05 +00:00
a7b61c5b49 [MPSInductor] Add signbit op support (#144105)
By mapping it to `metal::signbit`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144105
Approved by: https://github.com/jansel, https://github.com/Skylion007
ghstack dependencies: #144055, #144051, #144122
2025-01-03 18:34:46 +00:00
8d63a4a409 Revert "Set enable_trace_contextlib_contextmanager flag to True (#140604)"
This reverts commit 1c817fe6714cec510ccc6022b2c3e66146c3ad59.

Reverted https://github.com/pytorch/pytorch/pull/140604 on behalf of https://github.com/guilhermeleobas due to breaking one of the benchmarks (moco) ([comment](https://github.com/pytorch/pytorch/pull/140604#issuecomment-2569640837))
2025-01-03 18:23:53 +00:00
c5c897c3a1 [dynamo][easy] Miscellaneous fixes (#144141)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144141
Approved by: https://github.com/williamwen42
ghstack dependencies: #144129, #144130
2025-01-03 18:22:56 +00:00
732359c633 [dynamo][easy] Minor fixes in guards.cpp (#144130)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144130
Approved by: https://github.com/williamwen42
ghstack dependencies: #144129
2025-01-03 18:22:56 +00:00
a450e177fd [dynamo] remove inline inbuilt tests as flag is enabled by default (#144129)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144129
Approved by: https://github.com/williamwen42
2025-01-03 18:22:56 +00:00
2409b49a33 Revert "Rewrite _reparametrize_module to use contextmanager (#138203)"
This reverts commit 7bf3b7cdc5631f9991eebcdd8ec09095339a9973.

Reverted https://github.com/pytorch/pytorch/pull/138203 on behalf of https://github.com/guilhermeleobas due to breaking one of the benchmarks (moco) ([comment](https://github.com/pytorch/pytorch/pull/138203#issuecomment-2569634001))
2025-01-03 18:17:32 +00:00
60fe8a65af [Inductor] Generalize tiling algorithm to handle fused reductions (#144041)
# Issue

This PR cleans up an edge case that wasn't handled by https://github.com/pytorch/pytorch/pull/137243. The existing tiling code assumes that `node.get_ranges()` is a reliable source of pointwise and reduction numels. This is true for pointwise kernels, but the situation is more complicated with reductions. Since reductions change the number of elements in a tensor, not all ops within a reduction kernel will have the same number of iterations. For example, `var_mean` fuses pointwise division with the output of reduction sum, and the division lacks the corresponding reduction ranges.

# Fix

Instead of getting numels from `node.get_ranges()`, explicitly pass the global pointwise and reduction numels to the relevant tiling functions. In `SIMDKernel.complete_partial_tiling`, we solve for the missing numel by dividing the global numel by the partial tiling's numel. This ensures all tilings have the correct global numel.
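The arithmetic behind completing a partial tiling can be sketched as follows. This is a simplified model using plain integer lists; `complete_partial_tiling_sketch` is a hypothetical function and does not reproduce the signature or symbolic handling of `SIMDKernel.complete_partial_tiling`.

```python
import math


def complete_partial_tiling_sketch(partial_sizes, global_numel):
    """Prepend the one missing tile size so the product equals global_numel."""
    known = math.prod(partial_sizes)
    if known == 0 or global_numel % known != 0:
        raise ValueError(
            f"partial tiling {partial_sizes} does not divide numel {global_numel}"
        )
    missing = global_numel // known
    return [missing, *partial_sizes]


tiling_a = complete_partial_tiling_sketch([8], 64)
tiling_b = complete_partial_tiling_sketch([4, 2], 64)
```

Whatever the partial tiling is, the completed tiling always multiplies out to the global numel that was passed in, which is the invariant the fix relies on.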

Also, in `SIMDKernel.is_compatible`, add the global reduction numel to node ranges that are missing it. For example, `{"x": 8, "r0_": 8}` is compatible with  a node of ranges `([8], [])` when we have `reduction_numel=8`.

Finally, this PR generalizes some of the existing codegen to handle multiple reduction dims. We already had code to ignore reduction splits for pointwise kernels, but it only worked for 1D reductions. Now it can handle ND.

# Test plan

This PR parametrizes the existing CI test for `var_mean` to also run with tiled reductions. It also adds a new test checking that `var_mean` generates 2D tilings (with tiled reduction enabled). These new tests would fail on the current main branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144041
Approved by: https://github.com/jansel
2025-01-03 18:16:27 +00:00
e93f625d00 [AOTI] don't codegen autotune_at_compile_time for non-Triton kernels (#143990)
`autotune_at_compile_time` is a separate codegen file specifically for autotuning Triton kernels. We can skip it for non-Triton kernels (like CUTLASS).

This test (test_aoti_workspace_ptr) checks that `workspace_0.data_ptr()` is codegen-ed correctly in AOTI.

```
// in AOTI codegen
kernels.cuda_fused_0(
  (const half*)arg0_1.data_ptr(), (const half*)arg1_1.data_ptr(), (half*)buf0.data_ptr(),
  (int)200, (int)5216, (int)10432, (int)10432, (int)5216, (int)0, (int)5216,
  (size_t*)nullptr, (uint8_t*)workspace_0.data_ptr(), stream);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143990
Approved by: https://github.com/henrylhtsang, https://github.com/chenyang78, https://github.com/desertfire
2025-01-03 18:01:12 +00:00
f3968373c1 Migrate the rest of CUDA 12.1 jobs to 12.4 (#144118)
CUDA 12.4 is the default now and we don't build nightly 12.1 anymore, so it's time to move the rest of CI jobs to 12.4.  I also clean up some redundant CI jobs on periodic and inductor-periodic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144118
Approved by: https://github.com/atalman
2025-01-03 17:45:41 +00:00
cbdc70ae07 Use the build environment as sccache prefix instead of workflow name (#144112)
This is an attempt to improve cache usage for jobs in non-pull workflows like periodic, slow, or inductor as we are seeing build timeout there from time to time, for example https://github.com/pytorch/pytorch/actions/runs/12553928804.  The build timeout never happens in pull or trunk AFAICT because they are more up to date with the cache content coming from the PR itself.

Logically, the same build should use the same cache regardless of the workflows.  We have many examples where the same build, for example [linux-focal-cuda12.4-py3.10-gcc9-sm86](https://github.com/search?q=repo%3Apytorch%2Fpytorch+linux-focal-cuda12.4-py3.10-gcc9-sm86&type=code), is split between different workflows and, thus, uses different caches.

I could gather some sccache stats from CH in the meantime to try to prove the improvement before and after this lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144112
Approved by: https://github.com/malfet
2025-01-03 17:33:03 +00:00
b9fbd65dfd AOTI fallback ops: remove ops that were never codegen'ed (#143421)
Removes 4 fallback ops that are currently not possible to codegen, which does not break ABI-compatibility.

1. `_cudnn_rnn_backward` and `_histogramdd_bin_edges` both return `Tensor[]`, which we cannot codegen with the current design.
2. `_sparse_coo_tensor_with_dims_and_tensors` only supplies a Sparse operator, which we don't support.
3. `zeros.names` requires a `Dimname` input, which we can't currently codegen.

Removing these ops from the list will improve test performance, since the fallback op generation will use the Python proxy executor instead of calling non-existent C functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143421
Approved by: https://github.com/desertfire
ghstack dependencies: #141371, #143223
2025-01-03 16:05:38 +00:00
b5b419d627 cpp_wrapper: Use runtime dispatched fallbacks for complex ops (#143223)
When calling a fallback op in cpp_wrapper mode, where any of the inputs are complex numbers, utilize the runtime dispatched fallback mode. This properly handles the Conjugate and Negative dispatch keys, if present, in exchange for a performance pessimization in complex arithmetic.

This PR additionally fixes some cascading failure modes exposed in our `aot_inductor` tests by this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143223
Approved by: https://github.com/desertfire
ghstack dependencies: #141371
2025-01-03 16:05:38 +00:00
e88d06f54e ir.ExternKernel: correctly handle kwarg default arguments (#141371)
Additionally, enable torchinductor opinfo tests exercising all
previously fixed bugs in this stack.

Note: I've manually sharded the cpp_wrapper CI checks into 2 shards.
Once all OpInfo tests are enabled we should switch back to automatic
sharding, but until then the pipeline doesn't have appropriate timing
stats.  More shards would be helpful given the compilation slowdown
associated with cpp_wrapper, but 2 will do for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141371
Approved by: https://github.com/desertfire
2025-01-03 16:05:31 +00:00
f7644efa79 [MPSInductor][EZ] Fix logical_[or|end] ops (#144122)
For boolean operands it does not really matter whether `&` or `&&` is
used, but if one were ever to rely on operator precedence, then bitwise ops
should have higher precedence than logical ones

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144122
Approved by: https://github.com/huydhn
ghstack dependencies: #144055, #144051
2025-01-03 15:28:07 +00:00
b336d72dae [MPSInductor] Preserve dtype during load (#144051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144051
Approved by: https://github.com/Skylion007
ghstack dependencies: #144055
2025-01-03 15:17:33 +00:00
a1ae8fadc7 [cpu][vec] support reduce ops for add and max (#144065)
### Description

While adding support for INT8 SDPA in https://github.com/pytorch/ao/pull/1372, we found that `at::vec::vec_reduce_all<int32_t>` falls into the slow scalar path when doing sum and max. So here, we support the two reduce-related ops `reduce_add` and `reduce_max` for `vec512` and `vec256`, using the Sequence instructions.

### Details
- Support vectorized `reduce_add` and `reduce_max` for dtypes `int32` and `float32`, using the Sequence instructions;
- Implement the scalar version for fallback path in vec base;
- Add the operator `reduce` in vec base, in order to simplify the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144065
Approved by: https://github.com/mingfeima
2025-01-03 13:01:52 +00:00
55dc61dd52 Dataloader distribute tasks to workers when in_order is False (#142324)
Fixes #105203 and is a follow up PR to #141833

When `in_order` is True (the default), tasks are given out to workers in a round robin fashion. When `in_order` is False this is no longer needed, as we give up guarantees of reproducibility, and instead tasks should be given to workers that are able to perform work.
In this PR I've added tracking of the number of outstanding tasks for each worker (updated when tasks are added to their queue, and when data is returned to the main thread). When finding the next queue to add a task to, if `in_order` is False it will only add the task to the worker's queue if it has fewer than `_prefetch_factor` tasks outstanding.
The current default behaviour is left as is.
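The dispatch rule described above can be modeled with a short, self-contained sketch. It is a simplification under stated assumptions: `next_worker` is a hypothetical function, worker liveness checks are ignored, and `outstanding` is just a list of per-worker task counts.

```python
def next_worker(outstanding, start, in_order, prefetch_factor):
    """Pick the worker index for the next task, or None if nobody has capacity.

    in_order=True: plain round robin, always the worker at `start`.
    in_order=False: first worker from `start` with spare capacity.
    """
    n = len(outstanding)
    for offset in range(n):
        idx = (start + offset) % n
        if in_order or outstanding[idx] < prefetch_factor:
            return idx
    return None


# Worker 0 is saturated, worker 1 is idle, worker 2 has one task in flight.
counts = [2, 0, 1]
ordered_pick = next_worker(counts, start=0, in_order=True, prefetch_factor=2)
unordered_pick = next_worker(counts, start=0, in_order=False, prefetch_factor=2)
all_busy = next_worker([2, 2, 2], start=0, in_order=False, prefetch_factor=2)
```

In the ordered case the saturated worker still receives the task, preserving the deterministic assignment; in the unordered case the task goes to the first worker with room, and nothing is dispatched when every worker is at its limit.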

Tests are also updated to assert on the worker IDs for each sample of data returned.
I've run the following to confirm they aren't flaky
```bash
for i in {1..20}; do python test/test_dataloader.py TestOutOfOrderDataLoader; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142324
Approved by: https://github.com/andrewkho
2025-01-03 12:57:04 +00:00
c09bf71bd6 [Inductor][CPU] Fix C++ compile error of torch.max on bool type (#143848)
Fix https://github.com/pytorch/pytorch/issues/143568
Before:
![image](https://github.com/user-attachments/assets/3e1e869e-7ae7-45c0-a334-8a663028e003)
After:
![image](https://github.com/user-attachments/assets/91f72920-64bd-449a-a6c6-6048409c1450)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143848
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2025-01-03 09:00:43 +00:00
d9507548d8 [dynamo][BE] move zip_longest polyfill to submodule polyfills.itertools (#144067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144067
Approved by: https://github.com/yanboliang
ghstack dependencies: #144066
2025-01-03 08:08:31 +00:00
fb1beb31d2 [dynamo][BE] move dropwhile polyfill to submodule polyfills.itertools (#144066)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144066
Approved by: https://github.com/jansel
2025-01-03 08:08:31 +00:00
00df63f09f [ROCm] Fix for ld failed to convert GOTPCREL relocation in PyTorch build (#143986)
I experienced an error while doing a DEBUG build of pytorch on rocm:
```
additional relocation overflows omitted from the output
/usr/bin/ld: failed to convert GOTPCREL relocation; relink with --no-relax
```
Based on discussions on the similar issue #138427, adding `--offload-compress` to HIP_HIPCC_FLAGS fixed the DEBUG build on my node.

Further updated the PR to enable the flag for non-DEBUG builds as well due to the size reduction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143986
Approved by: https://github.com/jeffdaily
2025-01-03 06:53:08 +00:00
e141cb9c34 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2025-01-03 05:41:06 +00:00
48a05ee773 [dtensor] improve doc of the DTensor class (#144099)
as titled: explicitly list all public members to make sure the public
API stays consistent, also use groupwise as the member order to make doc
look better

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144099
Approved by: https://github.com/awgu
2025-01-03 05:35:44 +00:00
41b5c600df [ReduceOps] Add dimension checking for cummin()/cummax(). (#143920)
Summary: cum{min,max} didn't guard against 0-d vector and allowed an arbitrary dimension to be passed.
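The kind of check that was missing can be sketched in plain Python. This is an illustrative model, not the ATen implementation; `wrap_dim` is a hypothetical helper, and it assumes the usual convention that a 0-d tensor accepts only `dim` values `-1` and `0`.

```python
def wrap_dim(dim: int, ndim: int) -> int:
    """Normalize `dim` for a tensor with `ndim` dimensions, raising if out of range."""
    effective = max(ndim, 1)  # a 0-d tensor is treated as having one dimension
    lo, hi = -effective, effective - 1
    if not lo <= dim <= hi:
        raise IndexError(
            f"Dimension out of range (expected to be in range of [{lo}, {hi}], but got {dim})"
        )
    return dim + effective if dim < 0 else dim


ok = wrap_dim(-1, ndim=0)  # valid for a 0-d tensor, normalizes to 0

try:
    wrap_dim(2, ndim=0)    # arbitrary dims must be rejected
    message = ""
except IndexError as exc:
    message = str(exc)
```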

Test Plan: torch_test.py

Fixes #71477

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143920
Approved by: https://github.com/malfet
2025-01-03 04:14:33 +00:00
c5b75f8db1 [AOTI] Remove more AOTI_TORCH_EXPORT (#144081)
Summary: Similar to https://github.com/pytorch/pytorch/pull/142500, remove redundant AOTI_TORCH_EXPORT from several cpp files, to solve a windows build issue.

Differential Revision: D67762069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144081
Approved by: https://github.com/yushangdi
2025-01-03 02:17:38 +00:00
c31912666e [ROCm] Print amdgpu info on bare metal for CI runners (#144038)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144038
Approved by: https://github.com/jeffdaily
2025-01-03 02:00:40 +00:00
37e9da0687 [ROCm][Windows] Disable roctracer-related code (#143329)
Currently, roctracer is not available for Windows. This PR disables any usage of it on Windows and creates dummy functions there to keep compatibility with existing code; these functions warn the user that roctracer is unavailable on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143329
Approved by: https://github.com/sraikund16
2025-01-03 01:51:01 +00:00
891a86d1ad remove allow-untyped-defs from ao/quantization/experimental/fake_quantize.py (#144091)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144091
Approved by: https://github.com/aorenste
2025-01-03 01:26:36 +00:00
377e29745f remove allow-untyped-defs from distributed/elastic/utils/data/cycling_iterator.py (#144090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144090
Approved by: https://github.com/aorenste
2025-01-03 01:22:50 +00:00
0d6db839a7 remove allow-untyped-defs from utils/data/datapipes/iter/streamreader.py (#144088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144088
Approved by: https://github.com/aorenste
2025-01-03 01:21:44 +00:00
bdfb40ed29 remove allow-untyped-defs from utils/_import_utils.py (#144089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144089
Approved by: https://github.com/aorenste
2025-01-03 01:21:13 +00:00
28a74fe3aa remove allow-untyped-defs from torch/mps/event.py (#144092)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144092
Approved by: https://github.com/aorenste
2025-01-03 01:20:17 +00:00
496fc90965 [CI] Multigpu 1 -> 2 shards (#143992)
It's been timing out https://github.com/pytorch/pytorch/actions/runs/12544191739/job/34977636276

They're still somewhat uneven but they're both under the limit now.  It would probably be better to use run_test.py's sharding to do this, maybe in another PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143992
Approved by: https://github.com/huydhn
2025-01-03 00:33:16 +00:00
3eb3f4ed55 Upload METADATA file with whl binaries (#143677)
Upload the metadata file for wheels for pep658 https://peps.python.org/pep-0658/
Using a python script but using bash might be easier...
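For context, PEP 658 expects the wheel's `METADATA` file to be served next to the wheel at the wheel's URL plus `.metadata`. Below is a minimal sketch of the extraction step using only the standard library; it is not the script added in this PR, the function names are hypothetical, and the upload itself is omitted.

```python
import io
import zipfile


def extract_wheel_metadata(wheel_bytes: bytes) -> bytes:
    """Return the contents of <name>-<version>.dist-info/METADATA from a wheel."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as zf:
        names = [
            name for name in zf.namelist()
            if name.count("/") == 1 and name.endswith(".dist-info/METADATA")
        ]
        if len(names) != 1:
            raise ValueError(f"expected exactly one METADATA file, found {names}")
        return zf.read(names[0])


def metadata_key(wheel_key: str) -> str:
    """PEP 658 location of the metadata file: the wheel path plus '.metadata'."""
    return wheel_key + ".metadata"


# Build a tiny in-memory wheel-like archive to exercise the helper.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("demo-1.0.dist-info/METADATA",
                "Metadata-Version: 2.1\nName: demo\nVersion: 1.0\n")
    zf.writestr("demo/__init__.py", "")

metadata = extract_wheel_metadata(buf.getvalue())
key = metadata_key("whl/nightly/demo-1.0-py3-none-any.whl")
```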

--

Testing

Example run https://github.com/pytorch/pytorch/actions/runs/12550595201/job/34994883276 without actual upload, just dry run

Lightly tested the script to make sure it uploads to s3, but integration with the bash script + workflow is untested

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143677
Approved by: https://github.com/seemethere
2025-01-03 00:32:05 +00:00
bb5e439f2d Add networkx as bazel dep to fix CI failure (#143995)
Add networkx as a dependency for test_bazel

Example failure: https://github.com/pytorch/pytorch/actions/runs/12551752021/job/34996706301

```

INFO: From Testing //:test_bazel:
==================== Test output for //:test_bazel:
Traceback (most recent call last):
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/test/_test_bazel.py", line 33, in <module>
    test_simple_compile_eager()
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/test/_test_bazel.py", line 27, in test_simple_compile_eager
    opt_foo1 = torch.compile(foo, backend="eager")
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/__init__.py", line 2533, in compile
    backend = _TorchCompileWrapper(backend, mode, options, dynamic)
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/__init__.py", line 2342, in __init__
    self.compiler_fn = lookup_backend(backend)
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/backends/registry.py", line 66, in lookup_backend
    _lazy_import()
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/backends/registry.py", line 102, in _lazy_import
    import_submodule(backends)
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/utils.py", line 2797, in import_submodule
    importlib.import_module(f"{mod.__name__}.{filename[:-3]}")
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/execroot/pytorch/external/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/backends/common.py", line 12, in <module>
    from torch._functorch.aot_autograd import (
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_functorch/aot_autograd.py", line 147, in <module>
    from .partitioners import default_partition
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_functorch/partitioners.py", line 31, in <module>
    from ._activation_checkpointing.graph_info_provider import GraphInfoProvider
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_functorch/_activation_checkpointing/graph_info_provider.py", line 3, in <module>
    import networkx as nx
ModuleNotFoundError: No module named 'networkx'
```

No periodic runs on this PR or its main branch commit, but I'm pretty sure it started on https://togithub.com/pytorch/pytorch/pull/143539

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143995
Approved by: https://github.com/huydhn
2025-01-02 19:42:18 +00:00
a8c98ce175 [cutlass-3] Update third-party/cutlass-3 from 3.4 to 3.5.1 (#143515)
# Summary:

This also makes updates to different repositories throughout FB code to roll any updates needed for this new release.

I was not able to get AsyncMM.cu to build (still trying); Yfiu suggested that I just skip it for now.

Test Plan:
Have run various build commands to try and expose errors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143515
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-01-02 18:45:11 +00:00
8506a2af9a remove allow-untyped-defs from _export/pass_infra/proxy_value.py (#143944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143944
Approved by: https://github.com/aorenste
ghstack dependencies: #143943
2025-01-02 18:17:03 +00:00
8f3eb84373 ROCm: Enable 4 gpu tests for distributed config (#140319)
Change the label to make sure the jobs land on a
node which has >= 4 GPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140319
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/kwen2501
2025-01-02 17:22:11 +00:00
916b510ff5 Enable mkldnn pattern matcher tests for BF16 on AArch64 (#144030)
Fixes #143146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144030
Approved by: https://github.com/malfet
2025-01-02 17:13:38 +00:00
a93e75d1e2 [MPS] Handle implicit cpu-scalar-to-gpu transfer (#144055)
Followup after https://github.com/pytorch/pytorch/pull/143934, this check is no longer necessary and fixes a subset of inductor tests

Before this change, `pytest test/inductor/test_torchinductor.py -k _mps` reported 463
failed, 291 passed, 32 skipped; after it, 456 failed, 298 passed, 32 skipped
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144055
Approved by: https://github.com/Skylion007
2025-01-02 17:12:39 +00:00
0431d47eaa [tp] propagate src_data_rank kwarg in TP API (#144005)
As titled, this PR propagates src_data_rank in the TP API, so that
module-level APIs can use the flexibility to choose
src_data_rank and avoid the communication when it is not needed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144005
Approved by: https://github.com/tianyu-l
ghstack dependencies: #143883
2025-01-02 05:35:52 +00:00
f242dbb76f [dtensor] add src_data_rank to distribute_tensor API (#143883)
As titled, this PR adds a kwarg src_data_rank to the distribute_tensor
API, to allow users to specify a particular rank as the source of the
full tensor data. Previously we used group_rank=0 by default as the source of
truth for single-device semantics. This new option:

* gives advanced users the flexibility to choose the source data rank
* allows users to specify None explicitly, which means we skip the
  communication needed (scatter/broadcast) for cases that do not
  care about single-device semantics (e.g. loading from a checkpoint)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143883
Approved by: https://github.com/XilunWu, https://github.com/tianyu-l
2025-01-02 05:35:52 +00:00
dec1a6d0f0 [dynamo] Separate out GetItemSource and DictGetItemSource (#143926)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143926
Approved by: https://github.com/jansel
2025-01-01 02:39:41 +00:00
8d9ff9c8a4 Fix a bug for wrong stride in fake tensor (#141427)
Fixes #141426

Please see details in the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141427
Approved by: https://github.com/jansel
2024-12-31 23:45:32 +00:00
e7ed660233 [inductor] Add missing py312 xfail (#144006)
See #144006

```py
__________________________________________ CudaReproTests.test_repeated_masked_load __________________________________________
RuntimeError: First class dim doesn't work with python 3.12

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jansel/conda/envs/pytorch/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/jansel/conda/envs/pytorch/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/jansel/conda/envs/pytorch/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/jansel/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper
    method(*args, **kwargs)
  File "/home/jansel/pytorch/test/inductor/test_cuda_repro.py", line 1678, in test_repeated_masked_load
    from functorch.einops import rearrange
  File "/home/jansel/pytorch/functorch/einops/__init__.py", line 1, in <module>
    from .rearrange import rearrange
  File "/home/jansel/pytorch/functorch/einops/rearrange.py", line 7, in <module>
    from functorch._C import dim as _C
ImportError: initialization failed
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144006
Approved by: https://github.com/Skylion007
2024-12-31 23:37:05 +00:00
a174ee2255 Revert "Fix duplicate pattern error (#139321)"
This reverts commit 9e8d84f8631317ce61de4f0f9731fc1b1ccc3d2b.

Reverted https://github.com/pytorch/pytorch/pull/139321 on behalf of https://github.com/jeanschmidt due to breaking internal signals ([comment](https://github.com/pytorch/pytorch/pull/139321#issuecomment-2566620095))
2024-12-31 17:44:02 +00:00
d8a2796fb6 Revert "[Inductor UT] Generalize newly introduced device-bias hard code in (#143975)"
This reverts commit 7c1c0730beed9bb05a16ba678a8f32b29fdd0a29.

Reverted https://github.com/pytorch/pytorch/pull/143975 on behalf of https://github.com/jeanschmidt due to Need to revert in order to be able to revert https://github.com/pytorch/pytorch/pull/139321 feel free to merge it back once conflicts are cleared ([comment](https://github.com/pytorch/pytorch/pull/143975#issuecomment-2566619312))
2024-12-31 17:41:06 +00:00
eec30916e7 Revert "Update low prec codegen for div/mod (#142350)"
This reverts commit 135a2d44830b2de1ed6714f52cc6a612406adb6d.

Reverted https://github.com/pytorch/pytorch/pull/142350 on behalf of https://github.com/jeanschmidt due to breaking internal signals ([comment](https://github.com/pytorch/pytorch/pull/142350#issuecomment-2566615835))
2024-12-31 17:35:32 +00:00
5ef0de7615 [MPSInductor] Fix multiple kernel generation (#143998)
At the moment, this is done by generating multiple MetalLibraries

`pytest test/inductor/test_torchinductor.py -k _mps` score is 434 failed, 317 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143998
Approved by: https://github.com/jansel, https://github.com/ruidazeng
ghstack dependencies: #143948, #143949, #143973, #143977
2024-12-31 13:51:50 +00:00
f0f09bb3c2 [MPSInductor] Implement minimum and maximum ops (#143977)
By calling `metal::min` and `metal::max` respectively with argument
typecast to a common type to avoid ambiguous calls errors

TODO: Implement NaN propagation for both eager and compile, see https://github.com/pytorch/pytorch/issues/143976
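
To illustrate what the TODO refers to: `torch.maximum` returns NaN if either input is NaN, while a C-style `fmax` returns the non-NaN operand. A minimal scalar sketch of the difference, assuming the underlying min/max behaves like `fmax`/`fmin` (that is an assumption here, not something this PR states; both helper names are hypothetical):

```python
import math


def fmax_like(a: float, b: float) -> float:
    """C fmax semantics: if exactly one operand is NaN, return the other."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return max(a, b)


def nan_propagating_max(a: float, b: float) -> float:
    """Semantics of torch.maximum: NaN in either operand yields NaN."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return max(a, b)
```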

`pytest test/inductor/test_torchinductor.py -k _mps` score is 460 failed, 291 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143977
Approved by: https://github.com/jansel
ghstack dependencies: #143948, #143949, #143973
2024-12-31 13:51:50 +00:00
09e47ab7ab Refine CUDA Stream priority (#143849)
# Motivation
As mentioned in https://github.com/pytorch/pytorch/pull/141119#discussion_r1897480515, we properly handle the priority value if it is outside of the priority range.

# Additional Context
If the value falls outside of the allowed priority range, it will automatically be mapped to the nearest valid priority (either lowest or highest).
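
A minimal sketch of the clamping described above, assuming CUDA-style numbering in which a numerically smaller value means a higher priority (the function name and the example range are hypothetical, not taken from the PR):

```python
def clamp_priority(priority: int, lowest: int, highest: int) -> int:
    """Map `priority` to the nearest value inside the valid range.

    `lowest` is the numerically largest allowed value and `highest` the
    numerically smallest one, so a valid range satisfies highest <= lowest.
    """
    if highest > lowest:
        raise ValueError("expected highest <= lowest (smaller number = higher priority)")
    return max(highest, min(priority, lowest))
```

With a range of `lowest=0`, `highest=-1`, a request for `5` maps to `0` and a request for `-7` maps to `-1`.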

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143849
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #142347, #141119, #141123, #143799
2024-12-31 11:15:59 +00:00
3848de55ed Add get_stream_from_external API for CUDA backend (#143799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143799
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #142347, #141119, #141123
2024-12-31 11:15:59 +00:00
8f6c4d1732 Add get_stream_from_external API for XPU backend (#141123)
# Motivation
This PR aims to introduce `torch.xpu.ExternalStream`, which wraps a SYCL queue created in another library so that it can be used in PyTorch.

# Additional Context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141123
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #142347, #141119
2024-12-31 11:15:52 +00:00
a68c0ca497 Add low priority XPU Stream (#141119)
# Motivation
Due to the potential for the external SYCL queue to have a low priority, we need to support the low-priority SYCL queue for native XPU Streams to maintain consistency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141119
Approved by: https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #142347
2024-12-31 11:15:45 +00:00
39450ae655 Refine XPU external Stream (#142347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142347
Approved by: https://github.com/gujinghui, https://github.com/albanD
2024-12-31 11:15:38 +00:00
16a57e232c removed dead code for dynamo flag dead_code_elimination (#140938)
Fixes #136862

1.  removed dead code from torch/_dynamo/convert_frame.py
2.  ran `lintrunner -a` and all the tests passed.
3. ran the unit tests and everything seems to be in order.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140938
Approved by: https://github.com/zou3519
2024-12-31 09:27:43 +00:00
01034e963c [AOTI] Not use AOTI_TORCH_CHECK in non AOTI mode. (#143970)
Fix #143967

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143970
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-12-31 06:28:32 +00:00
a2753e376b [Inductor] Support tiling reduction dimensions (#137243)
Fixes #134277 and https://github.com/pytorch/pytorch/issues/142317.

Sub-PRs containing refactors from this one:
 - https://github.com/pytorch/pytorch/pull/141733
 - https://github.com/pytorch/pytorch/pull/141738
 - https://github.com/pytorch/pytorch/pull/141751 (based off the former)
 - https://github.com/pytorch/pytorch/pull/142249
 - https://github.com/pytorch/pytorch/pull/142020
 - https://github.com/pytorch/pytorch/pull/143135

 These refactor PRs should land before the main one.

# Feature

*Note: to minimize risk, multi-dimensional reductions are gated by the flag `config.triton.tile_reductions`, which defaults to False.*

Instead of having a single reduction dimension called `"r"`, we can now support 2D reductions with `"r0_"` and `"r1_"` dimensions. 2D reductions generate two nested loops, with different block pointer advancements in each loop body. Most of the implementation is generic to ND reductions, but for now the tiling algorithm sets a hard limit at 2D.

Here's an example of a 2D persistent reduction kernel:
```
@triton.jit
def triton_per_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, r1_numel, XBLOCK : tl.constexpr):
    xnumel = 1
    r0_numel = 15
    R0_BLOCK: tl.constexpr = 16
    r1_numel = 15
    R1_BLOCK: tl.constexpr = 16
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None]
    xmask = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], True, tl.int1)
    r0_index = tl.arange(0, R0_BLOCK)[None, :, None]
    r0_offset = 0
    r0_mask = r0_index < r0_numel
    r1_index = tl.arange(0, R1_BLOCK)[None, None, :]
    r1_offset = 0
    r1_mask = r1_index < r1_numel
    rnumel = r0_numel * r1_numel
    RBLOCK: tl.constexpr = R0_BLOCK*R1_BLOCK
    roffset = r1_offset + (r0_offset*r1_numel)
    rindex = r1_index + (r0_index*r1_numel)
    r0_0 = r0_index
    r1_1 = r1_index
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[15, 15], strides=[30, 1], block_shape=[R0_BLOCK, R1_BLOCK], order=[1, 0], offsets=[r0_offset, r1_offset]), boundary_check=[0, 1], padding_option='zero')[None, :, :]
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK, R1_BLOCK])
    tmp3 = tl.where(r0_mask & r1_mask, tmp1, 0)
    tmp4 = tl.reshape(tmp3, [XBLOCK, RBLOCK])
    tmp5 = tl.sum(tmp4, 1)[:, None, None]
    tl.store(out_ptr0 + (tl.full([XBLOCK, 1, 1], 0, tl.int32)), tmp5, None)
```

There are a few main differences between this kernel and what Inductor would generate without this PR.
 - Instead of an `r`/`RBLOCK` dimension, we have two reduction dimensions: `r0_`/`R0_BLOCK` and `r1_`/`R1_BLOCK`.
 - There are special size and indexing variables for reductions, which don't directly correspond to any kernel dimension. (`rindex`, `rnumel`, `RBLOCK`, and `roffset`.) These collapse N-D reduction sizes and indices into 1D. This simplifies the codegen for reductions, which sometimes want to access linear indices instead of N-dimensional ones. Doing things this way allows us to generate N-D loads and stores, but access this data as if it were 1D, minimizing the blast radius of this PR. Although this makes the code more verbose, it shouldn't have a perf impact because the Triton compiler eliminates dead code.
 - We generate the line `tmp4 = tl.reshape(tmp3, [XBLOCK, RBLOCK])` before performing the actual reduction. This reshapes N reduction dimensions into 1D. This allows us to reduce over all N dimensions at once, simplifying the codegen and allowing the Triton compiler to decide the order of processing under the hood.
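
The collapsed indexing can be checked in plain Python. This sketch mirrors the kernel's `rindex = r1_index + (r0_index*r1_numel)` and `rnumel = r0_numel * r1_numel` lines with scalar indices; the function name is hypothetical:

```python
def linear_rindex(r0_index: int, r1_index: int, r1_numel: int) -> int:
    """Collapse a 2D reduction index into the 1D index used by existing codegen."""
    return r1_index + r0_index * r1_numel


r0_numel, r1_numel = 15, 15
rnumel = r0_numel * r1_numel

# Walking the 2D index space in row-major order visits every 1D index exactly once.
flat = [
    linear_rindex(r0, r1, r1_numel)
    for r0 in range(r0_numel)
    for r1 in range(r1_numel)
]
assert flat == list(range(rnumel))
```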

Here's an example of a looped reduction:
```
@triton.jit
def triton_red_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, r1_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr, R1_BLOCK : tl.constexpr):
    xnumel = 3
    r0_numel = 43
    r1_numel = 129
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None]
    xmask = xindex < xnumel
    r0_base = tl.arange(0, R0_BLOCK)[None, :, None]
    r1_base = tl.arange(0, R1_BLOCK)[None, None, :]
    rnumel = r0_numel * r1_numel
    RBLOCK: tl.constexpr = R0_BLOCK*R1_BLOCK
    rbase = r1_base + (r0_base*r1_numel)
    x0 = xindex
    block_ptr0 = tl.make_block_ptr(in_ptr0, shape=[3, 43, 129], strides=[11094, 258, 1], block_shape=[XBLOCK, R0_BLOCK, R1_BLOCK], order=[2, 1, 0], offsets=[xoffset, 0, 0])
    _tmp2 = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], 0, tl.float32)
    for r0_offset in range(0, r0_numel, R0_BLOCK):
        r0_index = r0_offset + r0_base
        r0_mask = r0_index < r0_numel
        for r1_offset in range(0, r1_numel, R1_BLOCK):
            r1_index = r1_offset + r1_base
            r1_mask = r1_index < r1_numel
            roffset = r1_offset + (r0_offset*r1_numel)
            rindex = r1_index + (r0_index*r1_numel)
            r0_1 = r0_index
            r1_2 = r1_index
            tmp0 = tl.load(block_ptr0, boundary_check=[0, 1, 2], padding_option='zero', eviction_policy='evict_first')
            tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK, R1_BLOCK])
            tmp3 = _tmp2 + tmp1
            _tmp2 = tl.where(r0_mask & r1_mask & xmask, tmp3, _tmp2)
            block_ptr0 = tl.advance(block_ptr0, [0, 0, R1_BLOCK])
        block_ptr0 = tl.advance(block_ptr0, [0, R0_BLOCK, (-1)*R1_BLOCK*((128 + R1_BLOCK) // R1_BLOCK)])
    tmp4 = tl.reshape(_tmp2, [XBLOCK, RBLOCK])
    tmp2 = tl.sum(tmp4, 1)[:, None, None]
    tl.store(tl.make_block_ptr(out_ptr0, shape=[3], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.reshape(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
```

In addition to the aforementioned changes to the persistent reduction, multidimensional looped reductions have a few more lines of code:
 - They calculate indices inside the loop using `r0_base` and `r1_base`. For compatibility with existing codegen, these are collapsed to the 1D variant `rbase`.
 - Block pointer advancements are more nuanced for multidimensional loops. At the end of each loop body, we emit a `tl.advance` line which not only increments the pointer in its own dimension, but also undoes the cumulative increments of the previous loop level. This is equivalent to the usual practice in nested loops of starting with a fresh iteration variable at each level. Implementing this required refactoring the way we generate pointer advancements into a new `self.pointer_advancements` field of the kernel, which categorizes advancements by dimension.
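
The rewind term in the outer `tl.advance` can be sanity-checked without Triton. This is a plain-Python simulation of the pointer offsets only, using a hypothetical helper; the `(r1_numel - 1 + r1_block) // r1_block` count corresponds to `(128 + R1_BLOCK) // R1_BLOCK` in the kernel above, where `r1_numel = 129`:

```python
def simulate_advances(r0_numel: int, r1_numel: int, r0_block: int, r1_block: int):
    """Track the (r0, r1) offset of a block pointer through the nested loops."""
    r0_ptr, r1_ptr = 0, 0
    visited = []
    # Number of inner-loop iterations, i.e. ceil(r1_numel / r1_block).
    inner_iters = (r1_numel - 1 + r1_block) // r1_block
    for _r0_offset in range(0, r0_numel, r0_block):
        for _r1_offset in range(0, r1_numel, r1_block):
            visited.append((r0_ptr, r1_ptr))
            r1_ptr += r1_block  # inner advance: [0, 0, R1_BLOCK]
        # Outer advance: step r0 forward and undo all inner increments.
        r0_ptr += r0_block
        r1_ptr -= r1_block * inner_iters
    return visited
```

If the rewind is right, the visited offsets equal the loop variables `(r0_offset, r1_offset)` on every iteration, which is what starting with a fresh inner iteration variable would give.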

The biggest difficulty in implementing this feature was that we represented tiling with a tuple like `(5,2)`. In the existing codebase, the compiler can infer that the reduction dimension of `(5,2)` is `2`, since reductions are always the last dimension. This became cumbersome now that we have to support multiple reduction dimensions, so I refactored tiling into a dict like `{"x": 5, "r0_": 2, "r1_": 4}`. This required quite a few code changes, but I don't think it makes the underlying logic much more complex. This will also make it easier to eventually support simultaneous pointwise and reduction tiling, like `{"x": 5, "y": 5, "r0_": 2, "r1_": 4}`. (This is not supported today, but we might want to do it eventually.)
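
As a rough illustration of why the dict form is easier to work with, a sketch that separates pointwise from reduction dimensions by name. It assumes reduction dimensions are exactly the keys starting with `"r"`, as in the examples above; the helper is hypothetical and not part of Inductor:

```python
def split_tiling(tiling: dict[str, int]) -> tuple[dict[str, int], dict[str, int]]:
    """Split a tiling dict into (pointwise dims, reduction dims) by key prefix."""
    pointwise = {k: v for k, v in tiling.items() if not k.startswith("r")}
    reduction = {k: v for k, v in tiling.items() if k.startswith("r")}
    return pointwise, reduction


pointwise, reduction = split_tiling({"x": 5, "r0_": 2, "r1_": 4})
```

With a tuple like `(5, 2, 4)` the same split would need extra knowledge of how many trailing entries are reductions.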

The existing tiling algorithm generalized naturally to support reductions. For pointwise kernels, we tile the pointwise dimensions (`"x"`, `"y"`) as is. For reduction kernels, we never tile the `"x"` dimension, and only tile the reduction dimensions (`"r0_"`, `"r1_"`). Thus we only ever tile pointwise OR reduction dimensions, but not both. In principle it seems possible to support both, but it would likely require changes to the kernel fusion and autotuning logic. I thought it best to keep this PR as minimal as possible since it already touched a lot of different files.

Unfortunately, these changes weren't enough to get block pointers in some seemingly simple test cases. In some tests for `argmax` and `var_mean`, we already collapse reduction dimensions into 1D and generate modular indexing expressions, prior to tiling. So it's not trivial to figure out how to expand the collapsed reduction dimension back to a shape that would simplify the indexing.

To address these cases, this PR adds a new feature to the `config.prefer_nd_tiling` option, which analyzes reads and writes in the kernel, using the same mod-div pattern matching logic that generates block pointers later on. By matching this pattern, we can solve for the tiling splits which *would* simplify the indexing expression, and then use that tiling to eliminate the modular indexing and emit a block pointer. This tiling mode is still off by default, but it's important for certain applications where we need to get as many block pointers as possible.

# Test plan

This touches pretty much anything that uses the Triton and Halide backends, so the existing CI provides good coverage. However, 2D reductions are gated behind a few feature flags like `config.prefer_nd_tiling` and `config.triton.tile_reductions`, so this really only checks that the PR doesn't break 1D reductions.

In addition to existing CI tests, this PR also adds some new tests that specifically stress 2D reductions:

- `test_2d_reduction_odd_shapes`: test 2D reductions with a variety of ops and sizes. This covers the typical persistent and looped reductions.
- `test_2d_reduce_no_x_dim`: test 2D reductions with no x dimension.
- `test_2d_welford_reduction`: test 2D welford reductions with block pointers.
- `test_welford_non_block_pointer`: test a 2D welford reduction when block pointer analysis fails.
- `test_reduction_multiple_discontiguous_dims`: test reducing over more than one discontiguous dimension. We won't get a block pointer for this case, since that would require 3D tiling, but we're currently limited to 2D.
- `test_2d_reduction_multi_kernel`: test multi kernel autotuning on a 2D softmax kernel.
- `test_enable_tiled_reductions`: test that `config.triton.tile_reductions` enables/disables this feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137243
Approved by: https://github.com/jansel

Co-authored-by: Yueming Hao <yhao@meta.com>
Co-authored-by: Jason Ansel <jansel@meta.com>
2024-12-31 05:06:46 +00:00
f3e5078c27 [Inductor] Relax size constraints for re-inplacing (#143884)
Currently, reinplacing requires the input buffer and the output buffer to have exactly the same storage size. However, matmul padding may increase the tensor size slightly for better performance, which prevents reinplacing.

This PR changes the size constraints to be:
- input and output buffer have exactly the same symbolic expression for storage size (i.e., sympy str).
- it's statically known that 0.99 * input_size <= output_size <= input_size
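
A minimal sketch of the relaxed check with concrete integer sizes. It assumes the two conditions above are alternatives (either one is sufficient), which the description does not state explicitly; the function and parameter names are hypothetical:

```python
from typing import Optional


def sizes_allow_reinplace(
    input_size: int,
    output_size: int,
    input_expr: Optional[str] = None,
    output_expr: Optional[str] = None,
) -> bool:
    """Return True if the storage sizes are close enough to reuse the buffer."""
    # Condition 1: identical symbolic size expressions (compared as strings).
    if input_expr is not None and input_expr == output_expr:
        return True
    # Condition 2: the output is at most 1% smaller than the input, never larger.
    return 0.99 * input_size <= output_size <= input_size
```

For example, an input of 1000 elements can be reused for an output of 995, but not for one of 980 or 1001.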

### Apply on llm.c
See the reuse of `buf1`.
Before relaxing size requirements on re-inplacing: ([P1703512078](https://www.internalfb.com/phabricator/paste/view/P1703512078))
![1](https://github.com/user-attachments/assets/1472f550-6eb8-4d5c-9965-49bbb20d81a9)

After relaxing size requirements on re-inplacing: ([P1703513053](https://www.internalfb.com/phabricator/paste/view/P1703513053))
![2](https://github.com/user-attachments/assets/416294dd-30eb-4e12-a36c-1aebf9af530b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143884
Approved by: https://github.com/eellison
2024-12-31 03:52:47 +00:00
cyy
8df99b6a6e Remove unneeded std::make_optional (#143575)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143575
Approved by: https://github.com/Skylion007
2024-12-31 03:08:47 +00:00
11bb94b7ea [MPSInductor] Fix index generation for transpose (#143973)
Alas, PythonPrinter would not work here, nor would CppPrinter, so start building MetalPrinter.

`pytest test/inductor/test_torchinductor.py -k _mps` score is 474 failed, 277 passed, 32 skipped
Before this change:
`pytest test/inductor/test_torchinductor.py -k _mps` reported 506 failed, 245 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143973
Approved by: https://github.com/jansel
ghstack dependencies: #143948, #143949
2024-12-31 02:04:50 +00:00
cb24013b5b Fix assertion failure in pytorch profiler (#143940)
Summary:
Attempt to fix the following exception, which occurred when profiling a PyTorch model (a Meta-internal LLM) that also involved a ThreadPoolExecutor in the background:
```
Exception Found: !stack.empty() INTERNAL ASSERT FAILED at "fbcode/caffe2/torch/csrc/autograd/profiler_python.cpp":987, please report a bug to PyTorch. Python replay stack is empty.
```
The root cause of this issue seems to be that a thread call stack can be empty, which is asserted to not be empty.

I fixed this with some minimal changes to profiler_python.cpp

Approach:
 * Ensuring that the stack in question is not empty before trying to pop from it.
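
The actual change is in C++ (`profiler_python.cpp`); the guard can be sketched in Python as follows. The event format and function name are hypothetical and only illustrate the "check before pop" pattern:

```python
def replay(events):
    """Replay call/return events, tolerating returns on an empty stack.

    `events` is a list of (kind, name) tuples with kind "call" or "return".
    """
    stack = []
    unmatched_returns = 0
    for kind, name in events:
        if kind == "call":
            stack.append(name)
        elif kind == "return":
            if stack:  # guard: only pop when there is a frame to pop
                stack.pop()
            else:
                unmatched_returns += 1
    return stack, unmatched_returns
```

A thread whose first recorded event is a return no longer triggers a failure; the unmatched return is simply counted and skipped.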

Test Plan:
 * Tested manually on a reproducible scenario where the assertion failure was otherwise triggered (repro too large to include here). The assertion failure disappears.
 * CI

Differential Revision: D67691558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143940
Approved by: https://github.com/Skylion007, https://github.com/sraikund16
2024-12-31 01:43:04 +00:00
cyy
af629a8146 Enable readability-redundant-declaration (#143982)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143982
Approved by: https://github.com/Skylion007
2024-12-31 00:20:10 +00:00
934eaa503f [Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266)
This PR aims to add max-autotune support for XPU. The current Triton templates and configurations are not well optimized for XPU, so the performance is not ready yet. Also, the `mm_plus_mm` template has accuracy issues in some cases. We will address these issues in the next PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143266
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-12-30 23:51:17 +00:00
d9a6ffb63c [FSDP] Add workaround to fix buffer_dtype without root parameters (#143989)
Fixes https://github.com/pytorch/pytorch/issues/143900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143989
Approved by: https://github.com/H-Huang
2024-12-30 23:42:24 +00:00
2da7fb5320 [inductor] Make generated kernels deterministic (#143951)
`"compile_id"` had slipped into our generated Triton code (in the
metadata), which will defeat caching because the same kernels generated
in a different order would not cache hit with each other.
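
A toy illustration of the problem, assuming a cache keyed on a hash of the generated text. This is not Inductor's real caching code, and filtering the field at hashing time is only for demonstration (the PR removes it from the generated code instead); all names are hypothetical:

```python
import hashlib
import json


def naive_key(source: str, metadata: dict) -> str:
    """Hash the kernel source together with all of its metadata."""
    payload = source + json.dumps(metadata, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def stable_key(source: str, metadata: dict) -> str:
    """Same hash, but ignoring the order-dependent compile_id field."""
    stable = {k: v for k, v in metadata.items() if k != "compile_id"}
    return naive_key(source, stable)


src = "def kernel(): pass"
first = {"device": "cuda", "compile_id": "0/0"}
second = {"device": "cuda", "compile_id": "3/1"}
```

The same kernel compiled in a different order gets two different `naive_key` values but a single `stable_key`.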

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143951
Approved by: https://github.com/oulgen
2024-12-30 23:35:11 +00:00
d88a8c41d5 Fix flaky "Upload test stats" job (#143991)
Test stat uploading was intermittently failing due to certain XML strings being opportunistically converted to numbers, when string output was expected. This PR makes the conversion behavior optional, which should fix the stat uploads.
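
A minimal sketch of making the conversion opt-in. The function name and flag are hypothetical and do not reflect the actual upload script:

```python
def parse_attr(value: str, convert_numbers: bool = False):
    """Return an XML attribute value, converting to a number only on request."""
    if not convert_numbers:
        return value
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value
```

With the default, a value such as `"123"` stays a string, so fields that are expected to be strings keep a stable type.
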
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143991
Approved by: https://github.com/clee2000, https://github.com/huydhn
2024-12-30 21:40:01 +00:00
d260bc4476 cpp_wrapper: minimize pybind11 dependency (#143772)
Only include the parts of `pybind11` that handle GIL management within `cpp_wrapper`. This dramatically improves compilation times by reducing the number of headers we compile. Improvements on my local system are on the order of 2x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143772
Approved by: https://github.com/Skylion007
2024-12-30 20:41:02 +00:00
baee623691 [BE][Ez]: Update fmtlib submodule to 1.11.1 (#143937)
* Exactly the same as the previous fmtlib except it fixes an edge case that could affect ABI compatibility between fmtlib versions.
* Seems safe to update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143937
Approved by: https://github.com/albanD
2024-12-30 19:46:27 +00:00
9d026000de change import relative paths due to internal build failures (#143968)
Internal builds failing due to #143355, changing imports to be relative, similar to other imports

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143968
Approved by: https://github.com/albanD
2024-12-30 17:19:49 +00:00
c27c788e35 [MPS] Fix torch.add(x,y, alpha=2) crash (#143949)
TODO: as followup PR replace this weird logic with shaders

Fixes https://github.com/pytorch/pytorch/issues/143932

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143949
Approved by: https://github.com/Skylion007
ghstack dependencies: #143948
2024-12-30 17:16:29 +00:00
beb6c2dea5 [MPS] Fix crash when mm is invoked with mixed dtypes (#143948)
Simply by copy-n-pasting check from
a7915c56f6/aten/src/ATen/native/cuda/Blas.cpp (L254-L257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143948
Approved by: https://github.com/Skylion007
2024-12-30 17:13:34 +00:00
7c1c0730be [Inductor UT] Generalize newly introduced device-bias hard code in (#143975)
test_pattern_matcher.py
Fix #143974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143975
Approved by: https://github.com/malfet
2024-12-30 16:47:19 +00:00
cyy
dca443835e Enable more readability-redundant checks (#143963)
They help simplify code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143963
Approved by: https://github.com/albanD
2024-12-30 14:49:33 +00:00
438698b20b [CD] Remove redundant triton dependency for xpu wheels (#143839)
Since https://github.com/pytorch/pytorch/pull/141135 enabled PyPI dependencies for the XPU CD wheels, `PYTORCH_EXTRA_INSTALL_REQUIREMENTS` already has a value for the XPU CD wheel build, which makes the separate triton dependency redundant.
Works for https://github.com/pytorch/pytorch/issues/139722 and https://github.com/pytorch/pytorch/issues/114850
Fixes #143838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143839
Approved by: https://github.com/huydhn
2024-12-30 13:39:06 +00:00
2fa09853cb Update slow tests (#143745)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143745
Approved by: https://github.com/pytorchbot
2024-12-30 11:51:49 +00:00
2ed4d65af0 Update torch-xpu-ops commit pin (#143853)
Update the torch-xpu-ops commit to [214f33](214f33b9d9), which includes:

- Fix building issue for transformer related operators
- Improve XPU operator coverage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143853
Approved by: https://github.com/EikanWang
2024-12-30 02:38:16 +00:00
1b0d19a2cb Revert "[inductor] Make generated kernels deterministic (#143951)"
This reverts commit 79b354ee37b7d8a06a48ca8cc4e19a3fd006b433.

Reverted https://github.com/pytorch/pytorch/pull/143951 on behalf of https://github.com/wdvr due to failing tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/143951#issuecomment-2564952267))
2024-12-30 02:06:38 +00:00
cf89127137 [Torch.package] Add support for UntypedStorage tensors (#143930)
Summary: fp8 uses untyped storage. Add support for torch.package by using the same logic as in serialization.py

Differential Revision: D67684033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143930
Approved by: https://github.com/henrylhtsang
2024-12-30 02:03:52 +00:00
92d8965082 Adding support for differentiable lr, weight_decay, and betas in Adam/AdamW (#143726)
Third PR in a series of PRs to broaden differentiable optimizer support w/ @janeyx99 (sorry for pinging over the holidays! I just wanted to put this one out but I am definitely not asking for review or anything like that rn)

This is also going to probably be my last PR before the holidays!

Note: This is a branch of #143710 -- I've never worked on a branch of a branch before, so I wasn't sure about the protocol. I thought I'd just make the PR and wait until that one gets merged.

This is adding support for differentiable lr, weight_decay, and betas to Adam and AdamW (but after refactoring AdamW into an Adam subclass, it's really just changing code in torch/optim/adam.py)

I had one main thing I was wondering about, which is that adam already has a differentiable flag built in, so I have code like this
```py
if differentiable and isinstance(beta2, Tensor):
    if beta2.requires_grad:
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2))
    else:
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
else:
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```
That I could definitely simplify to just
```py
if differentiable and isinstance(beta2, Tensor):
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2))
else:
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```

It would definitely be a little slower in the case that it's differentiable but doesn't need a grad for beta2, but the code would also be a lot more clear and I'm debating speed vs future code usability.

Also the line in the above example:
```py
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2))
```
was concerning to me because it is considerably more expensive than `value=1 - beta2`, but I couldn't think of a better way to do it.
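
To make the trade-off concrete, here is a minimal pure-Python sketch in which scalars stand in for tensors. The names `update_value_kw` and `update_scaled_grad` are hypothetical and are not functions in torch/optim/adam.py. Both forms compute the same second-moment update, `beta2 * exp_avg_sq + (1 - beta2) * grad * conj(grad)`; they only differ in where the `(1 - beta2)` factor is applied.

```python
def update_value_kw(exp_avg_sq, grad, beta2):
    # Mirrors: exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
    return exp_avg_sq * beta2 + (1 - beta2) * (grad * grad.conjugate())


def update_scaled_grad(exp_avg_sq, grad, beta2):
    # Mirrors: exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2))
    return exp_avg_sq * beta2 + grad * (grad.conjugate() * (1 - beta2))


# A complex gradient exercises the conj() part of the update.
exp_avg_sq, grad, beta2 = 0.5, complex(3.0, -4.0), 0.999
a = update_value_kw(exp_avg_sq, grad, beta2)
b = update_scaled_grad(exp_avg_sq, grad, beta2)
```

In the tensor version, the second form materializes `grad.conj().mul(1 - beta2)` as an extra elementwise op, which is the cost discussed above, but it keeps `beta2` inside the expression as a tensor rather than passing it through the `value=` argument.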

Further work on #141832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143726
Approved by: https://github.com/janeyx99
2024-12-30 01:11:57 +00:00
a7915c56f6 Propagate callable parameter types using ParamSpec (#142306) (#143797)
The codebase has a few locations where callable parameter type information is lost when the unpackings *args and **kwargs are typed as Any. Refactor these instances to retain type information using typing_extensions.ParamSpec.

Also, in these functions, enforce return type with TypeVar.
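
As an illustration of the pattern (not code taken from this PR), a decorator typed with `ParamSpec` and `TypeVar` keeps the wrapped function's signature visible to type checkers. The names `passthrough` and `scale` are hypothetical, and the sketch falls back to `typing_extensions` only when the standard-library `typing.ParamSpec` is unavailable.

```python
import functools
from typing import Callable, TypeVar

try:
    from typing import ParamSpec  # Python 3.10+
except ImportError:
    from typing_extensions import ParamSpec

P = ParamSpec("P")  # captures the parameter types of the wrapped callable
R = TypeVar("R")    # captures its return type


def passthrough(fn: Callable[P, R]) -> Callable[P, R]:
    @functools.wraps(fn)
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        # With `*args: Any, **kwargs: Any` the checker would lose the signature.
        return fn(*args, **kwargs)

    return wrapper


@passthrough
def scale(x: float, factor: float = 2.0) -> float:
    return x * factor
```

With this typing, a call such as `scale("a", factor=None)` should be flagged by a type checker, whereas an `Any`-typed wrapper would accept it silently.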

Addresses #142306

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143797
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
2024-12-29 23:03:14 +00:00
79b354ee37 [inductor] Make generated kernels deterministic (#143951)
`"compile_id"` had slipped into our generated Triton code (in the
metadata), which will defeat caching because the same kernels generated
in a different order would not cache hit with each other.
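
A rough sketch of why an order-dependent field defeats caching. The key scheme, the metadata fields, and the `compile_id` values below are hypothetical and are not Inductor's actual cache implementation.

```python
import hashlib

KERNEL_SRC = "def kernel(x): return x + 1"


def cache_key(source: str, metadata: dict) -> str:
    # Hash the source together with its metadata (sorted for a stable order).
    payload = source + repr(sorted(metadata.items()))
    return hashlib.sha256(payload.encode()).hexdigest()


# The same kernel, generated first in one run and third in another.
with_id_a = cache_key(KERNEL_SRC, {"device": 0, "compile_id": "0/0"})
with_id_b = cache_key(KERNEL_SRC, {"device": 0, "compile_id": "2/0"})

# Without the order-dependent field, the keys agree.
without_id_a = cache_key(KERNEL_SRC, {"device": 0})
without_id_b = cache_key(KERNEL_SRC, {"device": 0})
```

Any content-addressed cache behaves this way: identical kernels only hit when everything that feeds the key is independent of compilation order.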

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143951
Approved by: https://github.com/oulgen
2024-12-29 19:53:33 +00:00
b6bdb67f82 [BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)
Changes by apply order:

1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.

    `.parent{...}.absolute()` -> `.absolute().parent{...}`

4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)

    `.parent.parent.parent.parent` -> `.parents[3]`

5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~

    ~`.parents[3]` -> `.parents[4 - 1]`~

6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~
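
The equivalences behind steps 2 and 4 can be checked with a small sketch. The path is made up for illustration, and `PurePosixPath` / `posixpath` are used so the result does not depend on the host OS.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical path, used only for illustration.
path = PurePosixPath("/repo/pkg/sub/module/file.py")

# Step 2: nested dirname calls correspond to chained .parent accesses.
nested = posixpath.dirname(posixpath.dirname(str(path)))
chained = str(path.parent.parent)

# Step 4: N chained .parent accesses equal .parents[N - 1].
four_up = path.parent.parent.parent.parent
indexed = path.parents[3]
```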

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-12-29 17:23:13 +00:00
7101b8ca35 remove allow-untyped-defs from onnx/_internal/_lazy_import.py (#143943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143943
Approved by: https://github.com/justinchuby
2024-12-29 10:29:43 +00:00
cf0b72c4ab remove allow-untyped-defs from _inductor/compile_worker/watchdog.py (#143941)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143941
Approved by: https://github.com/Skylion007
2024-12-29 01:05:09 +00:00
3ba6fcd3ff remove allow-untyped-defs from torch/_size_docs.py (#143942)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143942
Approved by: https://github.com/Skylion007
2024-12-29 01:00:46 +00:00
85f348578b [Codemod][AddExplicitStrictExportArg] caffe2/test/inductor (#143929)
Differential Revision: D67682313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143929
Approved by: https://github.com/hl475
2024-12-28 23:39:21 +00:00
e1abbe155e remove allow-untyped-defs from ao/nn/qat/dynamic/modules/linear.py (#143919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143919
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-12-28 20:50:48 +00:00
3054aae493 [MPS] Fix fmin/fmax for scalar argument (#143934)
CPU scalar promotion to GPU is allowed for CUDA and should be allowed for MPS as well (at the very least it should not crash)

Fixes https://github.com/pytorch/pytorch/issues/143933 https://github.com/pytorch/pytorch/issues/142203
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143934
Approved by: https://github.com/Skylion007
2024-12-28 17:07:19 +00:00
45a709d9ec Revert "Add torch._scaled_mm for CPU (#139975)"
This reverts commit cbc4cf3043a7316c1f6e86b1e22d96042a59489c.

Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/malfet due to It broke the same test, but on ROCM this time, though it was classified as flaky for some reason, see d8c3900d80/1 ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2564378146))
2024-12-28 16:49:38 +00:00
8cccc46e33 Revert "Add AOT inductor support for _scaled_mm for CPU (#141961)"
This reverts commit 3fabd10c40c632104e420ae8e3721f33176e8640.

Reverted https://github.com/pytorch/pytorch/pull/141961 on behalf of https://github.com/malfet due to It broke the same test, but on ROCM this time, though it was classified as flaky for some reason, see d8c3900d80/1 ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2564378146))
2024-12-28 16:49:38 +00:00
d8c3900d80 [Inductor] Implement primitive Metal compiler (#143893)
Still a work in progress; it only works for elementwise operations. The current implementation can be used to turn something like
```python
def f(x):
  return x[:,::2].sin() + x[:, 1::2].cos()
```
into the following shader
```python
# Topologically Sorted Source Nodes: [sin, cos, add], Original ATen: [aten.sin, aten.cos, aten.add]
# Source node to ATen node mapping:
#   add => add
#   cos => cos
#   sin => sin
# Graph fragment:
#   %sin : [num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%slice_2,), kwargs = {})
#   %cos : [num_users=1] = call_function[target=torch.ops.aten.cos.default](args = (%slice_4,), kwargs = {})
#   %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%sin, %cos), kwargs = {})
mps_lib = torch.mps._compile_shader("""
    kernel void kernel_0(
        device float* out_ptr0,
        constant float* in_ptr0,
        uint xindex [[thread_position_in_grid]]
    ) {
        int x0 = xindex;
        auto tmp0 = in_ptr0[2*x0];
        auto tmp1 = metal::precise::sin(tmp0);
        auto tmp2 = in_ptr0[2*x0 + 1];
        auto tmp3 = metal::precise::cos(tmp2);
        auto tmp4 = tmp1 + tmp3;
        out_ptr0[x0] = static_cast<float>(tmp4);
    }
""")
```
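
For readers less familiar with Metal, the index arithmetic of the generated kernel can be mirrored in plain Python. This is only a reference sketch, assuming a contiguous row-major input with an even number of columns; `reference_kernel` is a hypothetical name and not part of the PR.

```python
import math


def reference_kernel(in_ptr0, rows, cols):
    # in_ptr0: flat row-major buffer of shape (rows, cols); cols assumed even.
    out = [0.0] * (rows * cols // 2)
    for x0 in range(len(out)):      # one loop iteration per GPU thread
        tmp0 = in_ptr0[2 * x0]      # element of x[:, ::2]
        tmp2 = in_ptr0[2 * x0 + 1]  # element of x[:, 1::2]
        out[x0] = math.sin(tmp0) + math.cos(tmp2)
    return out


# x = [[0, 0], [pi/2, pi]] flattened row by row
result = reference_kernel([0.0, 0.0, math.pi / 2, math.pi], rows=2, cols=2)
```

Because the input is contiguous and `cols` is even, output element `x0` always reads the adjacent pair `2*x0` and `2*x0 + 1`, which is why the shader needs no separate row and column indices.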

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143893
Approved by: https://github.com/jansel
ghstack dependencies: #143891, #143892
2024-12-28 06:58:32 +00:00
74028cfd0c [Inductor][CPP] Fix Data Type issue of frexp (#143746)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/143729. `frexp` has 1 input but 2 output tensors with different data types, and the current `deduce_dtype_for_cpp_cse_variable` can't deduce the data type of each output correctly because the output index is missing. In this PR, we set the data type of the CSE variable in the codegen of `frexp` and prevent it from being overridden later in the flow.
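
As a reminder of why one dtype cannot describe both outputs, Python's standard-library `math.frexp` has the same shape: a floating-point mantissa and an integer exponent. This sketch only illustrates the operation itself, not the Inductor codegen.

```python
import math

# frexp decomposes value into mantissa * 2 ** exponent, with 0.5 <= |mantissa| < 1
mantissa, exponent = math.frexp(8.0)

# ldexp is the inverse operation and recombines the two parts.
roundtrip = math.ldexp(mantissa, exponent)
```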

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_frexp
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143746
Approved by: https://github.com/jgong5
2024-12-28 06:00:13 +00:00
01980cac38 [dynamo] Make ConstDictKeySource a subclass of ChainedSource (#143924)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143924
Approved by: https://github.com/jansel
2024-12-28 05:59:45 +00:00
3fabd10c40 Add AOT inductor support for _scaled_mm for CPU (#141961)
This PR is to add AOT inductor support for _scaled_mm for CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141961
Approved by: https://github.com/malfet
ghstack dependencies: #139975
2024-12-28 05:57:35 +00:00
cbc4cf3043 Add torch._scaled_mm for CPU (#139975)
This PR is to add `torch._scaled_mm` for CPU backend.

`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various UTs related to FP8 to support CPU tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2024-12-28 05:49:06 +00:00
d3e9133ab2 Fix separate in process bisector cache, cleanup on exit (#143661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143661
Approved by: https://github.com/ezyang
ghstack dependencies: #143657
2024-12-28 03:20:37 +00:00
1e246ef05b [CUDA][CUDA graphs][RNG] Skip replay prologue if wholegraph_increment is 0 (#143777)
#143572

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143777
Approved by: https://github.com/ngimel, https://github.com/eellison
2024-12-28 02:31:26 +00:00
4a7cf0dbff [Inductor] Add MPS device op overrides (#143892)
Mostly a dummy interface, as the MPS backend is currently limited to a single device

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143892
Approved by: https://github.com/jansel
ghstack dependencies: #143891
2024-12-28 02:11:45 +00:00
ad78edee8e Add support for list, tuple and dict in numeric debugger (#143882)
Summary:
Previously, the numeric debugger only supported torch.Tensor. This PR adds support for list, tuple and dict as well.

Test Plan:
python test/test_quantization.py -k test_extract_results_from_loggers_list_output


Differential Revision: [D67660049](https://our.internmc.facebook.com/intern/diff/D67660049)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143882
Approved by: https://github.com/dulinriley
2024-12-28 02:10:31 +00:00
c3c27aef34 [dynamo] Remove HFPretrained config hack (#143698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143698
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #143888
2024-12-28 02:03:13 +00:00
7c343a9d68 Fix emulate low precision bool inp (#143657)
Fix for https://github.com/pytorch/pytorch/issues/143502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143657
Approved by: https://github.com/BoyuanFeng
2024-12-28 01:51:28 +00:00
88ccf2fa5e remove allow-untyped-defs from distributed/elastic/multiprocessing/subprocess_handler/handlers.py (#143917)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143917
Approved by: https://github.com/Skylion007
2024-12-28 00:13:05 +00:00
e3fefdfbf0 [CUTLASS] fix addmm (#143537)
We would get a CUDA IMA (illegal memory access) before because we passed Bias in for X. So, we need to re-order the inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143537
Approved by: https://github.com/chenyang78
ghstack dependencies: #143528
2024-12-27 23:47:55 +00:00
b54620f40f [CUTLASS] fix bugs: extra data_ptr() call, wrong size symbol name, bias symbol not added (#143528)
A few small things in this PR:
- fixed a bug where `workspace.data_ptr().data_ptr()` showed up
- for SM80 CUTLASS kernels, the symbol size for W.size(1) was never created
- for addmm kernels, the ldc bias symbol never showed up

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143528
Approved by: https://github.com/henrylhtsang
2024-12-27 23:38:18 +00:00
c17d767686 remove allow-untyped-defs from _inductor/codegen/rocm/rocm_template_buffer.py (#143870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143870
Approved by: https://github.com/aorenste, https://github.com/Skylion007
2024-12-27 23:28:51 +00:00
63d6e1f743 remove allow-untyped-defs from _inductor/codegen/aoti_hipify_utils.py (#143916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143916
Approved by: https://github.com/Skylion007
2024-12-27 23:25:37 +00:00
928e01545c restore 'unused' variable to fix test_cuda_device_memory_allocated (#143885)
This PR fixes `test_cuda_multigpu.py::TestCudaMultiGPU::test_cuda_device_memory_allocated`
by restoring a deleted 'unused' variable from commit d8c8ba2440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143885
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-12-27 23:18:13 +00:00
0de661dc27 Add support for differentiable weight decay (#143679)
(Actual) second PR in a larger project to broaden support for differentiable optimizers with @janeyx99!

In this PR, I did a lot of pattern matching from the previous PR to add support for differentiable weight_decay.

And also added a single new line on line 359 (previously line 352) to make the code from the last PR a little easier to read

Continuation of progress on #141832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143679
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-12-27 23:14:43 +00:00
c0c7f881da remove allow-untyped-defs from distributed/pipelining/_unflatten.py (#143915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143915
Approved by: https://github.com/aorenste, https://github.com/Skylion007, https://github.com/malfet
2024-12-27 22:21:28 +00:00
af823bd526 remove allow-untyped-defs from utils/tensorboard/_convert_np.py (#143918)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143918
Approved by: https://github.com/Skylion007
2024-12-27 22:19:33 +00:00
fe398de769 [EZ] Update sympy to 1.13.3 (#143908)
And remove python>=3.9 check as it currently covers all supported python versions

Fixes https://github.com/pytorch/pytorch/issues/143907

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143908
Approved by: https://github.com/Skylion007, https://github.com/huydhn
2024-12-27 21:32:55 +00:00
b5042cfa58 Revert "remove allow-untyped-defs from torch/ao/__init__.py (#143604)"
This reverts commit 1598d458797e69376a9a148bd37fb6e8167d22e3.

Reverted https://github.com/pytorch/pytorch/pull/143604 on behalf of https://github.com/wdvr due to failing typing checks in torchao ([comment](https://github.com/pytorch/pytorch/pull/143604#issuecomment-2564043233))
2024-12-27 21:30:02 +00:00
7a13bfa1ad [EZ] Update jinja2 to 3.1.5 (#143923)
To make Dependabot happy about https://cwe.mitre.org/data/definitions/150.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143923
Approved by: https://github.com/Skylion007
2024-12-27 21:10:21 +00:00
228b228449 Fix batch-specific attention mod for NJT + Flex (#143866)
Fixes #143788
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143866
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
2024-12-27 20:51:41 +00:00
1e65dec2b9 [Dynamo] Add MPSDevice interface (#143891)
That simply checks if device is available and whether or not it supports bf16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143891
Approved by: https://github.com/jansel
2024-12-27 20:31:44 +00:00
d2f769476f [Easy] add quotes to shell activation commands (#143902)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143902
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-12-27 19:17:46 +00:00
a87cd5283b [dynamo] Trace through overridden __getattribute__ method (#143888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143888
Approved by: https://github.com/jansel
2024-12-27 18:10:00 +00:00
fda9048ca8 remove allow-untyped-defs from distributed/elastic/multiprocessing/errors/handlers.py (#143869)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143869
Approved by: https://github.com/Skylion007
2024-12-27 15:49:19 +00:00
a20765a9c1 subgraph rewriter supports matched pattern with no users (#143842)
Fixes #143841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143842
Approved by: https://github.com/yushangdi
2024-12-27 12:45:39 +00:00
9e8d84f863 Fix duplicate pattern error (#139321)
vllm hit an error because we incorrectly reported two patterns as duplicates. See the inline comment:

For a particular generated pattern repr, store all the equivalent graphs that were used to generate it. Because we ignore certain patterns in searching, but not in matching, use the graph to distinguish whether two equivalent searches are actually different.
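
One way to picture the bookkeeping described above is the sketch below. `PatternRegistry`, `register`, and the string stand-ins for the repr and the graph are all hypothetical; the real pattern matcher works on FX graphs and has a different interface.

```python
from collections import defaultdict


class PatternRegistry:
    """Track, per pattern repr, every distinct graph that produced it."""

    def __init__(self):
        self._graphs_by_repr = defaultdict(list)

    def register(self, pattern_repr: str, graph_str: str) -> bool:
        # Return True if this (repr, graph) pair is new, False for a true duplicate.
        graphs = self._graphs_by_repr[pattern_repr]
        if graph_str in graphs:
            return False
        graphs.append(graph_str)
        return True


registry = PatternRegistry()
first = registry.register("mm(add(x, y))", "graph_a")
same_repr_other_graph = registry.register("mm(add(x, y))", "graph_b")
true_duplicate = registry.register("mm(add(x, y))", "graph_a")
```

Two entries that share a repr are only treated as duplicates when their graphs also match, so patterns that merely look alike during search are not rejected.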

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139321
Approved by: https://github.com/shunting314
2024-12-27 11:10:46 +00:00
3571476739 Revert "fix randint distribution for large max (#143787)"
This reverts commit 8059d56ec364feb554f3fb90012a0fc2d2104e7f.

Reverted https://github.com/pytorch/pytorch/pull/143787 on behalf of https://github.com/wdvr due to failing internal tests, to be fixed first ([comment](https://github.com/pytorch/pytorch/pull/143787#issuecomment-2563493323))
2024-12-27 09:16:36 +00:00
f6801ba4b3 Revert "Use random64 in Fischer-Yates algorithm for large N (#143682)"
This reverts commit 7013be0094e8d3ded2ba2f948082f98d63e622bb.

Reverted https://github.com/pytorch/pytorch/pull/143682 on behalf of https://github.com/wdvr due to failing Meta internal tests that need to be updated ([comment](https://github.com/pytorch/pytorch/pull/143682#issuecomment-2563487675))
2024-12-27 09:09:33 +00:00
ba5cacbc17 [Codemod][AddExplicitStrictExportArg] caffe2/test (#143688)
Reviewed By: avikchaudhuri

Differential Revision: D67530154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143688
Approved by: https://github.com/tugsbayasgalan
2024-12-27 07:58:44 +00:00
969415885d [inductor][invoke_subgraph] Support None/int as input/output of invoke_subgraph (#139373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139373
Approved by: https://github.com/eellison
2024-12-27 06:46:09 +00:00
cyy
379bbef23c Enable more C++ warnings (#143355)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143355
Approved by: https://github.com/albanD
2024-12-27 05:46:57 +00:00
fca457b5db Revert "Add torch._scaled_mm for CPU (#139975)"
This reverts commit 3f80632c802f1d9fafd0c303d45ba2376b9c1e11.

Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing some tests in trunk ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2563331259))
2024-12-27 05:25:17 +00:00
0f474a960b [dynamo] Remove dead code after introducing UserDefinedDictVariable (#143699)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143699
Approved by: https://github.com/williamwen42, https://github.com/yanboliang, https://github.com/jansel
ghstack dependencies: #143722
2024-12-27 04:51:35 +00:00
e296bab614 [dynamo] Remove DICT_SUBCLASS_GUARD_MANAGER and use dict.keys (#143722)
In hindsight, we never needed a DICT_SUBCLASS_GUARD_MANAGER, because Dynamo would inline through the overridden `keys` method. In this PR, we ensure that while creating guards and constructing variable trackers, we get the `d.keys()` value by using `dict.keys(d)`. This ensures that we do not call the overridden `keys` method. Therefore, the C++ guard can use `PyDict_Next` directly to check the guards.
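
The difference between `d.keys()` and `dict.keys(d)` is plain Python behavior and can be shown without Dynamo. `NoisyDict` is a hypothetical subclass used only for this illustration.

```python
class NoisyDict(dict):
    calls = 0

    def keys(self):
        # User override: has a side effect and returns something else entirely.
        type(self).calls += 1
        return ["overridden"]


d = NoisyDict(a=1, b=2)

via_method = list(d.keys())    # dispatches to the override
via_base = list(dict.keys(d))  # bypasses the override, reads the real keys
```

Calling the base-class method explicitly reads the underlying storage, which is the same data that `PyDict_Next` iterates over on the C++ side.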

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143722
Approved by: https://github.com/jansel
2024-12-27 04:51:35 +00:00
d60282c177 remove allow-untyped-defs from _inductor/codegen/cpu_device_op_overrides.py (#143881)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143881
Approved by: https://github.com/aorenste
2024-12-27 04:10:47 +00:00
43853691bc [Quantization] add an option keep_original_weights in _lower_to_native_backend (#141049)
Differential Revision: D66153809

This diff adds a `keep_original_weights` option so that we can trace back the original weight and bias after running `prepare_fx` and `convert_fx`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141049
Approved by: https://github.com/jerryzh168
2024-12-27 04:02:07 +00:00
809106a93f [fr][c10d] fix flaky test (#143878)
Summary:
Test erroneously assumed that input/output sizes are same and that all
states are matchable.

Fixes issue #143798

Test Plan:
Test passes

Reviewers
Test passes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143878
Approved by: https://github.com/fduwjj
ghstack dependencies: #143865
2024-12-27 03:13:15 +00:00
1cd70e7e23 [fr][c10d] log trace capture enabled or not in flight recorder (#143865)
Summary:
Refactor logging for flight recorder so we can log if the capture was
with or without stack trace capture enabled.
We introduce a new column ('trace_enabled') in the logger.

Test Plan:
Tested on local job and noted that correct output was produced.
Internal link: https://fburl.com/scuba/c10d_flight_recorder/ulhqnmhg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143865
Approved by: https://github.com/fduwjj
2024-12-27 03:07:55 +00:00
6bdf2addc5 [inductor] Simplify get_launch_args_* handling (#143835)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143835
Approved by: https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #143813, #143814, #143815, #143817, #143818
2024-12-27 02:02:11 +00:00
138efb3002 [inductor] Move GPUTarget backwards compat to triton_compat.py (#143818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143818
Approved by: https://github.com/eellison
ghstack dependencies: #143813, #143814, #143815, #143817
2024-12-27 02:02:11 +00:00
be1936804b [inductor] Drop support for pre-ASTSource Triton (#143817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143817
Approved by: https://github.com/eellison
ghstack dependencies: #143813, #143814, #143815
2024-12-27 02:02:11 +00:00
f3d0f67039 [inductor] Minor refactor of hip compile_meta (#143815)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143815
Approved by: https://github.com/eellison
ghstack dependencies: #143813, #143814
2024-12-27 02:02:11 +00:00
29841b9414 remove allow-untyped-defs from torch/distributed/pipelining/_debug.py (#143871)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143871
Approved by: https://github.com/Skylion007
2024-12-27 01:20:26 +00:00
373dba35f9 remove allow-untyped-defs from fx/experimental/refinement_types.py (#143868)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143868
Approved by: https://github.com/Skylion007
2024-12-27 01:00:45 +00:00
c4bff71854 [Easy] Add ROCm support to nightly pull tool (#141282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141282
Approved by: https://github.com/malfet
ghstack dependencies: #143263
2024-12-27 00:07:38 +00:00
8059d56ec3 fix randint distribution for large max (#143787)
Similar to #143682, for large maximum values we were sampling integers via `%`, which does not give a uniform distribution. Here we limit the max skew to approx 1% (random32 is used for max values `<= 2**32 / 128`).
This comes with a significant perf penalty, especially for CUDA, but it's a pretty bad bug, so we'll have to figure out what can be done to improve it.
`torch.compile` has always been producing correct results for this, and its performance is also significantly better than current eager (eager is ~660 GB/s on H100, torch.compile 1200 GB/s), so we have to figure out why torch.compile is better.
`__launch_bounds__` slightly regresses perf, so perhaps we can figure out how to specify them better, but it's only 20-30 GB/s, so the big difference is still unexplained.
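
The skew bound follows from the usual modulo-bias argument, sketched here in pure Python. This is not the PR's kernel code; `modulo_counts` and `max_relative_skew` are hypothetical helpers. When `2**bits` source values are reduced modulo `n`, some outputs receive `floor(2**bits / n) + 1` source values and the rest receive `floor(2**bits / n)`.

```python
from collections import Counter


def modulo_counts(source_bits: int, n: int) -> Counter:
    # How many source values map to each output when reducing with %.
    return Counter(v % n for v in range(2 ** source_bits))


def max_relative_skew(source_bits: int, n: int) -> float:
    # Relative excess of the most likely output over the least likely one.
    low, rem = divmod(2 ** source_bits, n)
    return 0.0 if rem == 0 else 1.0 / low


# Toy case: an 8-bit source reduced to a range of 100.
counts = modulo_counts(8, 100)

# Just under the random32 cutoff of 2**32 / 128 mentioned above.
skew_near_limit = max_relative_skew(32, 2 ** 32 // 128 - 1)
```

In the toy case, outputs 0 to 55 are 50% more likely than outputs 56 to 99. Near the cutoff, the skew is 1/128, a little under 1%, which matches the "approx 1%" figure.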

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143787
Approved by: https://github.com/eqy
2024-12-26 23:54:03 +00:00
1598d45879 remove allow-untyped-defs from torch/ao/__init__.py (#143604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143604
Approved by: https://github.com/aorenste
2024-12-26 23:27:16 +00:00
3f80632c80 Add torch._scaled_mm for CPU (#139975)
This PR is to add `torch._scaled_mm` for CPU backend.

`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various UTs related to FP8 to support CPU tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #139974
2024-12-26 22:22:42 +00:00
26364428f5 Revert "[dynamo] Remove DICT_SUBCLASS_GUARD_MANAGER and use dict.keys (#143722)"
This reverts commit fe95cbe018218d159ba0a0269045b31ff72f1a20.

Reverted https://github.com/pytorch/pytorch/pull/143722 on behalf of https://github.com/wdvr due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/143722#issuecomment-2563127017))
2024-12-26 22:04:36 +00:00
ee25daef5a Revert "[dynamo] Remove dead code after introducing UserDefinedDictVariable (#143699)"
This reverts commit 7d1c6661397f9bff93c1ea389506c8a163d7a2ab.

Reverted https://github.com/pytorch/pytorch/pull/143699 on behalf of https://github.com/wdvr due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/143722#issuecomment-2563127017))
2024-12-26 22:04:35 +00:00
2966fb3708 [pytorch/et] Allow ET to save additional resources for completing a trace like generated kernels and index tensor data (#143775)
The resources directory lets ET observer dump any additional data like Triton kernels while capturing the ET.

This allows us to use the ET trace to replay PT2 workloads and get visibility into data like generated kernels and their usage in a model, index tensor data etc.

We also added a few ways to enable ET and ET Resources through the OS environment variables.

Setting `ENABLE_PYTORCH_EXECUTION_TRACE` will enable default Execution Tracing in PyTorch.

Additionally setting `ENABLE_PYTORCH_EXECUTION_TRACE_EXTRAS` will enable ET to collect extra resources from the ET run like Triton Kernels.
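
A minimal sketch of setting both variables from Python. The value `"1"` and the need to set them before PyTorch starts are assumptions; the description above only says that the variables must be set.

```python
import os

# Assumption: "1" is an accepted value; the PR text only says "setting" the variable.
os.environ["ENABLE_PYTORCH_EXECUTION_TRACE"] = "1"
# Assumption: extras are only meaningful when the base variable is also set.
os.environ["ENABLE_PYTORCH_EXECUTION_TRACE_EXTRAS"] = "1"

enabled = {
    name: os.environ.get(name)
    for name in (
        "ENABLE_PYTORCH_EXECUTION_TRACE",
        "ENABLE_PYTORCH_EXECUTION_TRACE_EXTRAS",
    )
}
```

Exporting the same two variables in the launching shell should be equivalent and avoids any question about import order.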

Differential Revision: [D67610542](https://our.internmc.facebook.com/intern/diff/D67610542/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D67610542/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143775
Approved by: https://github.com/shengfukevin, https://github.com/wdvr
2024-12-26 21:15:39 +00:00
96e9a5aeec [CI] Disable sccache for xpu test (#143851)
Workaround for https://github.com/pytorch/pytorch/issues/143585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143851
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-26 19:45:04 +00:00
3df12d38cf dynamo tracing perf: cache cleaned_instructions: 33.7 -> 30.0 (#143070)
See #143056 for overall docs.

This PR: Cache the interesting/expensive bits of `cleaned_instructions()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143070
Approved by: https://github.com/jansel
2024-12-26 19:02:08 +00:00
51a7ecde80 [Easy] Bump CUDA nightly version to 11.8 / 12.4 / 12.6 in nightly pull tool (#143263)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143263
Approved by: https://github.com/malfet
2024-12-26 19:01:38 +00:00
78502a58ba Enable FSDP2 on XPU device (#143737)
**Motivation:**  Enabling FSDP2 on XPU device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143737
Approved by: https://github.com/awgu
2024-12-26 18:34:11 +00:00
475656fd9c Revert "[BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)"
This reverts commit 2293fe1024812d6349f6e2b3b7de82c6b73f11e4.

Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/malfet due to failing internal ROCM builds with error: ModuleNotFoundError: No module named hipify ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2562973920))
2024-12-26 17:32:23 +00:00
cc4e70b7c3 Revert "Use absolute path path.resolve() -> path.absolute() (#129409)"
This reverts commit 135c7db99d646b8bd9603bf969d47d3dec5987b1.

Reverted https://github.com/pytorch/pytorch/pull/129409 on behalf of https://github.com/malfet due to need to revert to as dependency of https://github.com/pytorch/pytorch/pull/129374 ([comment](https://github.com/pytorch/pytorch/pull/129409#issuecomment-2562969825))
2024-12-26 17:26:06 +00:00
9255ffc841 Revert "Enable more C++ warnings (#143355)"
This reverts commit daa3ffe0ebff58577b8db964447b6abc6de53a25.

Reverted https://github.com/pytorch/pytorch/pull/143355 on behalf of https://github.com/malfet because it fails the internal build system, as it breaks the separation between native and native/cpu ([comment](https://github.com/pytorch/pytorch/pull/143355#issuecomment-2562961546))
2024-12-26 17:13:10 +00:00
cf76c05b4d [inductor] Refactor conditional triton imports into triton_compat.py (#143814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143814
Approved by: https://github.com/Skylion007
ghstack dependencies: #143813
2024-12-26 09:14:06 +00:00
efac5ed81b [inductor] Reorder imports in codecache.py (#143813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143813
Approved by: https://github.com/Skylion007
2024-12-26 09:14:06 +00:00
bf8da4c145 Bump jinja2 from 3.1.4 to 3.1.5 in /.ci/docker (#143844)
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.4 to 3.1.5.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/pallets/jinja/releases">jinja2's releases</a>.</em></p>
<blockquote>
<h2>3.1.5</h2>
<p>This is the Jinja 3.1.5 security fix release, which fixes security issues and bugs but does not otherwise change behavior and should not result in breaking changes compared to the latest feature release.</p>
<p>PyPI: <a href="https://pypi.org/project/Jinja2/3.1.5/">https://pypi.org/project/Jinja2/3.1.5/</a>
Changes: <a href="https://jinja.palletsprojects.com/changes/#version-3-1-5">https://jinja.palletsprojects.com/changes/#version-3-1-5</a>
Milestone: <a href="https://github.com/pallets/jinja/milestone/16?closed=1">https://github.com/pallets/jinja/milestone/16?closed=1</a></p>
<ul>
<li>The sandboxed environment handles indirect calls to <code>str.format</code>, such as by passing a stored reference to a filter that calls its argument. <a href="https://github.com/pallets/jinja/security/advisories/GHSA-q2x7-8rv6-6q7h">GHSA-q2x7-8rv6-6q7h</a></li>
<li>Escape template name before formatting it into error messages, to avoid issues with names that contain f-string syntax. <a href="https://redirect.github.com/pallets/jinja/issues/1792">#1792</a>, <a href="https://github.com/pallets/jinja/security/advisories/GHSA-gmj6-6f8f-6699">GHSA-gmj6-6f8f-6699</a></li>
<li>Sandbox does not allow <code>clear</code> and <code>pop</code> on known mutable sequence types. <a href="https://redirect.github.com/pallets/jinja/issues/2032">#2032</a></li>
<li>Calling sync <code>render</code> for an async template uses <code>asyncio.run</code>. <a href="https://redirect.github.com/pallets/jinja/issues/1952">#1952</a></li>
<li>Avoid unclosed <code>auto_aiter</code> warnings. <a href="https://redirect.github.com/pallets/jinja/issues/1960">#1960</a></li>
<li>Return an <code>aclose</code>-able <code>AsyncGenerator</code> from <code>Template.generate_async</code>. <a href="https://redirect.github.com/pallets/jinja/issues/1960">#1960</a></li>
<li>Avoid leaving <code>root_render_func()</code> unclosed in <code>Template.generate_async</code>. <a href="https://redirect.github.com/pallets/jinja/issues/1960">#1960</a></li>
<li>Avoid leaving async generators unclosed in blocks, includes and extends. <a href="https://redirect.github.com/pallets/jinja/issues/1960">#1960</a></li>
<li>The runtime uses the correct <code>concat</code> function for the current environment when calling block references. <a href="https://redirect.github.com/pallets/jinja/issues/1701">#1701</a></li>
<li>Make <code>|unique</code> async-aware, allowing it to be used after another async-aware filter. <a href="https://redirect.github.com/pallets/jinja/issues/1781">#1781</a></li>
<li><code>|int</code> filter handles <code>OverflowError</code> from scientific notation. <a href="https://redirect.github.com/pallets/jinja/issues/1921">#1921</a></li>
<li>Make compiling deterministic for tuple unpacking in a <code>{% set ... %}</code> call. <a href="https://redirect.github.com/pallets/jinja/issues/2021">#2021</a></li>
<li>Fix dunder protocol (<code>copy</code>/<code>pickle</code>/etc) interaction with <code>Undefined</code> objects. <a href="https://redirect.github.com/pallets/jinja/issues/2025">#2025</a></li>
<li>Fix <code>copy</code>/<code>pickle</code> support for the internal <code>missing</code> object. <a href="https://redirect.github.com/pallets/jinja/issues/2027">#2027</a></li>
<li><code>Environment.overlay(enable_async)</code> is applied correctly. <a href="https://redirect.github.com/pallets/jinja/issues/2061">#2061</a></li>
<li>The error message from <code>FileSystemLoader</code> includes the paths that were searched. <a href="https://redirect.github.com/pallets/jinja/issues/1661">#1661</a></li>
<li><code>PackageLoader</code> shows a clearer error message when the package does not contain the templates directory. <a href="https://redirect.github.com/pallets/jinja/issues/1705">#1705</a></li>
<li>Improve annotations for methods returning copies. <a href="https://redirect.github.com/pallets/jinja/issues/1880">#1880</a></li>
<li><code>urlize</code> does not add <code>mailto:</code> to values like <code>@a@b</code>. <a href="https://redirect.github.com/pallets/jinja/issues/1870">#1870</a></li>
<li>Tests decorated with <code>@pass_context</code> can be used with the <code>|select</code> filter. <a href="https://redirect.github.com/pallets/jinja/issues/1624">#1624</a></li>
<li>Using <code>set</code> for multiple assignment (<code>a, b = 1, 2</code>) does not fail when the target is a namespace attribute. <a href="https://redirect.github.com/pallets/jinja/issues/1413">#1413</a></li>
<li>Using <code>set</code> in all branches of <code>{% if %}{% elif %}{% else %}</code> blocks does not cause the variable to be considered initially undefined. <a href="https://redirect.github.com/pallets/jinja/issues/1253">#1253</a></li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/pallets/jinja/blob/main/CHANGES.rst">jinja2's changelog</a>.</em></p>
<blockquote>
<h2>Version 3.1.5</h2>
<p>Released 2024-12-21</p>
<ul>
<li>The sandboxed environment handles indirect calls to <code>str.format</code>, such as
by passing a stored reference to a filter that calls its argument.
:ghsa:<code>q2x7-8rv6-6q7h</code></li>
<li>Escape template name before formatting it into error messages, to avoid
issues with names that contain f-string syntax.
:issue:<code>1792</code>, :ghsa:<code>gmj6-6f8f-6699</code></li>
<li>Sandbox does not allow <code>clear</code> and <code>pop</code> on known mutable sequence
types. :issue:<code>2032</code></li>
<li>Calling sync <code>render</code> for an async template uses <code>asyncio.run</code>.
:pr:<code>1952</code></li>
<li>Avoid unclosed <code>auto_aiter</code> warnings. :pr:<code>1960</code></li>
<li>Return an <code>aclose</code>-able <code>AsyncGenerator</code> from
<code>Template.generate_async</code>. :pr:<code>1960</code></li>
<li>Avoid leaving <code>root_render_func()</code> unclosed in
<code>Template.generate_async</code>. :pr:<code>1960</code></li>
<li>Avoid leaving async generators unclosed in blocks, includes and extends.
:pr:<code>1960</code></li>
<li>The runtime uses the correct <code>concat</code> function for the current environment
when calling block references. :issue:<code>1701</code></li>
<li>Make <code>|unique</code> async-aware, allowing it to be used after another
async-aware filter. :issue:<code>1781</code></li>
<li><code>|int</code> filter handles <code>OverflowError</code> from scientific notation.
:issue:<code>1921</code></li>
<li>Make compiling deterministic for tuple unpacking in a <code>{% set ... %}</code>
call. :issue:<code>2021</code></li>
<li>Fix dunder protocol (<code>copy</code>/<code>pickle</code>/etc) interaction with <code>Undefined</code>
objects. :issue:<code>2025</code></li>
<li>Fix <code>copy</code>/<code>pickle</code> support for the internal <code>missing</code> object.
:issue:<code>2027</code></li>
<li><code>Environment.overlay(enable_async)</code> is applied correctly. :pr:<code>2061</code></li>
<li>The error message from <code>FileSystemLoader</code> includes the paths that were
searched. :issue:<code>1661</code></li>
<li><code>PackageLoader</code> shows a clearer error message when the package does not
contain the templates directory. :issue:<code>1705</code></li>
<li>Improve annotations for methods returning copies. :pr:<code>1880</code></li>
<li><code>urlize</code> does not add <code>mailto:</code> to values like <code>@a@b</code>. :pr:<code>1870</code></li>
<li>Tests decorated with <code>@pass_context</code> can be used with the <code>|select</code> filter. :issue:<code>1624</code></li>
<li>Using <code>set</code> for multiple assignment (<code>a, b = 1, 2</code>) does not fail when the
target is a namespace attribute. :issue:<code>1413</code></li>
<li>Using <code>set</code> in all branches of <code>{% if %}{% elif %}{% else %}</code> blocks
does not cause the variable to be considered initially undefined.
:issue:<code>1253</code></li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="877f6e51be"><code>877f6e5</code></a> release version 3.1.5</li>
<li><a href="8d58859265"><code>8d58859</code></a> remove test pypi</li>
<li><a href="eda8fe86fd"><code>eda8fe8</code></a> update dev dependencies</li>
<li><a href="c8fdce1e03"><code>c8fdce1</code></a> Fix bug involving calling set on a template parameter within all branches of ...</li>
<li><a href="66587ce989"><code>66587ce</code></a> Fix bug where set would sometimes fail within if</li>
<li><a href="fbc3a696c7"><code>fbc3a69</code></a> Add support for namespaces in tuple parsing (<a href="https://redirect.github.com/pallets/jinja/issues/1664">#1664</a>)</li>
<li><a href="b8f4831d41"><code>b8f4831</code></a> more comments about nsref assignment</li>
<li><a href="ee832194cd"><code>ee83219</code></a> Add support for namespaces in tuple assignment</li>
<li><a href="1d55cddbb2"><code>1d55cdd</code></a> Triple quotes in docs (<a href="https://redirect.github.com/pallets/jinja/issues/2064">#2064</a>)</li>
<li><a href="8a8eafc6b9"><code>8a8eafc</code></a> edit block assignment section</li>
<li>Additional commits viewable in <a href="https://github.com/pallets/jinja/compare/3.1.4...3.1.5">compare view</a></li>
</ul>
</details>
<br />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143844
Approved by: https://github.com/Skylion007

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-12-26 05:20:06 +00:00
cyy
e05bfb8ee3 [Submodule] Bump libfmt to 11.1.0 (#143843)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143843
Approved by: https://github.com/Skylion007
2024-12-26 04:49:11 +00:00
4bacfd6e11 Sort requirements.txt (#143778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143778
Approved by: https://github.com/albanD
2024-12-26 00:51:52 +00:00
cyy
f42cff4e29 [17/N] Fix extra warnings brought by clang-tidy-17 (#143804)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143804
Approved by: https://github.com/Skylion007
2024-12-25 19:54:42 +00:00
a8ac3a6b20 [inductor] fix the adaptive_avg_pool on processing int64 (#143802)
Fixes #143801

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143802
Approved by: https://github.com/jansel
2024-12-25 09:08:43 +00:00
c0d710634f Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#142292)
Reland of #140320 after a failing test on trunk. Fixes potential environment clobbering in the test and makes ROCr+HIP devices (if specified together) more robust to index errors.
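
As a rough illustration of combining the two variables, here is a sketch that assumes `HIP_VISIBLE_DEVICES` indices are interpreted relative to the devices left visible by `ROCR_VISIBLE_DEVICES`, and that out-of-range or non-integer entries are simply ignored. Both behaviours are assumptions of this sketch, the function names are hypothetical, and UUID-style entries are not handled; the discovery code in PyTorch may differ.

```python
from typing import List, Optional


def _parse_indices(value: Optional[str], available: int) -> List[int]:
    # An unset variable means "all devices are visible".
    if value is None:
        return list(range(available))
    result: List[int] = []
    for token in value.split(","):
        token = token.strip()
        # Assumption: silently ignore anything that is not a valid index.
        if token.isdigit() and int(token) < available and int(token) not in result:
            result.append(int(token))
    return result


def visible_devices(total: int, rocr: Optional[str], hip: Optional[str]) -> List[int]:
    # First apply the ROCr-level filter on the physical devices ...
    rocr_visible = _parse_indices(rocr, total)
    # ... then treat HIP indices as positions within that filtered list.
    hip_relative = _parse_indices(hip, len(rocr_visible))
    return [rocr_visible[i] for i in hip_relative]
```

Passing the variable values in as arguments, rather than reading `os.environ` inside the function, lets a test exercise the logic without modifying the process environment.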

Fixes #140318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142292
Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-12-25 02:37:11 +00:00
7013be0094 Use random64 in Fisher-Yates algorithm for large N (#143682)
Fixes bug in randperm https://nbsanity.com/static/a4774194938414dedcec7d6e99727d31/Shuffling_20in_20torch_20vs_20numpy-public.html
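
For context, a Fisher-Yates shuffle needs a uniform draw from a range that can be as large as N, so a 32-bit random source cannot cover the range once N exceeds 2^32. The sketch below is not the PyTorch implementation: the switch-over threshold, the rejection sampling, and the names `draw_below` and `fisher_yates` are all assumptions made for illustration.

```python
import random


def draw_below(bound, rng, bits):
    """Uniform integer in [0, bound) built from `bits`-bit draws."""
    span = 1 << bits
    if bound > span:
        raise ValueError("random source is too narrow for this bound")
    # Reject draws from the incomplete final block to avoid modulo bias.
    limit = span - (span % bound)
    while True:
        r = rng.getrandbits(bits)
        if r < limit:
            return r % bound


def fisher_yates(n, rng=None):
    rng = rng or random.Random()
    # Assumption: use 64-bit draws once n no longer fits in 32 bits.
    bits = 32 if n <= (1 << 32) else 64
    perm = list(range(n))
    for i in range(n - 1):
        # Pick a uniformly random position from the not-yet-fixed suffix.
        j = i + draw_below(n - i, rng, bits)
        perm[i], perm[j] = perm[j], perm[i]
    return perm
```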

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143682
Approved by: https://github.com/eqy, https://github.com/albanD
2024-12-25 01:19:19 +00:00
27b0d41f0a [ROCm] Add miopen_batch_norm to meta_registrations to fix AOTI issue (#143569)
Currently the upstream example for AOTI usage breaks on ROCm (https://pytorch.org/tutorials/recipes/torch_export_aoti_python.html)

```
File "/root/upstream/torch/_dynamo/exc.py", line 317, in unimplemented
    raise Unsupported(msg, case_name=case_name)
torch._dynamo.exc.Unsupported: unsupported operator: aten.miopen_batch_norm.default (see https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit#heading=h.64r4npvq0w0 for how to fix)

from user code:
   File "/root/vision/torchvision/models/resnet.py", line 285, in forward
    return self._forward_impl(x)
  File "/root/vision/torchvision/models/resnet.py", line 269, in _forward_impl
    x = self.bn1(x)
```

This PR adds a meta registration for `miopen_batch_norm` to resolve this issue.
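
A meta registration only describes the outputs of an operator (shapes and dtypes) without running the kernel, which is what tracing needs. The torch-free sketch below shows the shape logic such a function might carry. The function name is hypothetical, and the assumption that both saved statistics have one entry per channel is not taken from the PR; the actual registration in PyTorch may differ.

```python
from typing import Sequence, Tuple

Shape = Tuple[int, ...]


def miopen_batch_norm_meta_shapes(input_shape: Sequence[int]) -> Tuple[Shape, Shape, Shape]:
    # Batch norm expects at least (N, C, ...).
    if len(input_shape) < 2:
        raise ValueError("expected an input with at least 2 dimensions")
    channels = input_shape[1]
    # The normalized output has the same shape as the input.
    out_shape = tuple(input_shape)
    # Assumption: saved mean and saved variance are per-channel vectors.
    stats_shape = (channels,)
    return out_shape, stats_shape, stats_shape
```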

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143569
Approved by: https://github.com/jeffdaily
2024-12-24 23:43:11 +00:00
9035fb5a7b [dynamo] Add types to exc.py (#143626)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143626
Approved by: https://github.com/yanboliang
ghstack dependencies: #143552, #143610
2024-12-24 21:48:32 +00:00
3e7f9e2cc4 [inductor] Shorten tracebacks for errors inside inductor (by skipping AOTAutograd frames) (#143610)
Before #143552
```py
Traceback (most recent call last):
  File "/home/jansel/pytorch/repro.py", line 51, in <module>
    fp32_compiled = optimized_model(low_input)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 576, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 1381, in __call__
    return self._torchdynamo_orig_callable(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 1165, in __call__
    result = self._inner_convert(
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 547, in __call__
    return _compile(
           ^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 987, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 715, in compile_inner
    return _compile_inner(code, one_graph, hooks, transform)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_utils_internal.py", line 95, in wrapper_function
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 750, in _compile_inner
    out_code = transform_code_object(code, transform)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/bytecode_transformation.py", line 1361, in transform_code_object
    transformations(instructions, code_options)
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 231, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 662, in transform
    tracer.run()
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 2870, in run
    super().run()
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 1053, in run
    while self.step():
          ^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 963, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 3050, in RETURN_VALUE
    self._return(inst)
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 3035, in _return
    self.output.compile_subgraph(
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1101, in compile_subgraph
    self.compile_and_call_fx_graph(
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1382, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1432, in call_user_compiler
    return self._call_user_compiler(gm)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1483, in _call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1462, in _call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/__init__.py", line 2314, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1880, in compile_fx
    return aot_autograd(
           ^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/backends/common.py", line 83, in __call__
    cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1145, in aot_module_simplified
    compiled_fn = AOTAutogradCache.load(
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 754, in load
    compiled_fn = dispatch_and_compile()
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1131, in dispatch_and_compile
    compiled_fn, _ = create_aot_dispatcher_function(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 580, in create_aot_dispatcher_function
    return _create_aot_dispatcher_function(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 830, in _create_aot_dispatcher_function
    compiled_fn, fw_metadata = compiler_fn(
                               ^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 676, in aot_dispatch_autograd
    compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 489, in __call__
    return self.compiler_fn(gm, example_inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1758, in fw_compiler_base
    return inner_compile(
           ^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 572, in compile_fx_inner
    return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/repro/after_aot.py", line 102, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 686, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1129, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1044, in codegen_and_compile
    compiled_fn = graph.compile_to_module().call
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1975, in compile_to_module
    return self._compile_to_module()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1981, in _compile_to_module
    self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
                                                             ^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1912, in codegen
    self.scheduler = Scheduler(self.operations)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1880, in __init__
    self._init(nodes)
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1955, in _init
    self.nodes = self.fuse_nodes(self.nodes)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2461, in fuse_nodes
    nodes = self.fuse_nodes_once(nodes)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2773, in fuse_nodes_once
    assert False, "a fake error during fusion"
           ^^^^^
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
AssertionError: a fake error during fusion

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```

Before this PR
```py
Traceback (most recent call last):
  File "/home/jansel/pytorch/repro.py", line 51, in <module>
    fp32_compiled = optimized_model(low_input)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1484, in _call_user_compiler
    raise BackendCompilerFailed(
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1463, in _call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/__init__.py", line 2314, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1880, in compile_fx
    return aot_autograd(
           ^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/backends/common.py", line 83, in __call__
    cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1145, in aot_module_simplified
    compiled_fn = AOTAutogradCache.load(
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 754, in load
    compiled_fn = dispatch_and_compile()
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1131, in dispatch_and_compile
    compiled_fn, _ = create_aot_dispatcher_function(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 580, in create_aot_dispatcher_function
    return _create_aot_dispatcher_function(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 830, in _create_aot_dispatcher_function
    compiled_fn, fw_metadata = compiler_fn(
                               ^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 676, in aot_dispatch_autograd
    compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 489, in __call__
    return self.compiler_fn(gm, example_inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1758, in fw_compiler_base
    return inner_compile(
           ^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 572, in compile_fx_inner
    return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/repro/after_aot.py", line 102, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 686, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1129, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1044, in codegen_and_compile
    compiled_fn = graph.compile_to_module().call
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1975, in compile_to_module
    return self._compile_to_module()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1981, in _compile_to_module
    self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
                                                             ^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1912, in codegen
    self.scheduler = Scheduler(self.operations)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1880, in __init__
    self._init(nodes)
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1955, in _init
    self.nodes = self.fuse_nodes(self.nodes)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2461, in fuse_nodes
    nodes = self.fuse_nodes_once(nodes)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2773, in fuse_nodes_once
    assert False, "a fake error during fusion"
           ^^^^^
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
AssertionError: a fake error during fusion

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```

After this PR
```py
Traceback (most recent call last):
  File "/home/jansel/pytorch/repro.py", line 51, in <module>
    fp32_compiled = optimized_model(low_input)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 704, in _compile_fx_inner
    raise InductorError(e, currentframe()).with_traceback(
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 689, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1138, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1053, in codegen_and_compile
    compiled_fn = graph.compile_to_module().call
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1975, in compile_to_module
    return self._compile_to_module()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1981, in _compile_to_module
    self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
                                                             ^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1912, in codegen
    self.scheduler = Scheduler(self.operations)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1880, in __init__
    self._init(nodes)
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1955, in _init
    self.nodes = self.fuse_nodes(self.nodes)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2461, in fuse_nodes
    nodes = self.fuse_nodes_once(nodes)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2773, in fuse_nodes_once
    assert False, "a fake error during fusion"
           ^^^^^
torch._inductor.exc.InductorError: AssertionError: a fake error during fusion

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```

A large number of frames are removed between:
```py
  File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 704, in _compile_fx_inner
    raise InductorError(e, currentframe()).with_traceback(
```
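
As a rough illustration of how frames can be dropped from the middle of a traceback, here is a minimal plain-Python sketch. It is not the code from this PR: `drop_frames` and the `inner`/`middle`/`outer` functions are hypothetical names, and the approach (rebuilding the chain with `types.TracebackType`) is only one possible way to do it.

```python
import traceback
import types


def drop_frames(exc, should_keep):
    """Hypothetical helper: rebuild exc's traceback, keeping only the
    entries for which should_keep(tb) returns True."""
    entries = []
    tb = exc.__traceback__
    while tb is not None:
        entries.append(tb)
        tb = tb.tb_next
    new_tb = None
    # Walk innermost-first so each rebuilt entry can point at the one after it.
    for tb in reversed(entries):
        if should_keep(tb):
            new_tb = types.TracebackType(
                new_tb, tb.tb_frame, tb.tb_lasti, tb.tb_lineno
            )
    return exc.with_traceback(new_tb)


def inner():
    raise AssertionError("a fake error during fusion")


def middle():  # stands in for the internal frames we want to hide
    inner()


def outer():
    middle()


try:
    outer()
except AssertionError as e:
    trimmed = drop_frames(e, lambda tb: tb.tb_frame.f_code.co_name != "middle")

frame_names = [f.name for f in traceback.extract_tb(trimmed.__traceback__)]
```

After trimming, `frame_names` still ends with `outer` and `inner`, while the `middle` frame no longer appears.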

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143610
Approved by: https://github.com/eellison
ghstack dependencies: #143552
2024-12-24 21:48:32 +00:00
9e5f3fdfc7 [dynamo] Shorten tracebacks for backend compiler errors (#143552)
Fixes #143406

After this PR the error for missing Triton is:
```py
Traceback (most recent call last):
  File "/home/jansel/pytorch/repro.py", line 51, in <module>
    fp32_compiled = optimized_model(low_input)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3624, in create_backend
    raise TritonMissing(inspect.currentframe())
torch._dynamo.exc.TritonMissing: Cannot find a working triton installation. Either the package is not installed or it is too old. More information on installing Triton can be found at: https://github.com/triton-lang/triton

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
```

Setting `TORCHDYNAMO_VERBOSE=1` yields something like the old error:
```py
Traceback (most recent call last):
  File "/home/jansel/pytorch/repro.py", line 51, in <module>
    fp32_compiled = optimized_model(low_input)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 576, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 1383, in __call__
    return self._torchdynamo_orig_callable(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 1167, in __call__
    result = self._inner_convert(
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 548, in __call__
    return _compile(
           ^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 988, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 716, in compile_inner
    return _compile_inner(code, one_graph, hooks, transform)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_utils_internal.py", line 95, in wrapper_function
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 751, in _compile_inner
    out_code = transform_code_object(code, transform)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/bytecode_transformation.py", line 1361, in transform_code_object
    transformations(instructions, code_options)
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 232, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 663, in transform
    tracer.run()
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 2870, in run
    super().run()
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 1053, in run
    while self.step():
          ^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 963, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 3050, in RETURN_VALUE
    self._return(inst)
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 3035, in _return
    self.output.compile_subgraph(
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1102, in compile_subgraph
    self.compile_and_call_fx_graph(
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1383, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1433, in call_user_compiler
    return self._call_user_compiler(gm)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1463, in _call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/__init__.py", line 2314, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1880, in compile_fx
    return aot_autograd(
           ^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/backends/common.py", line 83, in __call__
    cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1145, in aot_module_simplified
    compiled_fn = AOTAutogradCache.load(
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 754, in load
    compiled_fn = dispatch_and_compile()
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1131, in dispatch_and_compile
    compiled_fn, _ = create_aot_dispatcher_function(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 580, in create_aot_dispatcher_function
    return _create_aot_dispatcher_function(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 830, in _create_aot_dispatcher_function
    compiled_fn, fw_metadata = compiler_fn(
                               ^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 676, in aot_dispatch_autograd
    compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 489, in __call__
    return self.compiler_fn(gm, example_inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1758, in fw_compiler_base
    return inner_compile(
           ^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 572, in compile_fx_inner
    return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/repro/after_aot.py", line 102, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 686, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1129, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1044, in codegen_and_compile
    compiled_fn = graph.compile_to_module().call
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1975, in compile_to_module
    return self._compile_to_module()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1981, in _compile_to_module
    self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
                                                             ^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1916, in codegen
    self.scheduler.codegen()
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3667, in codegen
    return self._codegen()
           ^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3761, in _codegen
    if device is not None and self.get_backend(device).ready_to_flush():
                              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3631, in get_backend
    self.backends[device] = self.create_backend(device)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3624, in create_backend
    raise TritonMissing(inspect.currentframe())
torch._dynamo.exc.TritonMissing: Cannot find a working triton installation. Either the package is not installed or it is too old. More information on installing Triton can be found at: https://github.com/triton-lang/triton

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
```

This PR also strips dynamo stack frames from other types of backend compile errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143552
Approved by: https://github.com/yanboliang
2024-12-24 21:48:23 +00:00
844e6108f6 Revert "[Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266)"
This reverts commit ad750ae32079020f51f9b7d01237f3ecfa83b6ff.

Reverted https://github.com/pytorch/pytorch/pull/143266 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing some tests in trunk ([comment](https://github.com/pytorch/pytorch/pull/143266#issuecomment-2561303786))
2024-12-24 17:22:57 +00:00
6c32ef4c5b Remove builder repo from workflows and scripts (#143776)
Part of https://github.com/pytorch/builder/issues/2054
The builder repo is no longer used, so remove any references to it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143776
Approved by: https://github.com/huydhn
2024-12-24 14:11:51 +00:00
aec3b46274 [DTensor] Add aten.amin/amax to linear_reduction_strategy (#143747)
In the same vein as https://github.com/pytorch/pytorch/pull/134206, these two ops still seemed missing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143747
Approved by: https://github.com/kwen2501
2024-12-24 13:36:40 +00:00
b77406a9ec [BE][CI] bump ruff to 0.8.4 (#143753)
Changes:

1. Bump `ruff` from 0.7.4 to 0.8.4
2. Change `%`-formatted strings to f-strings
3. Change arguments with the `__`-prefix to positional-only arguments with the `/` separator in function signature.
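
For readers unfamiliar with the two code rewrites listed above, a small sketch of what they look like. The function names `describe_old` and `describe_new` are made up for illustration and do not come from the PyTorch codebase.

```python
# Before: `%`-formatting, and a `__`-prefixed parameter used as a
# convention for "this argument is positional-only".
def describe_old(__name, count):
    return "%s has %d items" % (__name, count)


# After: an f-string, and the `/` separator, which makes every parameter
# to its left positional-only at the language level.
def describe_new(name, /, count):
    return f"{name} has {count} items"


same_output = describe_old("queue", 3) == describe_new("queue", 3)

try:
    describe_new(name="queue", count=3)  # keyword use is now rejected
    keyword_rejected = False
except TypeError:
    keyword_rejected = True
```

With the `/` form the restriction is enforced by the interpreter rather than by a naming convention.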

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143753
Approved by: https://github.com/Skylion007
2024-12-24 12:24:10 +00:00
dbbc81cb34 Enabled force_shape_pad for test_pad_mm and test_slice_mm_bandwidth_computation (#141768)
Some tests fail for ROCm build on navi arch because of this check: f83361b274/torch/_inductor/fx_passes/pad_mm.py (L211)

There is no need to determine if mm is compute bound for most of the padding tests since they don't specifically test compute bound behavior. We don't have enough empirical data to fine-tune this check for AMD GPUs yet. I propose to force the shape padding for the tests that we had trouble with to avoid this unnecessary logic path.

Please correct me if I didn't add other tests that can potentially fail with this issue or if I added a test that is dependent on logic below the `force_shape_pad` check here: f83361b274/torch/_inductor/fx_passes/pad_mm.py (L444)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141768
Approved by: https://github.com/jeffdaily
2024-12-24 11:03:39 +00:00
783065637e Add FP8 support for eye (#139974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139974
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-12-24 10:00:23 +00:00
060ee14753 [inductor] Make adaptive_max_pool2d error on int64 (#143762)
Fixes #143752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143762
Approved by: https://github.com/yanboliang
2024-12-24 08:33:59 +00:00
135c7db99d Use absolute path path.resolve() -> path.absolute() (#129409)
Changes:

1. Always use an explicit `.absolute()`: `Path(__file__)` -> `Path(__file__).absolute()`
2. Replace `path.resolve()` with `path.absolute()` if the code is resolving the PyTorch repo root directory.
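
The practical difference between the two methods shows up when a symlink is on the path. A minimal sketch, assuming a platform where the current user may create symlinks; the directory names `real` and `link` are arbitrary:

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / "real"
    real.mkdir()
    link = Path(tmp) / "link"
    # On Windows this call may require extra privileges.
    os.symlink(real, link, target_is_directory=True)

    p = link / "file.py"
    # absolute() only anchors the path; the symlink component is kept.
    kept_link = "link" in p.absolute().parts
    # resolve() follows the symlink and lands in the real directory.
    followed_link = p.resolve() == real.resolve() / "file.py"
```

So `absolute()` preserves the path as the user spelled it, while `resolve()` rewrites it to the symlink target.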

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129409
Approved by: https://github.com/albanD
2024-12-24 08:33:08 +00:00
362ecad9bb [ROCm] Use linux.rocm.gpu.2 for 2-GPU and linux.rocm.gpu.4 for 4-GPU runners (#143769)
* Will enable us to target `periodic`/distributed CI jobs to 4-GPU runners using a different label `linux.rocm.gpu.4`
* Use 2-GPU runners for `trunk`, `pull` and `slow` (in addition to `inductor-rocm`) as well (although this currently will not change anything, since all our MI2xx runners have both `linux.rocm.gpu` and `linux.rocm.gpu.2` labels... but this will change in the future: see next point)
* Continue to use `linux.rocm.gpu` label for any job that doesn't need more than 1-GPU eg. binary test jobs in `workflows/generated-linux-binary-manywheel-nightly.yml`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143769
Approved by: https://github.com/jeffdaily
2024-12-24 08:04:00 +00:00
1963fc83a1 [micro_pipeline_tp] don't pass return_A to fused_all_gather_scaled_matmul (#143782)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143782
Approved by: https://github.com/tianyu-l
2024-12-24 07:25:38 +00:00
ad750ae320 [Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266)
This PR aims to add the functionality support of max-autotune for XPU. The current triton templates and configurations are not well optimized for XPU, so the performance is not ready yet. Also, the `mm_plus_mm` template has accuracy issues in some cases. We will address these issues in the next PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143266
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-12-24 05:42:36 +00:00
b0c3f48a40 [inductor] Improve error message for assert_size_stride (#143765)
```
>>> torch._C._dynamo.guards.assert_size_stride(torch.randn(10), (10,), (2,))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError: expected size 10==10, stride 1==2 at dim=0
This error most often comes from an incorrect meta function for a custom op.
See https://pytorch.org/docs/stable/library.html#torch.library.opcheck
>>>
```
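
To see why the example reports `stride 1==2`, the following pure-Python sketch mimics the check in a simplified form. Both helpers are hypothetical illustrations, not the C++ implementation behind `assert_size_stride`, and they assume every dimension is at least 1.

```python
def contiguous_strides(size):
    """Strides (in elements) of a contiguous, row-major tensor of this size."""
    strides = []
    step = 1
    for dim in reversed(size):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def check_size_stride(actual_size, actual_stride, expected_size, expected_stride):
    """Simplified stand-in for the check: compare each dimension in turn."""
    dims = zip(actual_size, expected_size, actual_stride, expected_stride)
    for dim, (size, exp_size, stride, exp_stride) in enumerate(dims):
        if size != exp_size or stride != exp_stride:
            raise AssertionError(
                f"expected size {size}=={exp_size}, "
                f"stride {stride}=={exp_stride} at dim={dim}"
            )


# A freshly created 1-D tensor of 10 elements is contiguous: stride (1,).
actual_stride = contiguous_strides((10,))

try:
    check_size_stride((10,), actual_stride, (10,), (2,))
    message = None
except AssertionError as err:
    message = str(err)
```

The size matches but the actual stride of 1 differs from the expected stride of 2, which is the mismatch shown in the message.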

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143765
Approved by: https://github.com/zou3519
2024-12-24 05:26:05 +00:00
ace645a017 Add support for prototype affine quantization in pt2e flow (#141421)
Summary:
Duplicates affine quantization functionality, including the
observer (https://github.com/pytorch/ao/blob/main/torchao/quantization/observer.py)
and some quant_primitive ops (7c3c51fd0d/torchao/quantization/quant_primitives.py (L26-L30)),
to allow for a per-group quantization min-max observer in the pt2e flow.

Next: We can follow up to add moving average min max observer

Test Plan:
python test/test_quantization.py -k test_channel_group_quantization

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141421
Approved by: https://github.com/cccclai
2024-12-24 04:22:18 +00:00
60a0d53c13 [dynamo] Add test for #143697 (#143764)
The issue from #143697 seems to already be fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143764
Approved by: https://github.com/Skylion007
2024-12-24 03:50:15 +00:00
01d60bcf32 [Easy] Fix todo by enable tests for cuda (#143637)
Fix TODO in `test_tensor_creation_ops.py` file:

```python
# TODO: update to work on CUDA, too
```

**Test Result**

```bash
$ pytest test/test_tensor_creation_ops.py
```

![image](https://github.com/user-attachments/assets/ef829541-668e-446d-a9ab-b26b9d73085f)

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/d6a46eee-1f60-48e6-898a-a8d9620eb54a)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143637
Approved by: https://github.com/albanD
2024-12-24 03:47:43 +00:00
b90a3b7281 [cumsum][CUDA][64-bit indexing] Add 64-bit indexing path for cumsum (#143696)
For #143486

Interestingly, changing the indexing type seems to degrade performance when the larger width is not needed, even on small sizes, so this PR makes it a template parameter rather than forcing all cases to 64-bit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143696
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-24 03:45:28 +00:00
dec4286b2d [inductor] Fix for extract_target with dots (#143766)
Fixes #143650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143766
Approved by: https://github.com/yanboliang
2024-12-24 03:42:15 +00:00
cyy
1feae27ed6 [16/N] Fix extra warnings brought by clang-tidy-17 (#143714)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143714
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-12-24 03:29:38 +00:00
49fdc52fd2 Revert "Add a warning when a tensor with requires_grad=True is converted to a scalar (#143261)"
This reverts commit bc78b6ea4f88d673426d6de17671b82facf50beb.

Reverted https://github.com/pytorch/pytorch/pull/143261 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lint, plz help fix and reland this ([comment](https://github.com/pytorch/pytorch/pull/143261#issuecomment-2560583332))
2024-12-24 03:15:38 +00:00
cyy
d6a066ead6 Simplify host_softmax (#143251)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143251
Approved by: https://github.com/albanD
2024-12-24 02:27:51 +00:00
da21fabf34 [BE] Only print MKL version on x86 platforms (#143763)
As it will obviously be missing on ARM/S390, etc

Test plan: run `python3 -c "import torch;print(torch.__config__.parallel_info())"` on both x86 and non-x86 system
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143763
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-12-24 02:04:26 +00:00
7d1c666139 [dynamo] Remove dead code after introducing UserDefinedDictVariable (#143699)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143699
Approved by: https://github.com/williamwen42, https://github.com/yanboliang, https://github.com/jansel
ghstack dependencies: #143722
2024-12-24 02:00:18 +00:00
fe95cbe018 [dynamo] Remove DICT_SUBCLASS_GUARD_MANAGER and use dict.keys (#143722)
In hindsight, we never needed a DICT_SUBCLASS_GUARD_MANAGER, because Dynamo would inline through the overridden `keys` method. In this PR, we ensure that while creating guards and constructing variable trackers, we get the `d.keys()` value by using `dict.keys(d)`. This ensures that we do not call the overridden `keys` method. Therefore, the C++ guard can use `PyDict_Next` directly to check the guards.
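
The distinction between `d.keys()` and `dict.keys(d)` can be seen with a small dict subclass. The class name `Tricky` is made up for this sketch:

```python
class Tricky(dict):
    """A dict subclass whose keys() does not reflect the stored keys."""

    def keys(self):
        return ["not", "the", "real", "keys"]


d = Tricky(a=1, b=2)

overridden = list(d.keys())      # dispatches to Tricky.keys
underlying = list(dict.keys(d))  # bypasses the override: ['a', 'b']
```

Calling the base-class method explicitly reads the keys actually stored in the underlying dict, regardless of what the subclass overrides.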

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143722
Approved by: https://github.com/jansel
2024-12-24 02:00:18 +00:00
67355a1289 [Easy] Add torch.range, torch.arange params optional description (#143731)
Fixes #129333

**Test Result**

**Before**

![image](https://github.com/user-attachments/assets/c5873690-7de7-4a14-9423-a150d17d137e)

![image](https://github.com/user-attachments/assets/ff4ee545-f27a-403b-bf92-51f9571022a3)

**After**

![image](https://github.com/user-attachments/assets/34e2c41f-8b54-417d-bb10-7ca6f679206a)

![image](https://github.com/user-attachments/assets/b54bcebd-70e9-4a1a-8a22-1ab815e17827)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143731
Approved by: https://github.com/janeyx99
2024-12-24 01:29:24 +00:00
0ca6a47872 Update tag_regex in filter_test_configs.py for workflows such as inductor-rocm (#143768)
This helps to make `continue-through-error`/`keep-going` work as expected on `inductor-rocm` workflow jobs.

Without this, the code here doesn't enter the `if` condition: 6ccb8ed186/.github/scripts/filter_test_configs.py (L577)

Tested via [this PR](https://github.com/pytorch/pytorch/pull/140989):
Without this change: https://hud.pytorch.org/pytorch/pytorch/pull/140989?sha=8232e18957f987d99c946efc0cf6da9be9b52067: https://github.com/pytorch/pytorch/actions/runs/12164558045/job/34192442187#step:13:144

With this change: https://hud.pytorch.org/pytorch/pytorch/pull/140989?sha=763179c5e421791ee05c8e2a600379b29a1c8c33: https://github.com/pytorch/pytorch/actions/runs/12261943684/job/34213300153#step:13:145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143768
Approved by: https://github.com/huydhn
2024-12-24 00:50:14 +00:00
bc78b6ea4f Add a warning when a tensor with requires_grad=True is converted to a scalar (#143261)
Fixes #143071

Operations performed on tensors with `requires_grad=True` such as
```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
```
and
```python
x = torch.tensor(2.0, requires_grad=True)
y = torch.pow(x,3)
```
are valid operations.

While an operation using `numpy` like
```python
import numpy as np

x = torch.tensor(2.0, requires_grad=True)
y = np.pow(x,3)
# > RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
```
leads to an error.

However, an operation that uses `math` like
```python
import math

x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
```
does not cause an error, and `y` is no longer a tensor with a gradient!

This represents a [footgun](https://en.wiktionary.org/wiki/footgun#Noun) for some users, like myself when training small, custom, non-neural network models.

To prevent future undesired behavior, I added a warning when converting tensors with `requires_grad=True` to scalars. Now, when using `math.pow` on a `tensor`, we get a single warning with:
```python
x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
# > UserWarning: Converting a tensor with requires_grad=True to a scalar may lead to unexpected behavior.
# Consider using tensor.detach() first.
```
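
The mechanism can be imitated without PyTorch: `math.pow` converts its arguments through `__float__`, so that hook is a natural place for such a warning. The `ToyTensor` class below is a hypothetical stand-in, not how `torch.Tensor` is implemented.

```python
import math
import warnings


class ToyTensor:
    """Hypothetical stand-in for a scalar tensor with a requires_grad flag."""

    def __init__(self, value, requires_grad=False):
        self.value = value
        self.requires_grad = requires_grad

    def __float__(self):
        # math.pow() reaches this method when it coerces its argument.
        if self.requires_grad:
            warnings.warn(
                "Converting a tensor with requires_grad=True to a scalar "
                "may lead to unexpected behavior.",
                UserWarning,
                stacklevel=2,
            )
        return float(self.value)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    y = math.pow(ToyTensor(2.0, requires_grad=True), 3)
```

Here `y` is a plain Python `float` with no link back to the original object, which is exactly the silent loss of gradient tracking the warning is meant to flag.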

Please let me know if you have any questions 👍
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143261
Approved by: https://github.com/albanD
2024-12-24 00:22:18 +00:00
6ccb8ed186 Refactor AdamW into Adam (heavily inspired by tfsingh) (#143710)
Fixes #104899

Refactors AdamW into Adam by making AdamW a subclass of Adam. Additionally adds a test to assert that the added parameter `decoupled_weight_decay` is True in AdamW and also updates test_defaults_changed_to_foreach to account for the differences in module location for AdamW.
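
The shape of this refactor can be sketched with toy scalar optimizers. `ToyAdam` and `ToyAdamW` are hypothetical and only illustrate the subclassing pattern and the `decoupled_weight_decay` flag; they are not the `torch.optim` implementation.

```python
import math


class ToyAdam:
    """Toy scalar Adam; illustrates the structure only."""

    def __init__(self, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0.0, decoupled_weight_decay=False):
        self.lr, self.betas, self.eps = lr, betas, eps
        self.weight_decay = weight_decay
        self.decoupled_weight_decay = decoupled_weight_decay
        self.m = self.v = 0.0
        self.t = 0

    def step(self, param, grad):
        self.t += 1
        if self.decoupled_weight_decay:
            # AdamW style: shrink the parameter directly.
            param *= 1 - self.lr * self.weight_decay
        else:
            # Classic L2 style: fold the decay into the gradient.
            grad += self.weight_decay * param
        b1, b2 = self.betas
        self.m = b1 * self.m + (1 - b1) * grad
        self.v = b2 * self.v + (1 - b2) * grad * grad
        m_hat = self.m / (1 - b1 ** self.t)
        v_hat = self.v / (1 - b2 ** self.t)
        return param - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


class ToyAdamW(ToyAdam):
    """Same update rule; only the decay mode and its default differ."""

    def __init__(self, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2):
        super().__init__(lr, betas, eps, weight_decay,
                         decoupled_weight_decay=True)


decayed = ToyAdamW(lr=0.1, weight_decay=0.1).step(1.0, 0.0)
```

With a zero gradient the only change to the parameter comes from the decoupled decay, so one step scales it by `1 - lr * weight_decay`.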

Heavily heavily inspired by #118857 by @tfsingh

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143710
Approved by: https://github.com/janeyx99
2024-12-23 23:27:28 +00:00
4271a95590 [logging] A few fixes/updates to record_compilation_metrics (#143332)
Summary: Mostly cosmetic, but one bug fix:
* Bug fix: Make sure compile_id is converted to a string in the compilation metrics so it's printed as, e.g., "0/1" instead of "[0, 1]"
* Sort collections in `collection_to_str`
* Print non-string elements as `"<unknown>"` instead of None (since we don't expect non-strings)
* Move the population of the legacy metrics and any pre-processing to a new factory method in CompilationMetrics
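
A rough sketch of the formatting behavior described in the bullets above. Both functions are hypothetical reconstructions for illustration; the real helpers, their signatures, and the separator they use are not shown in this message and are assumed here.

```python
def compile_id_to_str(frame_id, frame_compile_id):
    """Render a compile id as "0/1" rather than as the list "[0, 1]"."""
    return f"{frame_id}/{frame_compile_id}"


def collection_to_str_sketch(items):
    """Sort the elements and replace non-strings with "<unknown>".

    The comma separator is an assumption made for this sketch.
    """
    cleaned = (item if isinstance(item, str) else "<unknown>" for item in items)
    return ",".join(sorted(cleaned))


compile_id = compile_id_to_str(0, 1)
joined = collection_to_str_sketch({"b", "a", None})
```

Sorting makes the output independent of set iteration order, which keeps logged metrics stable across runs.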

Test Plan:
```
python test/dynamo/test_structured_trace.py
python test/dynamo/test_utils.py
```
Internal testing: https://fburl.com/scuba/dynamo_compile/sandbox/l0me8auf

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143332
Approved by: https://github.com/ppanchalia
2024-12-23 23:10:11 +00:00
2ab698e708 allow profiling on all threads via experimentalConfig (#143659)
In some situations we want to profile calls coming from all threads (similar to on-demand), not just the thread that started profiling and the spawned threads that would inherit KinetoThreadLocal state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143659
Approved by: https://github.com/sraikund16
2024-12-23 20:41:27 +00:00
00831f9b22 [BE]: Properly forward raise pickle exception with from (#143761)
Properly raises the pickle exception with from. Provides a more informative stack trace and forwards information about the exception that led to the current exception.
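
A minimal sketch of the `raise ... from` pattern being applied. `CacheSaveError` and `dumps_or_raise` are hypothetical names, not taken from the changed code.

```python
import pickle


class CacheSaveError(RuntimeError):
    """Hypothetical wrapper exception used only in this sketch."""


def dumps_or_raise(obj):
    try:
        return pickle.dumps(obj)
    except Exception as e:
        # `from e` records the original error as __cause__, so the final
        # traceback shows both exceptions and how they relate.
        raise CacheSaveError(f"failed to pickle {type(obj).__name__}") from e


try:
    dumps_or_raise(lambda: None)  # lambdas cannot be pickled
    cause = None
except CacheSaveError as err:
    cause = err.__cause__
```

The original pickling error stays reachable through `__cause__` instead of being lost when the new exception is raised.
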
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143761
Approved by: https://github.com/XuehaiPan, https://github.com/albanD
2024-12-23 20:21:30 +00:00
75e1f8a227 [ROCm] upgrade nightly wheels to rocm6.3 - 2 of 2 (binaries) (#143613)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143613
Approved by: https://github.com/jeffdaily
2024-12-23 19:47:30 +00:00
0ebc6388cf Revert "Exclude py 31.3t triton package from PyTorch 3.13t wheel (#143218)"
This reverts commit 3bfdf6f0633e6feb067e032009256c740a2a2665.

Reverted https://github.com/pytorch/pytorch/pull/143218 on behalf of https://github.com/atalman due to this constrain is ignored see https://github.com/pytorch/pytorch/issues/143654 ([comment](https://github.com/pytorch/pytorch/pull/143218#issuecomment-2560208992))
2024-12-23 19:37:35 +00:00
727ee853b4 Apply TorchFix TOR203 fixes (#143691)
Codemodded via `torchfix . --select=TOR203 --fix`.
This is a step to unblock https://github.com/pytorch/pytorch/pull/141076
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143691
Approved by: https://github.com/malfet
2024-12-23 18:21:03 +00:00
c042c8a475 Use default_collate from public API (#143616)
Codemodded via `torchfix . --select=TOR104 --fix`.
This is a step to unblock https://github.com/pytorch/pytorch/pull/141076
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143616
Approved by: https://github.com/malfet
2024-12-23 17:38:43 +00:00
a70191da41 Add torch.topk indices vary description (#143736)
Fixes #133542

**Test Result**

**Before**

![image](https://github.com/user-attachments/assets/65227efb-02af-45e7-804c-35588dff360d)

**After**

![image](https://github.com/user-attachments/assets/91f1f53f-008c-4784-82fe-013404e273cb)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143736
Approved by: https://github.com/zou3519
2024-12-23 17:16:31 +00:00
1519a9e30b Revert "Add FP8 support for eye (#139974)"
This reverts commit 01890526b9068ae20b38b2a33e8f11a6331d7d4b.

Reverted https://github.com/pytorch/pytorch/pull/139974 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this seems to fail some slow tests ([comment](https://github.com/pytorch/pytorch/pull/139974#issuecomment-2560046399))
2024-12-23 17:12:39 +00:00
12662901aa [BE] Move Mac BB test to its own step (#143513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143513
Approved by: https://github.com/huydhn, https://github.com/atalman, https://github.com/kit1980, https://github.com/seemethere
ghstack dependencies: #143395, #143511, #143512
2024-12-23 14:05:10 +00:00
5c4545f857 [BE][Easy] enable PYFMT for torch/[a-s]*/ (#138447)
Reproduce command:

```bash
ghstack checkout https://github.com/pytorch/pytorch/pull/138447
git checkout HEAD~1 torch/
lintrunner -a --take "PYFMT" --all-files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138447
Approved by: https://github.com/ezyang
2024-12-23 14:04:00 +00:00
7314cf44ae torch/accelerator: fix device type comparison (#143541)
This was failing without the fix:
```
python -c 'import torch; d=torch.device("xpu:0"); torch.accelerator.current_stream(d)'
```
with:
```
ValueError: xpu doesn't match the current accelerator xpu.
```

CC: @guangyey, @EikanWang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143541
Approved by: https://github.com/guangyey, https://github.com/albanD
2024-12-23 10:54:53 +00:00
434e0c2104 Inductor Cutlass backend: Eliminate unused code. (#143723)
Summary: Eliminates an unused file and some smaller unused code fragments from the inductor cutlass codebase.

Test Plan: CI

Differential Revision: D67579837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143723
Approved by: https://github.com/ColinPeppler
2024-12-23 09:35:03 +00:00
01890526b9 Add FP8 support for eye (#139974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139974
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-12-23 06:47:49 +00:00
448c16ac87 Revert "[reland][AMD] Turn on TF32 for aten::mm (#143549)"
This reverts commit 41cdc7f73552cc8a0dbf2d3cb55440c0d6b548ea.

Reverted https://github.com/pytorch/pytorch/pull/143549 on behalf of https://github.com/malfet due to It breaks ROCM testing, see 06b4b96b34/1 ([comment](https://github.com/pytorch/pytorch/pull/143549#issuecomment-2559016960))
2024-12-23 06:47:36 +00:00
06b4b96b34 dynamo tracing perf: no re in arg_ref: 33.9 -> 33.7 (#143069)
See #143056 for overall docs.

This PR: Avoid use of python re and move valid varname check in
`GuardBuilder.arg_ref()` into C++
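
As an illustration of the idea only (the PR moves the check into C++, and its exact rule is not shown here), a variable-name check can be written without the `re` module. The helper names below are hypothetical:

```python
import keyword
import re

# Regex-based check, as a pure-Python baseline (hypothetical pattern).
_IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_var_name_re(name: str) -> bool:
    """Check an ASCII identifier using a compiled regular expression."""
    return _IDENT_RE.fullmatch(name) is not None and not keyword.iskeyword(name)


def is_valid_var_name_no_re(name: str) -> bool:
    """Same check without regex: str.isidentifier() runs in C inside the interpreter."""
    return name.isascii() and name.isidentifier() and not keyword.iskeyword(name)


samples = ["x", "_tmp1", "L['a']", "1abc", "", "class", "a.b"]
results = {s: is_valid_var_name_no_re(s) for s in samples}
```

Both helpers should agree on the samples; the second avoids invoking the regex engine on a hot path.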

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143069
Approved by: https://github.com/jansel
2024-12-23 05:32:09 +00:00
07fa6e2c8b Fix torch.accelerator API abort when passing invalid device (#143550)
# Motivation
Fix https://github.com/pytorch/pytorch/issues/143543

# Solution
We should raise a Python exception instead of aborting...

# Additional Context
without this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
terminate called after throwing an instance of 'c10::Error'
  what():  device is out of range, device is 2, total number of device is 2.
Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
```
with this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream
    return torch._C._accelerator_getStream(device_index)
RuntimeError: The device index is out of range. It must be in [0, 2), but got 2.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143550
Approved by: https://github.com/EikanWang, https://github.com/dvrogozh, https://github.com/albanD
2024-12-23 03:44:22 +00:00
eebc93d41e Better fix for f-strings in set_linter for py3.12 (#143725)
#143628 didn't handle a few cases correctly, for example:
```py
$ python3 tools/linter/adapters/set_linter.py torch/_inductor/scheduler.py
torch/_inductor/scheduler.py:261:24: Builtin `set` is deprecated
  259 |                 multiline=False,
  260 |             )
  261 |         return f"{self}{data_str}"
                               ^
  262 |
  263 |     def log_details(self) -> None:

torch/_inductor/scheduler.py:261:33: Builtin `set` is deprecated
  259 |                 multiline=False,
  260 |             )
  261 |         return f"{self}{data_str}"
                                        ^
  262 |
  263 |     def log_details(self) -> None:
```
It also mishandled multi-line f-strings.
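
One plausible reading of the false positives (an assumption, not stated in the PR) is that Python 3.12 tokenizes f-strings into separate parts, so the braces inside them appear as ordinary `{` operator tokens that a set-literal check could misread. A small sketch using the standard `tokenize` module shows the difference between versions:

```python
import io
import tokenize


def token_kinds(source: str) -> list[str]:
    """Return the token type names that tokenize produces for the given source."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type] for tok in tokenize.generate_tokens(readline)]


# On Python 3.12+ the f-string is split into FSTRING_START / OP / NAME / ... tokens;
# on earlier versions it is a single STRING token.
kinds = token_kinds('x = f"{self}{data_str}"\n')
print(kinds)
```
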
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143725
Approved by: https://github.com/yanboliang
2024-12-22 22:51:27 +00:00
41cdc7f735 [reland][AMD] Turn on TF32 for aten::mm (#143549)
Summary:
hipblaslt supports TF32, so adding the support.

Original PR https://github.com/pytorch/pytorch/pull/139869

Test Plan: CI

Differential Revision: D67431681

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143549
Approved by: https://github.com/eqy
2024-12-22 21:05:05 +00:00
6425f0779d [BE] Update triton repo link (#143429)
It should be https://github.com/triton-lang/triton rather than https://github.com/openai/triton, shouldn't it?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143429
Approved by: https://github.com/jansel
2024-12-22 18:38:35 +00:00
a316a4581d Add mps to GPU_TYPES (#143634)
Because MPS is a GPU backend, but it does not require Triton.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143634
Approved by: https://github.com/jansel
2024-12-22 18:37:35 +00:00
cyy
09c950cc87 Remove unused <ATen/core/Array.h> inclusion (#143701)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143701
Approved by: https://github.com/albanD
2024-12-22 14:30:11 +00:00
dc55704b48 Rename cache limit to recompile limit in configs (#143709)
This PR renames every cache_limit to recompile_limit via sed.

Old config options are maintained via Config(alias='xyz')
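
A minimal sketch of how an aliased option can keep an old name working after a rename. This is not the actual `Config(alias=...)` implementation; the class below, its constructor arguments, and the default value `8` are all hypothetical:

```python
class AliasedConfig:
    """Hypothetical config holder in which old option names forward to new ones."""

    def __init__(self, defaults, aliases):
        # Bypass our own __setattr__ while setting up internal storage.
        object.__setattr__(self, "_values", dict(defaults))
        object.__setattr__(self, "_aliases", dict(aliases))

    def _resolve(self, name):
        # Map an old name to its replacement; new names map to themselves.
        return self._aliases.get(name, name)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, i.e. for option names.
        try:
            return self._values[self._resolve(name)]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        key = self._resolve(name)
        if key not in self._values:
            raise AttributeError(name)
        self._values[key] = value


config = AliasedConfig(
    defaults={"recompile_limit": 8},  # hypothetical default
    aliases={"cache_limit": "recompile_limit"},
)
config.cache_limit = 16  # writing through the old name updates the new option
```

Reads and writes through either name hit the same stored value, so existing user code that sets the old option keeps working.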

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143709
Approved by: https://github.com/jansel
2024-12-22 10:03:57 +00:00
9bf4b1c2e9 dynamo tracing perf: c++ strip_function_call: 49.12 -> 47.77 (#143063)
See #143056 for overall docs.

This PR: Convert `strip_function_call()` into C++

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143063
Approved by: https://github.com/jansel
ghstack dependencies: #143057, #143062
2024-12-22 06:38:46 +00:00
3ec04d30d5 dynamo tracing perf: kill import: 50.36 -> 49.12 (#143062)
See #143056 for overall docs.

This PR: Stop importing in the body of `BuiltinVariable.call_getattr()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143062
Approved by: https://github.com/jansel
ghstack dependencies: #143057
2024-12-22 06:38:46 +00:00
f2b744b9ca dynamo tracing perf: import_module: 59.92 -> 52.9 (#143057)
See #143056 for overall docs.

This PR: Using `importlib.import_module()` within the hot path of
symbolic_convert is slow. Memoize it.
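
A minimal sketch of memoizing the import, assuming a simple `functools.lru_cache` wrapper (the PR's actual caching code may differ, and `import_module_cached` is a hypothetical name):

```python
import functools
import importlib


@functools.lru_cache(maxsize=None)
def import_module_cached(name: str):
    """Import a module once per name; later calls return the cached module object."""
    return importlib.import_module(name)


json_a = import_module_cached("json")
json_b = import_module_cached("json")  # served from the cache, no import machinery
```

`importlib.import_module()` already consults `sys.modules`, but it still goes through name resolution and locking on every call; the cache turns repeated lookups into a dictionary hit.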

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143057
Approved by: https://github.com/jansel
2024-12-22 06:38:38 +00:00
f1cbf4b1b5 Enable ruff's unused variable checking everywhere in pytorch (#136965)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136965
Approved by: https://github.com/cyyever, https://github.com/albanD
2024-12-22 02:33:11 +00:00
2293fe1024 [BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)
Changes by apply order:

1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.

    `.parent{...}.absolute()` -> `.absolute().parent{...}`

4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)

    `.parent.parent.parent.parent` -> `.parents[3]`

5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~

    ~`.parents[3]` -> `.parents[4 - 1]`~

6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~
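
The rewrites in steps 2 and 4 can be sketched as follows; the path is a made-up example:

```python
import os.path
from pathlib import Path

path = "/repo/tools/linter/adapters/set_linter.py"  # hypothetical path

# Step 2: nested os.path.dirname(...) calls become Path(...).parent.parent.
nested = os.path.dirname(os.path.dirname(path))
with_parent = Path(path).parent.parent

# Step 4: a chain of N `.parent` accesses becomes `.parents[N - 1]`.
chained = Path(path).parent.parent.parent.parent
indexed = Path(path).parents[3]
```

Here `with_parent` points at the same directory as `nested`, and `indexed` equals `chained`.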

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-12-21 22:08:01 +00:00
197954e14b Revert "Handle meta tensors in FX quantization (#142262)"
This reverts commit e97b97af56204230f1030bd297dda9bc6b053a4c.

Reverted https://github.com/pytorch/pytorch/pull/142262 on behalf of https://github.com/janeyx99 due to this PR broke lint  ([comment](https://github.com/pytorch/pytorch/pull/142262#issuecomment-2558233022))
2024-12-21 20:34:09 +00:00
0666347fc4 [Codemod][AddExplicitStrictExportArg] caffe2/benchmarks/dynamo (#143686)
Reviewed By: avikchaudhuri

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143686
Approved by: https://github.com/tugsbayasgalan
2024-12-21 19:56:56 +00:00
e97b97af56 Handle meta tensors in FX quantization (#142262)
Summary:
If the module being quantized contains some meta tensors and some tensors on an actual device, quantization should not fail.

Quantization should also not fail if the new quantized module is created on a meta device.

Differential Revision: D66895899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142262
Approved by: https://github.com/iamzainhuda
2024-12-21 13:19:30 +00:00
cyy
daa3ffe0eb Enable more C++ warnings (#143355)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143355
Approved by: https://github.com/albanD
2024-12-21 09:19:02 +00:00
e15442a9b2 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit 6733045a4aaef7a8d9fb1f9f8b80f4f5f4ef1f4f.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but my first attempt to fix internal build does not fix all the cases, so let us try again ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2558043056))
2024-12-21 08:06:19 +00:00
51eacea8c4 graph module retracing without preserving MCS (#143676)
Retracing while preserving module call signatures used to be a problem because graph modules don't have submodules at given paths. This led to a number of failing retraceability tests. By not trying to wrap modules with export tracepoints we can pass most of these tests; the only exception is where you do module swapping on retraced programs, which is still not possible.

Differential Revision: [D67539304](https://our.internmc.facebook.com/intern/diff/D67539304/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143676
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
ghstack dependencies: #143664
2024-12-21 07:57:43 +00:00
cyy
d7e59c2f85 Fix cppcoreguidelines-pro-type-member-init (#141787)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141787
Approved by: https://github.com/albanD
2024-12-21 07:51:30 +00:00
7b2af25f80 [1/n] Support Dynamic Memory Budget in Auto AC (#143539)
# Summary:
Full Context: https://docs.google.com/document/d/1-j5KSbfGFJQcH4sYh7BIeJXso3zYzl5G5yFQqXdKx_o/edit?usp=sharing

tl;dr

This change introduces classes which help determine a dynamic memory budget. This will mostly be helpful for models with many implicit graph breaks.

---

New Classes:

*GraphInfoProvider*
* Takes the joint_graph as well as the input memories and runtimes and parses the graph + values into usable forms for the SolverEvaluator.

*KnapsackEvaluator*
* Provides a function that, given four inputs (the solver function as a callable, max_dynamic_memory_budget, min_dynamic_memory_budget, dynamic_memory_budget_pareto_granularity), returns an approximation of the knee point of the Pareto distribution.

# Test Plan:

### LintRunner

LintRunner Output: P1700445547

### Unit Tests

```
$ buck test @mode/opt //caffe2/test/functorch:test_ac_knapsack
`@mode/opt` was specified, but not found. Using file at `//mode/opt`.
This behavior is being deprecated. Please use `"@//mode/opt"` instead
File changed: fbcode//caffe2/.ruff_cache/0.7.4/.tmpB6PmDS
File changed: fbsource//xplat/caffe2/test/functorch/test_ac_knapsack.py
File changed: fbcode//caffe2/.ruff_cache/0.7.4/.tmpyjCiPn
20 additional file change events
Buck UI: https://www.internalfb.com/buck2/414ead46-9ede-4192-8e1a-5d3c52bdb9cc
Test UI: https://www.internalfb.com/intern/testinfra/testrun/6473924710342830
Network: Up: 0B  Down: 0B  (reSessionID-159794b9-9d61-477e-8e63-9bdeaa537dca)
Analyzing targets. Remaining     0/214
Executing actions. Remaining     0/6933                                                                                                                                                                                  0.1s exec time total
Command: test.     Finished 1 local
Time elapsed: 18.5s
Tests finished: Pass 15. Fail 0. Fatal 0. Skip 0. Build failure 0
```

### Test Run

Updated the config:

```
      activation_memory_budget_solver: DYNAMIC_MEMORY_BUDGET_DP
```

Confirming proper execution via: [aps-fb_fm_v4_768_01_dynamic-2a792ba8af](https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-fb_fm_v4_768_01_dynamic-2a792ba8af?job_attempt=0&version=0&env=PRODUCTION)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143539
Approved by: https://github.com/jansel
2024-12-21 07:38:52 +00:00
bee47b0663 Revert "[pytorch/et] Allow ET to save additional resources for completing a trace like generated kernels and index tensor data (#143430)"
This reverts commit 33dd4f187dd3b54d65182d56998feae235ee48c7.

Reverted https://github.com/pytorch/pytorch/pull/143430 on behalf of https://github.com/huydhn due to The internal diff D58707846 has been backed out ([comment](https://github.com/pytorch/pytorch/pull/143430#issuecomment-2558033930))
2024-12-21 07:26:34 +00:00
47c4e01e71 [audio hash update] update the pinned audio hash (#143694)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143694
Approved by: https://github.com/pytorchbot
2024-12-21 05:42:34 +00:00
9f3c291bc3 Fix issue with setAttribute and int8_t vs int32_t variables (#143693)
Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143693
Approved by: https://github.com/huydhn
2024-12-21 05:31:56 +00:00
518b5050c0 Fix unused-variable issues in caffe2 (#143639)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143639
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/cyyever
2024-12-21 05:27:38 +00:00
f44310097c Reuse partial reductions (#143600)
Reuse partial reductions for complete reductions. We could expand this to cover more types of reductions, although we'd have to be a bit more careful about keeping the intermediary, partial reduction in higher precision.

Just doing the ops which do not depend on a higher compute_dtype_precision for now to cover the relevant use case initially.

Fix for https://github.com/pytorch/pytorch/issues/136267. Longer term, we should make sure cooperative reductions fuse partial and complete reductions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143600
Approved by: https://github.com/vkuzo
2024-12-21 04:44:07 +00:00
97990f476d Revert "Fix unused-variable issues in caffe2 (#143639)"
This reverts commit 23ca7c2515dd1f601926c4fd0e65513308c135a9.

Reverted https://github.com/pytorch/pytorch/pull/143639 on behalf of https://github.com/huydhn due to This is failing OSS tests ([comment](https://github.com/pytorch/pytorch/pull/143639#issuecomment-2557991297))
2024-12-21 04:30:48 +00:00
b89bfe0bac Revert "Fix issue with setAttribute and int8_t vs int32_t variables (#143693)"
This reverts commit ae3d385fcba0f91f35b2848b852d4c75f88cbd62.

Reverted https://github.com/pytorch/pytorch/pull/143693 on behalf of https://github.com/huydhn due to Sorry for reverting this change but it has a conflict with https://github.com/pytorch/pytorch/pull/143639 that is breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/143693#issuecomment-2557990508))
2024-12-21 04:27:18 +00:00
a8953c36f5 [compiled autograd] log compilation time to perfetto (#140964)
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmprli4iy/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100
```
[
  {
    "args": {
      "compile_id": "0/-/-",
      "graph_id": 0
    },
    "cat": "dynamo_timed",
    "name": "compiled_autograd",
    "ph": "B",
    "pid": 0,
    "tid": 0,
    "ts": 1733886868992655.8
  },
  {
    "args": {
      "compile_id": "0/-/-",
      "graph_id": 0
    },
    "cat": "dynamo_timed",
    "name": "compiled_autograd",
    "ph": "E",
    "pid": 0,
    "tid": 0,
    "ts": 1733886869130681.0
  },
  {
    "args": {
      "compile_id": "0/0/0"
    },
    "cat": "dynamo_timed",
    "name": "dynamo",
    "ph": "B",
    "pid": 0,
    "tid": 0,
    "ts": 1733886869134350.5
  },
  {
```
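
As a rough sketch of how such events can be read: in the Chrome trace event format, `"ph": "B"` and `"ph": "E"` mark the begin and end of a span and `ts` is in microseconds. The helper below is hypothetical and pairs events by name only, which is a simplification (real viewers also key on `pid` and `tid`):

```python
# The two complete compiled_autograd events from the log above.
events = [
    {"args": {"compile_id": "0/-/-", "graph_id": 0}, "cat": "dynamo_timed",
     "name": "compiled_autograd", "ph": "B", "pid": 0, "tid": 0,
     "ts": 1733886868992655.8},
    {"args": {"compile_id": "0/-/-", "graph_id": 0}, "cat": "dynamo_timed",
     "name": "compiled_autograd", "ph": "E", "pid": 0, "tid": 0,
     "ts": 1733886869130681.0},
]


def durations_us(trace_events):
    """Pair begin/end events by name and return the elapsed microseconds per name."""
    open_ts = {}
    result = {}
    for ev in trace_events:
        if ev["ph"] == "B":
            open_ts[ev["name"]] = ev["ts"]
        elif ev["ph"] == "E" and ev["name"] in open_ts:
            result[ev["name"]] = ev["ts"] - open_ts.pop(ev["name"])
    return result


elapsed = durations_us(events)
```

For the events shown, the `compiled_autograd` span comes out at roughly 0.138 s.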

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140964
Approved by: https://github.com/masnesral
ghstack dependencies: #141907, #143175
2024-12-21 04:23:25 +00:00
c7d7eff798 Revert "[MTIA] (3/n) Implement PyTorch APIs to query/reset device peak memory usage (#143347)"
This reverts commit efe21ee59dfdd6642cc693e69e07aa9d8be13eb9.

Reverted https://github.com/pytorch/pytorch/pull/143347 on behalf of https://github.com/huydhn due to D67118173 has been backed out internally ([comment](https://github.com/pytorch/pytorch/pull/143347#issuecomment-2557983266))
2024-12-21 04:04:16 +00:00
dabc9566c4 Revert "(MTIA) Move "empty_cache" API (#143402)"
This reverts commit c7d9f298072a3f59b39517e367c7d3d2ea30e6d9.

Reverted https://github.com/pytorch/pytorch/pull/143402 on behalf of https://github.com/huydhn due to The internal diff D67148738 has been reverted ([comment](https://github.com/pytorch/pytorch/pull/143402#issuecomment-2557982597))
2024-12-21 04:01:23 +00:00
fecf03fa3f [AOTI][reland] Emit a CMakeLists.txt when package_cpp_only (#143680)
Summary: Emit a CMakeLists.txt with compile and link options when package_cpp_only is specified. After unzipping AOTI generated .pt2 package file, user can manually build the generated model code in their local environment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143680
Approved by: https://github.com/huydhn
2024-12-21 03:48:40 +00:00
b5e159270a [AOTI XPU] Replace intel compiler with g++ to build inductor CPP wrapper in runtime. (#142322)
This PR removes the dependency on the Intel compiler at Inductor runtime. Now we only need SYCL_HOME at runtime to find the SYCL headers and libraries.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142322
Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/albanD
ghstack dependencies: #143491
2024-12-21 02:27:04 +00:00
af0e159740 [Inductor XPU] Add XPU check for is_big_gpu(). (#143491)
Fix #143472

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143491
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/EikanWang
2024-12-21 02:27:04 +00:00
0da004f3dd [dynamo] Remove transformers ModelOutput hack (#143567)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143567
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #143548
2024-12-21 01:46:14 +00:00
4627cfd1f9 [dynamo] Support user defined dicts (#143548)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143548
Approved by: https://github.com/yanboliang, https://github.com/jansel, https://github.com/williamwen42
2024-12-21 01:46:14 +00:00
9cb743d1f9 [easy] Set feature use for aot autograd remote cache (#143674)
Use set_feature_use for logging aot autograd cache so that dynamo_compile has this data as well as PT2 Compile Events.

Differential Revision: [D67536293](https://our.internmc.facebook.com/intern/diff/D67536293/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143674
Approved by: https://github.com/bobrenjc93
2024-12-21 01:40:18 +00:00
ffd1b53f26 [aot] refactor dynamo source and cudagraphs static idx logic (#141748)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141748
Approved by: https://github.com/ezyang
2024-12-21 01:20:53 +00:00
ae3d385fcb Fix issue with setAttribute and int8_t vs int32_t variables (#143693)
Test Plan: Sandcastle

Differential Revision: D67549758

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143693
Approved by: https://github.com/huydhn
2024-12-21 01:19:29 +00:00
bdeee82822 unflatten isinstance (#143664)
When we unflatten, the submodules we generate (`InterpreterModule` or `InterpreterModuleDispatcher`) are not related by type to the original submodules `N`. This makes `isinstance(mod, N)` checks fail. Since we do not have the original types after export, the best we can do is expose a `type_name()` method that carries the original type name, which we do carry in `nn_module_stack` entries.

Differential Revision: [D67526542](https://our.internmc.facebook.com/intern/diff/D67526542/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143664
Approved by: https://github.com/tugsbayasgalan
2024-12-21 01:07:10 +00:00
d88ebbf822 cleanup chromium event log on dynamo exit rather than on entry (#143175)
clearing at dynamo start is an issue because it throws away events from compiled autograd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143175
Approved by: https://github.com/Skylion007, https://github.com/jamesjwu
ghstack dependencies: #141907
2024-12-21 00:41:24 +00:00
4ee166b82f [ca] add compiled autograd to CompileId (#141907)
tlparse PR: https://github.com/ezyang/tlparse/pull/83

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141907
Approved by: https://github.com/ezyang
2024-12-21 00:41:24 +00:00
0ce233b8ca Support tensor subclass unwrapping (#141941)
This PR adds support for export to unwrap/wrap subclasses AOT so that we can trace through subclass parameters. This will resolve the UX issue in torchao where users had to manually unwrap their subclasses before calling export.

Differential Revision: [D67531057](https://our.internmc.facebook.com/intern/diff/D67531057)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141941
Approved by: https://github.com/bdhirsh
2024-12-21 00:29:31 +00:00
553031fb9a [BE] Remove gcc-5 workaround for unused args (#143685)
ditto

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143685
Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/atalman
2024-12-21 00:18:15 +00:00
ad7ab5ef84 Revert "[logging] A few fixes/updates to record_compilation_metrics (#143332)"
This reverts commit a9c753bbc88bfdc0e77f66956b3a11e405235d0f.

Reverted https://github.com/pytorch/pytorch/pull/143332 on behalf of https://github.com/malfet due to Surprisingly failure is caused by this PR ([comment](https://github.com/pytorch/pytorch/pull/143332#issuecomment-2557899120))
2024-12-21 00:06:44 +00:00
bf7009d839 [rpc] Fix unit test after c10::nullopt removal (#143690)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143690
Approved by: https://github.com/yifuwang, https://github.com/c-p-i-o, https://github.com/XilunWu
2024-12-20 23:36:07 +00:00
eqy
912d6a2867 [CUDA] Bump tolerances in test_svd_lowrank_cuda_float64 (#143049)
pre-emptive bump for apparent noisy failure

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143049
Approved by: https://github.com/Skylion007, https://github.com/lezcano, https://github.com/nikitaved
2024-12-20 23:25:21 +00:00
8960cb5809 Add support for bfloat16 atomic adds in fbcode (#143629)
Reland https://github.com/pytorch/pytorch/pull/141857 and fall back on A100, which doesn't have bfloat16 atomic add instructions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143629
Approved by: https://github.com/eellison
2024-12-20 23:05:13 +00:00
a3b04d473e [ROCm] Update setup-rocm for almalinux-based images (#143590)
Needed for https://github.com/pytorch/test-infra/pull/6003 and https://github.com/pytorch/ao/pull/999

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143590
Approved by: https://github.com/atalman

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2024-12-20 22:48:54 +00:00
23ca7c2515 Fix unused-variable issues in caffe2 (#143639)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143639
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-12-20 22:30:58 +00:00
6e58c37542 c10d: no call_guard in init (#143598)
`py::call_guard<py::gil_scoped_release>` is not safe when using multiple threads. This instead moves it into the init function which is safe.

For more details see #143593

https://github.com/pybind/pybind11/issues/5473

Test plan:

```
python setup.py develop
```

CI

```py
import time
from concurrent.futures import ThreadPoolExecutor
from torch import distributed as dist

def run():
    store = dist.TCPStore(
        host_name="localhost",
        port=0,
        is_master=True,
        wait_for_workers=False,
    )

    # this sleep is required to trigger the crash
    time.sleep(0.1)
    del store

futures = []
with ThreadPoolExecutor(
    max_workers=100,
) as executor:
    for i in range(100000):
        print(i)
        futures.append(executor.submit(run))
        if len(futures) > 100:
            futures.pop(0).result()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143598
Approved by: https://github.com/c-p-i-o
2024-12-20 22:23:36 +00:00
a9c753bbc8 [logging] A few fixes/updates to record_compilation_metrics (#143332)
Summary: Mostly cosmetic, but one bug fix:
* Bug fix: Make sure compile_id is converted to a string in the compilation metrics so it's printed as, e.g., "0/1" instead of "[0, 1]"
* Sort collections in `collection_to_str`
* Print non-string elements as `"<unknown>"` instead of None (since we don't expect non-strings)
* Move the population of the legacy metrics and any pre-processing to a new factory method in CompilationMetrics
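
A minimal sketch of the formatting behavior described in the first three bullets. The real `collection_to_str` signature is not shown in this message, so the signatures below are assumptions and `compile_id_to_str` is a hypothetical helper:

```python
def collection_to_str(items) -> str:
    """Join a collection into a stable string: sorted, with non-strings masked."""
    if items is None:
        return ""
    cleaned = [item if isinstance(item, str) else "<unknown>" for item in items]
    return ",".join(sorted(cleaned))


def compile_id_to_str(frame_id: int, frame_compile_id: int) -> str:
    """Render a compile id as "0/1" rather than as a list such as "[0, 1]"."""
    return f"{frame_id}/{frame_compile_id}"


print(collection_to_str({"b", "a"}))
print(collection_to_str(["x", None]))
print(compile_id_to_str(0, 1))
```

Sorting makes the logged string independent of set iteration order, which keeps metrics comparable across runs.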

Test Plan:
```
python test/dynamo/test_structured_trace.py
python test/dynamo/test_utils.py
```
Internal testing: https://fburl.com/scuba/dynamo_compile/sandbox/l0me8auf

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143332
Approved by: https://github.com/ppanchalia
2024-12-20 21:42:32 +00:00
372b023eb1 Fix test_serialization_zipfile_actually_jit when weights_only is not default (#143668)
Fails in fbcode where weights_only isn't default

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143668
Approved by: https://github.com/awgu
ghstack dependencies: #143326, #143403
2024-12-20 21:25:10 +00:00
33dd4f187d [pytorch/et] Allow ET to save additional resources for completing a trace like generated kernels and index tensor data (#143430)
The resources directory lets ET observer dump any additional data like Triton kernels while capturing the ET.

This allows us to use the ET trace to replay PT2 workloads and get visibility into data like generated kernels and their usage in a model, index tensor data etc.

We also added a few ways to enable ET and ET Resources through the OS environment variables.

Setting `ENABLE_PYTORCH_EXECUTION_TRACE` will enable default Execution Tracing in PyTorch.

Additionally, setting `ENABLE_PYTORCH_EXECUTION_TRACE_EXTRAS` will enable ET to collect extra resources from the ET run, such as Triton kernels.

Differential Revision: [D58707846](https://our.internmc.facebook.com/intern/diff/D58707846/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143430
Approved by: https://github.com/shengfukevin, https://github.com/sraikund16
2024-12-20 21:20:32 +00:00
cee06e74ee Apply clang-format for ATen/core/dispatch headers (#143620)
Code changed by adding a path config to the `.lintrunner.toml` file and running

```bash
 $ lintrunner -a --take CLANGFORMAT --all-files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143620
Approved by: https://github.com/malfet
2024-12-20 21:16:23 +00:00
8e483654cb Add config.save.use_pinned_memory_for_d2h to serialization config (#143342)
This was benchmarked with two separate scripts on my A100
(A) Save state_dict of llama3-style model on CUDA to disk with ``torch.save``
(B) Save `ModuleList` of 10 `nn.Linear(10,000, 10,000)` on CUDA to disk with `torch.save`
Timings are an average of 5 runs and benchmark scripts + results are attached

Under both scenarios, we see a **~2x speedup in ``torch.save`` time with ``compute_crc32=False`` and ``use_pinned_memory_for_d2h=True``** compared to the baseline of the current defaults (``compute_crc32=True`` and ``use_pinned_memory_for_d2h=False``).

(A)  Save state_dict of llama3-style model on CUDA to disk with ``torch.save`` [[script](https://gist.github.com/mikaylagawarecki/d3a86ea1bb08045d1a839976808d7432)][[results](https://gist.github.com/mikaylagawarecki/f61a4714e5cff703146a1fcb7e0c755c)]

| | `use_pinned_memory_for_d2h=False` (Default) | `use_pinned_memory_for_d2h=True` |
|-|-|-|
| `compute_crc32=True` (Default) | 28.54s | 20.76s |
| `compute_crc32=False` | 22.57s | **14.51s** |

(B) Save `ModuleList` of 10 `nn.Linear(10,000, 10,000)` on CUDA to disk with `torch.save` [[script](https://gist.github.com/mikaylagawarecki/ecbc505436bdd4b5190ef1b3430c12b6)][[results](https://gist.github.com/mikaylagawarecki/4e686bcf030b57de8c3ca74d8f5a88f7)]

| | `use_pinned_memory_for_d2h=False` (Default) | `use_pinned_memory_for_d2h=True` |
|-|-|-|
| `compute_crc32=True` (Default) | 8.38s | 5.53s |
| `compute_crc32=False` | 6.94s | **3.99s** |
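
The "~2x" figure can be checked directly from the two tables by dividing the default-settings time by the best time. The dictionary keys below are just labels for the two benchmarks:

```python
# Timings in seconds, copied from tables (A) and (B):
# baseline = both defaults, best = CRC32 disabled and pinned memory enabled.
timings = {
    "A (llama3-style state_dict)": {"baseline": 28.54, "best": 14.51},
    "B (ModuleList of 10 nn.Linear)": {"baseline": 8.38, "best": 3.99},
}

speedups = {name: t["baseline"] / t["best"] for name, t in timings.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x")
```

This gives about 1.97x for (A) and about 2.10x for (B).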

Trace of (A) with `use_pinned_memory_for_d2h=True`, `compute_crc32=False`
<img width="1745" alt="Screenshot 2024-12-16 at 7 32 33 PM" src="https://github.com/user-attachments/assets/80b87a8c-5a70-4eb9-ad66-7abc4aa7cc25" />

Baseline trace of (A) with `use_pinned_memory_for_d2h=False`, `compute_crc32=True`
<img width="1799" alt="Screenshot 2024-12-16 at 7 38 20 PM" src="https://github.com/user-attachments/assets/13fa12d1-8f5f-424c-adc4-275b67012927" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143342
Approved by: https://github.com/albanD
ghstack dependencies: #143324
2024-12-20 21:01:18 +00:00
3f63b742e6 Refactor serialization getter/setters into torch.utils.serialization.config (#143324)
Consolidate
- get/set_default_load_endianness
- get/set_default_mmap_options
- get/set_crc32_options

into one global dynamo-style config + allow global setting of mmap. The existing APIs are not removed and will get/set from the config (as they can't be removed for BC)

In #143459 I add the local (argument style) config

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143324
Approved by: https://github.com/albanD
2024-12-20 21:01:17 +00:00
629de988df Fix old-compiler-unfriendly zero init of bfloat16_t array (#143504)
clang versions before 17 don't like to assign 0 to a bfloat16_t. gcc versions before 13 also won't assign 0.0 to a bfloat16_t. (Citation: https://godbolt.org/z/Gzs5ebdej)

Differential Revision: [D67396740](https://our.internmc.facebook.com/intern/diff/D67396740/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143504
Approved by: https://github.com/malfet
2024-12-20 20:49:51 +00:00
485497e727 [c10d][fr] flight recorder improvements (#143446)
Summary:
1. Flight recorder traces are now dumped automatically upon
   timeout or exception. Users no longer need to opt in.
2. Change the default dump location to the `.cache` folder in the
   running user's home directory.

Test Plan:
1. Tested locally by running the crash program from flight recorder
   tutorial page.
   https://pytorch.org/tutorials/prototype/flight_recorder_tutorial.html#an-end-to-end-example
2. Noted that flight recorder files were correctly created.
❯ pwd
/home/cpio/.cache/fr_trace
❯ ls
nccl_trace_rank_0  nccl_trace_rank_1
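
A small helper for checking the new default location, as a sketch only: the `fr_trace` directory name and the `nccl_trace_rank_` file prefix are taken from the listing above, and `list_fr_dumps` is a hypothetical name, not a PyTorch API.

```python
from pathlib import Path
from typing import List, Optional


def list_fr_dumps(dump_dir: Optional[Path] = None) -> List[str]:
    """Return the names of flight recorder dump files, sorted by name."""
    if dump_dir is None:
        # Assumed default, based on the listing above: ~/.cache/fr_trace
        dump_dir = Path.home() / ".cache" / "fr_trace"
    if not dump_dir.is_dir():
        return []
    return sorted(p.name for p in dump_dir.glob("nccl_trace_rank_*"))


print(list_fr_dumps())
```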

Differential Revision: [D67363720](https://our.internmc.facebook.com/intern/diff/D67363720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143446
Approved by: https://github.com/d4l3k
2024-12-20 20:41:30 +00:00
a94f259a69 pgo: Log feature use (#142819)
This will cause dynamo_compile to populate the feature column if we have
a hit for PGO.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142819
Approved by: https://github.com/ezyang
2024-12-20 20:22:20 +00:00
8ce0bc282a dynamo tracing perf: bytecode_transform improvements: 34.86 -> 33.9 (#143068)
See #143056 for overall docs.

This PR: Use slots on InstructionExnTabEntry and Instruction. Stop doing Python
version checks in the middle of `convert_instruction()` and
`inst_has_op_bits()`.
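
To illustrate what "use slots" means, here is a sketch with simplified stand-in classes. The field names are illustrative and do not mirror Dynamo's real `Instruction` definition.

```python
class InstructionWithDict:
    """Ordinary class: every instance carries a per-instance __dict__."""

    def __init__(self, opcode, opname, arg):
        self.opcode = opcode
        self.opname = opname
        self.arg = arg


class InstructionWithSlots:
    """Slotted class: attributes live in fixed slots, no per-instance __dict__."""

    __slots__ = ("opcode", "opname", "arg")

    def __init__(self, opcode, opname, arg):
        self.opcode = opcode
        self.opname = opname
        self.arg = arg


plain = InstructionWithDict(100, "LOAD_CONST", 0)
slotted = InstructionWithSlots(100, "LOAD_CONST", 0)

assert hasattr(plain, "__dict__")
assert not hasattr(slotted, "__dict__")

# Slots also reject attributes that were not declared up front.
try:
    slotted.typo = 1
    rejected = False
except AttributeError:
    rejected = True
```

Dropping the per-instance dictionary reduces memory per object, which matters when many instruction objects are created during tracing.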

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143068
Approved by: https://github.com/jansel
ghstack dependencies: #143065, #143067
2024-12-20 20:06:42 +00:00
5feb2d7b41 dynamo tracing perf: don't call expensive _set_guard_export_info if it's a duplicate guard: 37.66 -> 34.86 (#143067)
See #143056 for overall docs.

This PR: Move the call to `_set_guard_export_info()` after the duplicate guard
check in `GuardBuilder.DUPLICATE_INPUT()`
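
The reordering can be sketched as follows. `GuardBuilderSketch` is a hypothetical simplification, not the real `GuardBuilder`; the counter stands in for the expensive work.

```python
class GuardBuilderSketch:
    def __init__(self):
        self.seen = set()
        self.export_info_calls = 0  # counts invocations of the expensive step

    def _set_guard_export_info(self, guard):
        # Stand-in for the expensive bookkeeping.
        self.export_info_calls += 1

    def duplicate_input(self, guard):
        # Cheap duplicate check first ...
        if guard in self.seen:
            return False
        self.seen.add(guard)
        # ... and only then pay for the expensive call.
        self._set_guard_export_info(guard)
        return True


builder = GuardBuilderSketch()
results = [builder.duplicate_input(g) for g in ["a", "b", "a", "a"]]
```

With four guards of which two are duplicates, the expensive step runs only twice.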

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143067
Approved by: https://github.com/jansel
ghstack dependencies: #143065
2024-12-20 20:06:42 +00:00
7d4e7fbfc1 dynamo tracing perf: no import on hot path: 47.62 -> 47.26 (#143065)
See #143056 for overall docs.

This PR: Removed another `import` in the body of the hot path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143065
Approved by: https://github.com/jansel
2024-12-20 20:06:42 +00:00
792e6184c5 [GPT-fast] Support running a specific model or micro-benchmark (#143607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143607
Approved by: https://github.com/BoyuanFeng, https://github.com/jerryzh168, https://github.com/huydhn
2024-12-20 19:58:07 +00:00
94737e8a2a [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights, groupsize, in_features, and out_features.
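
Step 1 (symmetric 4-bit quantization and packing two values into one uint8) can be sketched in plain Python. This is an illustration only: the `[-8, 7]` range, the `max_abs / 7` scale, the offset-by-8 encoding, and the low-nibble-first order are assumptions chosen for the example, not the layout used by KleidiAI or by `_dyn_quant_pack_4bit_weight`.

```python
from typing import List, Tuple


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric 4-bit quantization of one group: ints in [-8, 7] plus a scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def pack_nibbles(q: List[int]) -> bytes:
    """Pack pairs of signed 4-bit values into bytes (low nibble first, offset by 8)."""
    assert len(q) % 2 == 0
    out = bytearray()
    for lo, hi in zip(q[0::2], q[1::2]):
        out.append(((hi + 8) << 4) | (lo + 8))
    return bytes(out)


def unpack_nibbles(packed: bytes) -> List[int]:
    """Inverse of pack_nibbles."""
    q = []
    for byte in packed:
        q.append((byte & 0x0F) - 8)
        q.append((byte >> 4) - 8)
    return q


group = [0.5, -1.0, 0.25, 0.0] * 8        # one group of 32 weights
q, scale = quantize_group(group)
packed = pack_nibbles(q)
dequantized = [v * scale for v in unpack_nibbles(packed)]
```

The packed buffer is half the length of the group, and unpacking recovers the quantized integers exactly; only the rounding in `quantize_group` loses information.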

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf :
7B Transformer model:
Prefill : 340 t/s
Decode  : 40  t/s
2B Transformer model
Prefill : 747 t/s
Decode  : 80  t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-20 19:32:03 +00:00
b5475d334e [inductor] Fix an unused variable in cpu_vec_isa.py (#138473)
----

* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138473
Approved by: https://github.com/EikanWang, https://github.com/albanD, https://github.com/xuhancn
2024-12-20 18:50:19 +00:00
5a69c2a649 [BE][Sparse] Get rid of gcc-5 workaround (#143653)
Discovered those comments while looking at https://github.com/pytorch/pytorch/pull/143620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143653
Approved by: https://github.com/albanD
2024-12-20 18:40:45 +00:00
a5ed499f6a FlexAttention Benchmark (#139665)
1. Add alibi, sliding window, tanh softcap, prefixLM, and document_mask from attn_gym to benchmark.

2. Add comparison to different SDPA backends & FAv2, FAv3, FAKV.

Dependent on https://github.com/pytorch/pytorch/pull/139639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139665
Approved by: https://github.com/drisspg
2024-12-20 17:52:24 +00:00
c7d9f29807 (MTIA) Move "empty_cache" API (#143402)
Summary: This diff moves one of memory-related APIs to the consolidated location, which is `mtia/memory.py`.

Test Plan:
```
buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api
```

https://www.internalfb.com/intern/testinfra/testrun/13510798943184259

Reviewed By: nautsimon

Differential Revision: D67148738

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143402
Approved by: https://github.com/nautsimon
2024-12-20 17:39:06 +00:00
d79fbf6b6d test/dynamo/test_utils: logging - Stop testing for impossible things. (#143535)
We don't support assigning to objects or numeric constants at the top level in
config modules, no need to test for them.

(This specifically breaks later sorting refactoring, since it requires <
to be implemented).
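
As a plain-Python illustration of why sorting needs `<`: arbitrary objects without `__lt__` cannot be sorted directly, while an explicit key works. `ConfigValue` is a hypothetical class used only for this sketch.

```python
class ConfigValue:
    """Arbitrary object that does not define __lt__."""

    def __init__(self, name):
        self.name = name


values = [ConfigValue("b"), ConfigValue("a")]

try:
    sorted(values)
    sort_failed = False
except TypeError:
    # '<' is not supported between ConfigValue instances.
    sort_failed = True

# Sorting by a comparable key avoids the problem.
ordered = sorted(values, key=lambda v: v.name)
```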

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143535
Approved by: https://github.com/ppanchalia
2024-12-20 17:21:49 +00:00
f5af87c23c Make Inductor cpp backend enable_floating_point_contract_flag to take string (#143450)
Differential Revision: D66269001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143450
Approved by: https://github.com/desertfire
2024-12-20 16:28:54 +00:00
7ab880bc5e fix typo in autocast header (#143625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143625
Approved by: https://github.com/mlazos
ghstack dependencies: #143592
2024-12-20 16:17:15 +00:00
4f8b7c4272 Revert "refactor tensorify restart logic to use sources (#141517)" (#143623)
This reverts commit 30d8b30db7eaaa254d97077ac6515cdc4568fd6d.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143623
Approved by: https://github.com/mlazos
2024-12-20 15:38:34 +00:00
607884c9af [Inductor][CPP] Fix bitwise shift with corner inputs (#143635)
**Summary**
Fixes https://github.com/pytorch/pytorch/issues/143555 and https://github.com/pytorch/pytorch/issues/143566 by aligning the implementation with eager mode (29b586bbad/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp (L501)) at these corner inputs.

**Test Plan**
```
python test/inductor/test_cpu_repro.py -k test_bitwise_shift_corner_inputs
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143635
Approved by: https://github.com/jgong5
2024-12-20 13:47:40 +00:00
7bf3b7cdc5 Rewrite _reparametrize_module to use contextmanager (#138203)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138203
Approved by: https://github.com/zou3519
ghstack dependencies: #136033, #140604
2024-12-20 12:02:27 +00:00
1c817fe671 Set enable_trace_contextlib_contextmanager flag to True (#140604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140604
Approved by: https://github.com/zou3519
ghstack dependencies: #136033
2024-12-20 12:02:27 +00:00
673cc88fd6 Add support for contextmanager in Dynamo (#136033)
Fixes #130559

* Intro

This PR adds support for `@contextmanager` in Dynamo. We chose to limit the
scope of this work to only `@contextmanager` and plan to handle generators fully
in #141055 (still in draft).

* Motivation

Dynamo lacks support for generator functions. When it encounters one, it traces
it as if it were a regular function. This is problematic because it can lead to
incorrect behavior. To illustrate, consider the test case below:

```python
import torch
import contextlib

@contextlib.contextmanager
def set_default_dtype(dtype):
    old_dtype = torch.get_default_dtype()
    try:
        torch.set_default_dtype(dtype)
        yield
    finally:
        torch.set_default_dtype(old_dtype)

@torch.compile(backend="eager", fullgraph=True)
def fn():
    with set_default_dtype(torch.float64):
        x = torch.tensor([3.0, 3.0 + 5.0j])
    return x
```

Before this work, Dynamo would not stop at the `yield`, and the graph produced
would contain both calls to `set_default_dtype` executed one after the other.
This is incorrect because the context manager should execute code before and
after the `yield`.
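
The ordering that must be preserved can be shown without torch. The sketch below records events in a list; `recorded` and `events` are illustrative names, not part of this PR.

```python
import contextlib

events = []


@contextlib.contextmanager
def recorded(label):
    events.append(f"enter {label}")      # runs when the with-block is entered
    try:
        yield
    finally:
        events.append(f"exit {label}")   # runs when the with-block exits


def fn():
    with recorded("ctx"):
        events.append("body")
    return events


fn()
assert events == ["enter ctx", "body", "exit ctx"]
```

Tracing the generator as a regular function would run the "enter" and "exit" halves back to back, before the body, which is the incorrect behavior described above.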

* List of changes

`YIELD_VALUE` now raises an exception (`YieldValueOp`) to signal that control
flow must be suspended and returned to the caller. Additionally, `RETURN_VALUE`
behaves differently in a generator function. Unlike regular functions, where
`RETURN_VALUE` indicates the final result, in generators it signifies that the
generator is exhausted and implicitly raises `StopIteration`.

A new `VariableTracker` named `FunctionDecoratedByContextlibContextManagerVariable`
was introduced to handle `@contextmanager`. This variable tracker acts not just
as a wrapper for the original function but also maintains an internal `tx`
(InstructionTranslator) object to suspend and return control flow to the parent
tracer when a `yield` is encountered.
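
The CPython semantics that these changes mirror can be observed directly. This is a sketch of plain interpreter behavior, not of Dynamo internals: a generator's bytecode contains `YIELD_VALUE`, and a `return` inside it surfaces to the caller as `StopIteration` carrying the returned value.

```python
import dis


def gen():
    yield 1
    return 2


# The generator body compiles to bytecode containing a YIELD_VALUE instruction.
opnames = {ins.opname for ins in dis.get_instructions(gen)}
assert "YIELD_VALUE" in opnames

g = gen()
first = next(g)              # runs up to the yield and suspends

try:
    next(g)                  # resumes; the return ends the generator
    returned = None
except StopIteration as stop:
    returned = stop.value    # the value passed to `return`
```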

* Corner cases

Returning a context manager from a compiled function is not supported. This
would require PyTorch to synchronize the generator state between Dynamo and the
interpreter. Any attempt to return it will result in an `IncorrectUsage`
exception.

Graph breaks require special handling as well. In the event of a graph break,
the frame associated with the context manager is skipped, and the context
manager runs in eager mode.

* This PR is breaking my code

There is a configuration flag (`enable_trace_contextlib`) that can be set to
`False` to disable tracing context managers. If this still causes crashes,
please revert this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136033
Approved by: https://github.com/zou3519
2024-12-20 12:02:20 +00:00
04b26ee1e8 Fix false positive from f-strings in set_linter (#143628)
This linter reported many false positives on f-strings under Python 3.12, for example:
```py
$ python3 tools/linter/adapters/set_linter.py torch/_inductor/runtime/triton_heuristics.py
torch/_inductor/runtime/triton_heuristics.py:192:25: Builtin `set` is deprecated
  190 |     args_str += ", ".join(call_args)
  191 |     for k, v in call_kwargs.items():
  192 |         args_str += f", {k}={v}"
                                ^
  193 |
  194 |     abs_path = os.path.abspath(sys.argv[0])

torch/_inductor/runtime/triton_heuristics.py:192:27: Builtin `set` is deprecated
  190 |     args_str += ", ".join(call_args)
  191 |     for k, v in call_kwargs.items():
  192 |         args_str += f", {k}={v}"
                                  ^
  193 |
  194 |     abs_path = os.path.abspath(sys.argv[0])

torch/_inductor/runtime/triton_heuristics.py:192:29: Builtin `set` is deprecated
  190 |     args_str += ", ".join(call_args)
  191 |     for k, v in call_kwargs.items():
  192 |         args_str += f", {k}={v}"
                                    ^
  193 |
  194 |     abs_path = os.path.abspath(sys.argv[0])

torch/_inductor/runtime/triton_heuristics.py:192:31: Builtin `set` is deprecated
  190 |     args_str += ", ".join(call_args)
  191 |     for k, v in call_kwargs.items():
  192 |         args_str += f", {k}={v}"
                                      ^
  193 |
  194 |     abs_path = os.path.abspath(sys.argv[0])

torch/_inductor/runtime/triton_heuristics.py:195:17: Builtin `set` is deprecated
  193 |
  194 |     abs_path = os.path.abspath(sys.argv[0])
  195 |     with open(f"{abs_path}.launch_params", "a") as f:
                        ^
  196 |         f.write(f"{kernel_name} | {args_str}\n")
  197 |

torch/_inductor/runtime/triton_heuristics.py:195:26: Builtin `set` is deprecated
  193 |
  194 |     abs_path = os.path.abspath(sys.argv[0])
  195 |     with open(f"{abs_path}.launch_params", "a") as f:
                                 ^
  196 |         f.write(f"{kernel_name} | {args_str}\n")
  197 |

torch/_inductor/runtime/triton_heuristics.py:196:19: Builtin `set` is deprecated
  194 |     abs_path = os.path.abspath(sys.argv[0])
  195 |     with open(f"{abs_path}.launch_params", "a") as f:
  196 |         f.write(f"{kernel_name} | {args_str}\n")
                          ^
  197 |
  198 |

torch/_inductor/runtime/triton_heuristics.py:196:31: Builtin `set` is deprecated
  194 |     abs_path = os.path.abspath(sys.argv[0])
  195 |     with open(f"{abs_path}.launch_params", "a") as f:
  196 |         f.write(f"{kernel_name} | {args_str}\n")
                                      ^
  197 |
  198 |

torch/_inductor/runtime/triton_heuristics.py:196:35: Builtin `set` is deprecated
  194 |     abs_path = os.path.abspath(sys.argv[0])
  195 |     with open(f"{abs_path}.launch_params", "a") as f:
  196 |         f.write(f"{kernel_name} | {args_str}\n")
                                          ^
  197 |
  198 |

torch/_inductor/runtime/triton_heuristics.py:196:44: Builtin `set` is deprecated
  194 |     abs_path = os.path.abspath(sys.argv[0])
  195 |     with open(f"{abs_path}.launch_params", "a") as f:
  196 |         f.write(f"{kernel_name} | {args_str}\n")
                                                   ^
  197 |
  198 |

torch/_inductor/runtime/triton_heuristics.py:729:26: Builtin `set` is deprecated
  727 |         exec(
  728 |             f"""
  729 |             def launcher({', '.join(def_args)}, grid, stream):
                                 ^
  730 |                 if callable(grid):
  731 |                     grid_0, grid_1, grid_2 = grid(grid_meta)

torch/_inductor/runtime/triton_heuristics.py:729:46: Builtin `set` is deprecated
  727 |         exec(
  728 |             f"""
  729 |             def launcher({', '.join(def_args)}, grid, stream):
                                                     ^
  730 |                 if callable(grid):
  731 |                     grid_0, grid_1, grid_2 = grid(grid_meta)

torch/_inductor/runtime/triton_heuristics.py:735:24: Builtin `set` is deprecated
  733 |                     grid_0, grid_1, grid_2 = grid
  734 |
  735 |                 args = {', '.join(call_args)},
                               ^
  736 |                 launch_args = get_launch_args(
  737 |                     grid, grid_0, grid_1, grid_2, stream, function,

torch/_inductor/runtime/triton_heuristics.py:735:45: Builtin `set` is deprecated
  733 |                     grid_0, grid_1, grid_2 = grid
  734 |
  735 |                 args = {', '.join(call_args)},
                                                    ^
  736 |                 launch_args = get_launch_args(
  737 |                     grid, grid_0, grid_1, grid_2, stream, function,

torch/_inductor/runtime/triton_heuristics.py:1144:20: Builtin `set` is deprecated
 1142 |     cur_file = inspect.stack()[1].filename
 1143 |     summary_str = (
 1144 |         f"SUMMARY ({cur_file})\n"
                           ^
 1145 |         f"{overall_time:.2f}ms   \t {overall_gb:.2f} GB\t {overall_gb / (overall_time / 1e3):.2f}GB/s"
 1146 |     )

torch/_inductor/runtime/triton_heuristics.py:1144:29: Builtin `set` is deprecated
 1142 |     cur_file = inspect.stack()[1].filename
 1143 |     summary_str = (
 1144 |         f"SUMMARY ({cur_file})\n"
                                    ^
 1145 |         f"{overall_time:.2f}ms   \t {overall_gb:.2f} GB\t {overall_gb / (overall_time / 1e3):.2f}GB/s"
 1146 |     )

torch/_inductor/runtime/triton_heuristics.py:1162:61: Builtin `set` is deprecated
 1160 |                 )
 1161 |                 file.write("====================\n")
 1162 |                 file.write(f"TRITON KERNELS BANDWIDTH INFO ({cur_file})\n")
                                                                    ^
 1163 |                 for ms, num_gb, gb_per_s, kernel_name in sorted_calls:
 1164 |                     # also display the runtime percentage for each kernel

torch/_inductor/runtime/triton_heuristics.py:1162:70: Builtin `set` is deprecated
 1160 |                 )
 1161 |                 file.write("====================\n")
 1162 |                 file.write(f"TRITON KERNELS BANDWIDTH INFO ({cur_file})\n")
                                                                             ^
 1163 |                 for ms, num_gb, gb_per_s, kernel_name in sorted_calls:
 1164 |                     # also display the runtime percentage for each kernel

torch/_inductor/runtime/triton_heuristics.py:1166:36: Builtin `set` is deprecated
 1164 |                     # also display the runtime percentage for each kernel
 1165 |                     percentage = f"{ms / overall_time * 100:.2f}%"
 1166 |                     suffix = f" \t {percentage} \t {kernel_name}"
                                           ^
 1167 |                     bw_info_str = create_bandwidth_info_str(
 1168 |                         ms,

torch/_inductor/runtime/triton_heuristics.py:1166:47: Builtin `set` is deprecated
 1164 |                     # also display the runtime percentage for each kernel
 1165 |                     percentage = f"{ms / overall_time * 100:.2f}%"
 1166 |                     suffix = f" \t {percentage} \t {kernel_name}"
                                                      ^
 1167 |                     bw_info_str = create_bandwidth_info_str(
 1168 |                         ms,

torch/_inductor/runtime/triton_heuristics.py:1166:52: Builtin `set` is deprecated
 1164 |                     # also display the runtime percentage for each kernel
 1165 |                     percentage = f"{ms / overall_time * 100:.2f}%"
 1166 |                     suffix = f" \t {percentage} \t {kernel_name}"
                                                           ^
 1167 |                     bw_info_str = create_bandwidth_info_str(
 1168 |                         ms,

torch/_inductor/runtime/triton_heuristics.py:1166:64: Builtin `set` is deprecated
 1164 |                     # also display the runtime percentage for each kernel
 1165 |                     percentage = f"{ms / overall_time * 100:.2f}%"
 1166 |                     suffix = f" \t {percentage} \t {kernel_name}"
                                                                       ^
 1167 |                     bw_info_str = create_bandwidth_info_str(
 1168 |                         ms,

torch/_inductor/runtime/triton_heuristics.py:1175:30: Builtin `set` is deprecated
 1173 |                     )
 1174 |                     file.write(bw_info_str + "\n")
 1175 |                 file.write(f"{summary_str}\n\n")
                                     ^
 1176 |         except Exception as e:
 1177 |             log.warning(

torch/_inductor/runtime/triton_heuristics.py:1175:42: Builtin `set` is deprecated
 1173 |                     )
 1174 |                     file.write(bw_info_str + "\n")
 1175 |                 file.write(f"{summary_str}\n\n")
                                                 ^
 1176 |         except Exception as e:
 1177 |             log.warning(

torch/_inductor/runtime/triton_heuristics.py:1205:29: Builtin `set` is deprecated
 1203 |         else:
 1204 |             possible_names = _find_names(self)
 1205 |             kernel_name = f"{max(possible_names, key=len)}"
                                    ^
 1206 |             if not re.match(self.regex_filter, kernel_name):
 1207 |                 return

torch/_inductor/runtime/triton_heuristics.py:1205:58: Builtin `set` is deprecated
 1203 |         else:
 1204 |             possible_names = _find_names(self)
 1205 |             kernel_name = f"{max(possible_names, key=len)}"
                                                                 ^
 1206 |             if not re.match(self.regex_filter, kernel_name):
 1207 |                 return

torch/_inductor/runtime/triton_heuristics.py:1241:60: Builtin `set` is deprecated
 1239 |                     "%s",
 1240 |                     create_bandwidth_info_str(
 1241 |                         ms, num_gb, gb_per_s, suffix=f" \t {kernel_name}"
                                                                   ^
 1242 |                     ),
 1243 |                 )

torch/_inductor/runtime/triton_heuristics.py:1241:72: Builtin `set` is deprecated
 1239 |                     "%s",
 1240 |                     create_bandwidth_info_str(
 1241 |                         ms, num_gb, gb_per_s, suffix=f" \t {kernel_name}"
                                                                               ^
 1242 |                     ),
 1243 |                 )

torch/_inductor/runtime/triton_heuristics.py:1256:15: Builtin `set` is deprecated
 1254 |     for cfg in configs:
 1255 |         hasher.update(
 1256 |             f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode()
                      ^
 1257 |         )
 1258 |     return hasher.hexdigest()

torch/_inductor/runtime/triton_heuristics.py:1256:42: Builtin `set` is deprecated
 1254 |     for cfg in configs:
 1255 |         hasher.update(
 1256 |             f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode()
                                                 ^
 1257 |         )
 1258 |     return hasher.hexdigest()

torch/_inductor/runtime/triton_heuristics.py:1256:44: Builtin `set` is deprecated
 1254 |     for cfg in configs:
 1255 |         hasher.update(
 1256 |             f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode()
                                                   ^
 1257 |         )
 1258 |     return hasher.hexdigest()

torch/_inductor/runtime/triton_heuristics.py:1256:58: Builtin `set` is deprecated
 1254 |     for cfg in configs:
 1255 |         hasher.update(
 1256 |             f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode()
                                                                 ^
 1257 |         )
 1258 |     return hasher.hexdigest()

torch/_inductor/runtime/triton_heuristics.py:1256:60: Builtin `set` is deprecated
 1254 |     for cfg in configs:
 1255 |         hasher.update(
 1256 |             f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode()
                                                                   ^
 1257 |         )
 1258 |     return hasher.hexdigest()

torch/_inductor/runtime/triton_heuristics.py:1256:75: Builtin `set` is deprecated
 1254 |     for cfg in configs:
 1255 |         hasher.update(
 1256 |             f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode()
                                                                                  ^
 1257 |         )
 1258 |     return hasher.hexdigest()

torch/_inductor/runtime/triton_heuristics.py:1377:23: Builtin `set` is deprecated
 1375 |         if numel is None:
 1376 |             continue
 1377 |         block = cfg[f"{label}BLOCK"]
                              ^
 1378 |         if numel == 1:
 1379 |             assert block == 1, (

torch/_inductor/runtime/triton_heuristics.py:1377:29: Builtin `set` is deprecated
 1375 |         if numel is None:
 1376 |             continue
 1377 |         block = cfg[f"{label}BLOCK"]
                                    ^
 1378 |         if numel == 1:
 1379 |             assert block == 1, (

torch/_inductor/runtime/triton_heuristics.py:1381:24: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                               ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:38: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                             ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:46: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                     ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:52: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                           ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:58: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                                 ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:64: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                                       ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:71: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                                              ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:77: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                                                    ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:84: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                                                           ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:88: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                                                               ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1384:52: Builtin `set` is deprecated
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]
 1384 |         max_block_str = f'config.triton.max_block["{label}"]'
                                                           ^
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"

torch/_inductor/runtime/triton_heuristics.py:1384:58: Builtin `set` is deprecated
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]
 1384 |         max_block_str = f'config.triton.max_block["{label}"]'
                                                                 ^
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"

torch/_inductor/runtime/triton_heuristics.py:1386:45: Builtin `set` is deprecated
 1384 |         max_block_str = f'config.triton.max_block["{label}"]'
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
                                                    ^
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
 1388 |         )

torch/_inductor/runtime/triton_heuristics.py:1386:51: Builtin `set` is deprecated
 1384 |         max_block_str = f'config.triton.max_block["{label}"]'
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
                                                          ^
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
 1388 |         )

torch/_inductor/runtime/triton_heuristics.py:1386:66: Builtin `set` is deprecated
 1384 |         max_block_str = f'config.triton.max_block["{label}"]'
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
                                                                         ^
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
 1388 |         )

torch/_inductor/runtime/triton_heuristics.py:1386:80: Builtin `set` is deprecated
 1384 |         max_block_str = f'config.triton.max_block["{label}"]'
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
                                                                                       ^
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
 1388 |         )

torch/_inductor/runtime/triton_heuristics.py:1387:20: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                           ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:26: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                 ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:33: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                        ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:39: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                              ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:45: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                                    ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:59: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                                                  ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:61: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                                                    ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:71: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                                                              ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:78: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                                                                     ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:82: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                                                                         ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1402:19: Builtin `set` is deprecated
 1400 |             assert (
 1401 |                 val <= max_block
 1402 |             ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}."
                          ^
 1403 |
 1404 |

torch/_inductor/runtime/triton_heuristics.py:1402:23: Builtin `set` is deprecated
 1400 |             assert (
 1401 |                 val <= max_block
 1402 |             ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}."
                              ^
 1403 |
 1404 |

torch/_inductor/runtime/triton_heuristics.py:1402:46: Builtin `set` is deprecated
 1400 |             assert (
 1401 |                 val <= max_block
 1402 |             ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}."
                                                     ^
 1403 |
 1404 |

torch/_inductor/runtime/triton_heuristics.py:1402:56: Builtin `set` is deprecated
 1400 |             assert (
 1401 |                 val <= max_block
 1402 |             ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}."
                                                               ^
 1403 |
 1404 |

torch/_inductor/runtime/triton_heuristics.py:1402:67: Builtin `set` is deprecated
 1400 |             assert (
 1401 |                 val <= max_block
 1402 |             ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}."
                                                                          ^
 1403 |
 1404 |

torch/_inductor/runtime/triton_heuristics.py:1402:71: Builtin `set` is deprecated
 1400 |             assert (
 1401 |                 val <= max_block
 1402 |             ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}."
                                                                              ^
 1403 |
 1404 |

torch/_inductor/runtime/triton_heuristics.py:1551:21: Builtin `set` is deprecated
 1549 |     rnumels = {}
 1550 |     for idx in range(num_reduction_dims - 1, -1, -1):
 1551 |         prefix = f"r{idx}_"
                            ^
 1552 |         max_size = min(size_hints[prefix], TRITON_MAX_BLOCK[prefix.upper()])
 1553 |         dim = min(max_size, remaining)

torch/_inductor/runtime/triton_heuristics.py:1551:25: Builtin `set` is deprecated
 1549 |     rnumels = {}
 1550 |     for idx in range(num_reduction_dims - 1, -1, -1):
 1551 |         prefix = f"r{idx}_"
                                ^
 1552 |         max_size = min(size_hints[prefix], TRITON_MAX_BLOCK[prefix.upper()])
 1553 |         dim = min(max_size, remaining)

torch/_inductor/runtime/triton_heuristics.py:1556:34: Builtin `set` is deprecated
 1554 |         assert (
 1555 |             remaining % dim == 0
 1556 |         ), f"Expected dimension '{dim}' to divide remaining size '{remaining}'"
                                         ^
 1557 |         rnumels[prefix] = dim
 1558 |         remaining //= dim

torch/_inductor/runtime/triton_heuristics.py:1556:38: Builtin `set` is deprecated
 1554 |         assert (
 1555 |             remaining % dim == 0
 1556 |         ), f"Expected dimension '{dim}' to divide remaining size '{remaining}'"
                                             ^
 1557 |         rnumels[prefix] = dim
 1558 |         remaining //= dim

torch/_inductor/runtime/triton_heuristics.py:1556:67: Builtin `set` is deprecated
 1554 |         assert (
 1555 |             remaining % dim == 0
 1556 |         ), f"Expected dimension '{dim}' to divide remaining size '{remaining}'"
                                                                          ^
 1557 |         rnumels[prefix] = dim
 1558 |         remaining //= dim

torch/_inductor/runtime/triton_heuristics.py:1556:77: Builtin `set` is deprecated
 1554 |         assert (
 1555 |             remaining % dim == 0
 1556 |         ), f"Expected dimension '{dim}' to divide remaining size '{remaining}'"
                                                                                    ^
 1557 |         rnumels[prefix] = dim
 1558 |         remaining //= dim

torch/_inductor/runtime/triton_heuristics.py:1564:38: Builtin `set` is deprecated
 1562 |     assert (
 1563 |         r == final_numel
 1564 |     ), f"Expected ND reduction size ({rnumels}) to have {r} elements."
                                             ^
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels

torch/_inductor/runtime/triton_heuristics.py:1564:46: Builtin `set` is deprecated
 1562 |     assert (
 1563 |         r == final_numel
 1564 |     ), f"Expected ND reduction size ({rnumels}) to have {r} elements."
                                                     ^
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels

torch/_inductor/runtime/triton_heuristics.py:1564:57: Builtin `set` is deprecated
 1562 |     assert (
 1563 |         r == final_numel
 1564 |     ), f"Expected ND reduction size ({rnumels}) to have {r} elements."
                                                                ^
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels

torch/_inductor/runtime/triton_heuristics.py:1564:59: Builtin `set` is deprecated
 1562 |     assert (
 1563 |         r == final_numel
 1564 |     ), f"Expected ND reduction size ({rnumels}) to have {r} elements."
                                                                  ^
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels

torch/_inductor/runtime/triton_heuristics.py:1567:37: Builtin `set` is deprecated
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels
 1567 |     ), f"rnumels exceed size_hints. {rnumels} > {size_hints}"
                                            ^
 1568 |
 1569 |     return rnumels

torch/_inductor/runtime/triton_heuristics.py:1567:45: Builtin `set` is deprecated
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels
 1567 |     ), f"rnumels exceed size_hints. {rnumels} > {size_hints}"
                                                    ^
 1568 |
 1569 |     return rnumels

torch/_inductor/runtime/triton_heuristics.py:1567:49: Builtin `set` is deprecated
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels
 1567 |     ), f"rnumels exceed size_hints. {rnumels} > {size_hints}"
                                                        ^
 1568 |
 1569 |     return rnumels

torch/_inductor/runtime/triton_heuristics.py:1567:60: Builtin `set` is deprecated
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels
 1567 |     ), f"rnumels exceed size_hints. {rnumels} > {size_hints}"
                                                                   ^
 1568 |
 1569 |     return rnumels

torch/_inductor/runtime/triton_heuristics.py:1746:49: Builtin `set` is deprecated
 1744 |
 1745 |     if not configs:
 1746 |         raise NotImplementedError(f"size_hints: {size_hints}")
                                                        ^
 1747 |     return cached_autotune(
 1748 |         size_hints,

torch/_inductor/runtime/triton_heuristics.py:1746:60: Builtin `set` is deprecated
 1744 |
 1745 |     if not configs:
 1746 |         raise NotImplementedError(f"size_hints: {size_hints}")
                                                                   ^
 1747 |     return cached_autotune(
 1748 |         size_hints,

torch/_inductor/runtime/triton_heuristics.py:1928:32: Builtin `set` is deprecated
 1926 |         for prefix in size_hints:
 1927 |             if prefix_is_reduction(prefix):
 1928 |                 c.kwargs.pop(f"{prefix.upper()}BLOCK")
                                       ^
 1929 |
 1930 |     if disable_pointwise_autotuning(inductor_meta):

torch/_inductor/runtime/triton_heuristics.py:1928:47: Builtin `set` is deprecated
 1926 |         for prefix in size_hints:
 1927 |             if prefix_is_reduction(prefix):
 1928 |                 c.kwargs.pop(f"{prefix.upper()}BLOCK")
                                                      ^
 1929 |
 1930 |     if disable_pointwise_autotuning(inductor_meta):

torch/_inductor/runtime/triton_heuristics.py:1975:49: Builtin `set` is deprecated
 1973 |     assert triton_meta is not None
 1974 |     if len(size_hints) != 2:
 1975 |         raise NotImplementedError(f"size_hints: {size_hints}")
                                                        ^
 1976 |
 1977 |     configs = _reduction_configs(size_hints=size_hints, inductor_meta=inductor_meta)

torch/_inductor/runtime/triton_heuristics.py:1975:60: Builtin `set` is deprecated
 1973 |     assert triton_meta is not None
 1974 |     if len(size_hints) != 2:
 1975 |         raise NotImplementedError(f"size_hints: {size_hints}")
                                                                   ^
 1976 |
 1977 |     configs = _reduction_configs(size_hints=size_hints, inductor_meta=inductor_meta)

torch/_inductor/runtime/triton_heuristics.py:2082:56: Builtin `set` is deprecated
 2080 |         xnumel, ynumel, znumel = numels[2], numels[1], numels[0]
 2081 |     else:
 2082 |         raise AssertionError(f"invalid size for numels {len(numels)}")
                                                               ^
 2083 |
 2084 |     def get_grid_dim(numel, block):

torch/_inductor/runtime/triton_heuristics.py:2082:68: Builtin `set` is deprecated
 2080 |         xnumel, ynumel, znumel = numels[2], numels[1], numels[0]
 2081 |     else:
 2082 |         raise AssertionError(f"invalid size for numels {len(numels)}")
                                                                           ^
 2083 |
 2084 |     def get_grid_dim(numel, block):

torch/_inductor/runtime/triton_heuristics.py:2104:57: Builtin `set` is deprecated
 2102 |             torch._check(
 2103 |                 y_grid <= max_y_grid,
 2104 |                 lambda: f"Generated y grid beyond 2^16 ({y_grid}) not supported with z dimension present. File issue",
                                                                ^
 2105 |             )
 2106 |

torch/_inductor/runtime/triton_heuristics.py:2104:64: Builtin `set` is deprecated
 2102 |             torch._check(
 2103 |                 y_grid <= max_y_grid,
 2104 |                 lambda: f"Generated y grid beyond 2^16 ({y_grid}) not supported with z dimension present. File issue",
                                                                       ^
 2105 |             )
 2106 |

torch/_inductor/runtime/triton_heuristics.py:2113:43: Builtin `set` is deprecated
 2111 |         )
 2112 |
 2113 |     setattr(grid_fn, "grid_fn_str", f"grid{numels}")  # noqa: B010
                                                  ^
 2114 |
 2115 |     return grid_fn

torch/_inductor/runtime/triton_heuristics.py:2113:50: Builtin `set` is deprecated
 2111 |         )
 2112 |
 2113 |     setattr(grid_fn, "grid_fn_str", f"grid{numels}")  # noqa: B010
                                                         ^
 2114 |
 2115 |     return grid_fn

torch/_inductor/runtime/triton_heuristics.py:2122:48: Builtin `set` is deprecated
 2120 |         return (meta["RSPLIT"], ceildiv(xnumel, meta.get("XBLOCK", 1)), 1)
 2121 |
 2122 |     grid_fn_str = f"cooperative_reduction_grid({xnumel})"
                                                       ^
 2123 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2124 |     return grid_fn

torch/_inductor/runtime/triton_heuristics.py:2122:55: Builtin `set` is deprecated
 2120 |         return (meta["RSPLIT"], ceildiv(xnumel, meta.get("XBLOCK", 1)), 1)
 2121 |
 2122 |     grid_fn_str = f"cooperative_reduction_grid({xnumel})"
                                                              ^
 2123 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2124 |     return grid_fn

torch/_inductor/runtime/triton_heuristics.py:2135:54: Builtin `set` is deprecated
 2133 |     coop_grid = cooperative_reduction_grid(xnumel)
 2134 |     normal_grid = grid(xnumel)
 2135 |     grid_fn_str = f"maybe_cooperative_reduction_grid({xnumel})"
                                                             ^
 2136 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2137 |     return grid_fn

torch/_inductor/runtime/triton_heuristics.py:2135:61: Builtin `set` is deprecated
 2133 |     coop_grid = cooperative_reduction_grid(xnumel)
 2134 |     normal_grid = grid(xnumel)
 2135 |     grid_fn_str = f"maybe_cooperative_reduction_grid({xnumel})"
                                                                    ^
 2136 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2137 |     return grid_fn

torch/_inductor/runtime/triton_heuristics.py:2145:37: Builtin `set` is deprecated
 2143 |         return (ceildiv(rnumel, meta.get("R0_BLOCK", 1)), xnumel, 1)
 2144 |
 2145 |     grid_fn_str = f"split_scan_grid({xnumel}, {rnumel})"
                                            ^
 2146 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2147 |

torch/_inductor/runtime/triton_heuristics.py:2145:44: Builtin `set` is deprecated
 2143 |         return (ceildiv(rnumel, meta.get("R0_BLOCK", 1)), xnumel, 1)
 2144 |
 2145 |     grid_fn_str = f"split_scan_grid({xnumel}, {rnumel})"
                                                   ^
 2146 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2147 |

torch/_inductor/runtime/triton_heuristics.py:2145:47: Builtin `set` is deprecated
 2143 |         return (ceildiv(rnumel, meta.get("R0_BLOCK", 1)), xnumel, 1)
 2144 |
 2145 |     grid_fn_str = f"split_scan_grid({xnumel}, {rnumel})"
                                                      ^
 2146 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2147 |

torch/_inductor/runtime/triton_heuristics.py:2145:54: Builtin `set` is deprecated
 2143 |         return (ceildiv(rnumel, meta.get("R0_BLOCK", 1)), xnumel, 1)
 2144 |
 2145 |     grid_fn_str = f"split_scan_grid({xnumel}, {rnumel})"
                                                             ^
 2146 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2147 |

torch/_inductor/runtime/triton_heuristics.py:2173:42: Builtin `set` is deprecated
 2171 |             assert (
 2172 |                 min_blocks_d is None or min_blocks == min_blocks_d
 2173 |             ), f"inconsistent min_blocks {min_blocks} vs  x grid {numels[-1]}"
                                                 ^
 2174 |     else:
 2175 |         # sequential dispatch

torch/_inductor/runtime/triton_heuristics.py:2173:53: Builtin `set` is deprecated
 2171 |             assert (
 2172 |                 min_blocks_d is None or min_blocks == min_blocks_d
 2173 |             ), f"inconsistent min_blocks {min_blocks} vs  x grid {numels[-1]}"
                                                            ^
 2174 |     else:
 2175 |         # sequential dispatch

torch/_inductor/runtime/triton_heuristics.py:2173:66: Builtin `set` is deprecated
 2171 |             assert (
 2172 |                 min_blocks_d is None or min_blocks == min_blocks_d
 2173 |             ), f"inconsistent min_blocks {min_blocks} vs  x grid {numels[-1]}"
                                                                         ^
 2174 |     else:
 2175 |         # sequential dispatch

torch/_inductor/runtime/triton_heuristics.py:2173:77: Builtin `set` is deprecated
 2171 |             assert (
 2172 |                 min_blocks_d is None or min_blocks == min_blocks_d
 2173 |             ), f"inconsistent min_blocks {min_blocks} vs  x grid {numels[-1]}"
                                                                                    ^
 2174 |     else:
 2175 |         # sequential dispatch
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143628
Approved by: https://github.com/yanboliang, https://github.com/rec
2024-12-20 11:45:26 +00:00
6733045a4a export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

Unit test to reproduce:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-20 11:42:09 +00:00
b539c61631 [Hierarchical Compile] Update NoneAsConstantBuffer to support graph d… (#143531)
Fixes issues I hit while running graph deduplication with torch tune.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143531
Approved by: https://github.com/eellison
2024-12-20 09:23:12 +00:00
f9f82ca48f [ts converter] use Dim.AUTO for ts -> export converter (#138273)
Switches the TS converter to use `Dim.AUTO` by default, exporting models with maximum dynamism. Adds runtime input tests to `test_converter.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138273
Approved by: https://github.com/avikchaudhuri
2024-12-20 07:48:24 +00:00
270ad513c8 [Dynamo] only import einops if version is lower than 0.7.0 (#142847)
Fixes internal xref (https://fb.workplace.com/groups/257735836456307/posts/804793021750583/?comment_id=805229281706957&reply_comment_id=805232695039949)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142847
Approved by: https://github.com/zou3519
2024-12-20 07:46:49 +00:00
29b586bbad fix formatting in programming model doc (#143587)
Test Plan: Some of the formatting in https://docs-preview.pytorch.org/pytorch/pytorch/143546/export.programming_model.html is broken.

Differential Revision: D67458972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143587
Approved by: https://github.com/yushangdi
2024-12-20 07:09:19 +00:00
fe0f20615c [DynamoBench] Handle accuracy results in benchmark records (#143611)
I discovered this issue when I searched the database for the accuracy results and couldn't find any. It turns out that the results are present in the JSON file, for example `"metric": {"name": "accuracy", "benchmark_values": ["pass_due_to_skip"]}`, but inserting them into the database fails because the benchmark values here are a list of strings, while a list of numbers is expected.

ClickHouse doesn't support mixed types at the moment. It has a Variant type (https://clickhouse.com/docs/en/sql-reference/data-types/variant), but the ClickHouse team itself doesn't recommend it. So the remaining option is to store these values in the `extra_info` field. This field is a dictionary, so they can go there.
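
A minimal sketch of the workaround described above, assuming a record shaped like the JSON example. The helper names, the placement of `extra_info` inside `metric`, and the `benchmark_values` key used inside `extra_info` are all hypothetical and not taken from this PR:

```python
from numbers import Number


def _is_numeric(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Number) and not isinstance(value, bool)


def move_non_numeric_values(record):
    """Return a copy of record with non-numeric benchmark values moved to extra_info."""
    metric = dict(record["metric"])
    values = metric.get("benchmark_values", [])

    # Keep only numbers in the column that expects a list of numbers.
    metric["benchmark_values"] = [v for v in values if _is_numeric(v)]

    # Everything else (e.g. "pass_due_to_skip") goes into the dictionary field.
    extra_info = dict(metric.get("extra_info", {}))
    non_numeric = [v for v in values if not _is_numeric(v)]
    if non_numeric:
        extra_info["benchmark_values"] = non_numeric
    metric["extra_info"] = extra_info

    return {**record, "metric": metric}


record = {"metric": {"name": "accuracy", "benchmark_values": ["pass_due_to_skip"]}}
cleaned = move_non_numeric_values(record)
```

The input record is left unmodified, so the original JSON can still be kept or logged as-is.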

### Testing

https://github.com/pytorch/pytorch/actions/runs/12421747715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143611
Approved by: https://github.com/kit1980
2024-12-20 06:43:38 +00:00
132fcf4e0d [user triton] Raise an exception when encountering nested @triton.autotune decorators or @triton.heuristics (#143519)
We support running a single Autotuner for each Triton kernel. Currently,
if there are multiple autotuning decorators, the subsequent ones will be
silently ignored.

Instead, we should raise an error here to avoid silent incorrectness.
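
A rough sketch of the idea, not the code from this PR. `Autotuner` and `Heuristics` below are stand-in classes that keep the wrapped callable in a `.fn` attribute, `unwrap_single_autotuner` is a hypothetical helper, and the policy that any second wrapper of either kind raises is an assumption:

```python
class Autotuner:
    """Stand-in for an autotuning wrapper; stores the wrapped callable in .fn."""

    def __init__(self, fn):
        self.fn = fn


class Heuristics:
    """Stand-in for a heuristics wrapper; stores the wrapped callable in .fn."""

    def __init__(self, fn):
        self.fn = fn


def unwrap_single_autotuner(kernel):
    """Return the innermost callable, raising if more than one wrapper is stacked."""
    wrappers = 0
    current = kernel
    while isinstance(current, (Autotuner, Heuristics)):
        wrappers += 1
        if wrappers > 1:
            # Fail loudly instead of silently ignoring the extra decorators.
            raise RuntimeError("nested autotuning decorators are not supported")
        current = current.fn
    return current
```

Raising at unwrap time means the user sees the problem where the kernel is set up, rather than getting results tuned by only one of the decorators.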

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143519
Approved by: https://github.com/aakhundov
2024-12-20 06:38:45 +00:00
71479a9b9c Revert "[AOTI] Emit a CMakeLists.txt when package_cpp_only (#143352)"
This reverts commit 429f4cd1408b11a7b0dd10634b46b3265dc31af1.

Reverted https://github.com/pytorch/pytorch/pull/143352 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the new test is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/143352#issuecomment-2556365140))
2024-12-20 06:21:31 +00:00
4e29e4aa63 [BE] Add a test to ensure grads are never inplaced into accidentally (#143612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143612
Approved by: https://github.com/soulitzer
2024-12-20 06:15:08 +00:00
2daa666591 update kineto to XPU Windows fixed PR. [submodule kineto] (#143445)
Include XPU Windows Fixed PR: https://github.com/pytorch/kineto/pull/1012

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143445
Approved by: https://github.com/sraikund16
2024-12-20 05:57:30 +00:00
217a4ddb04 Add range check embedding_bag on input index >= 0 of cuda device (#140791)
Fixes #89362

**Test Result**

**Before**

```python
>>> import torch
>>> input = torch.randint(-5, 1, [1, 2], dtype=torch.int64).cuda()
>>> weight = torch.rand([2, 3], dtype=torch.float32).cuda()
>>> print(torch.nn.functional.embedding_bag(input, weight))
tensor([[0., 0., 0.]], device='cuda:0')
```

**After**

```python
>>> import torch
>>> input = torch.randint(-5, 1, [1, 2], dtype=torch.int64).cuda()
>>> weight = torch.rand([2, 3], dtype=torch.float32).cuda()
>>> print(torch.nn.functional.embedding_bag(input, weight))
/home/zong/code/pytorch/aten/src/ATen/native/cuda/EmbeddingBag.cu:141: EmbeddingBag_updateOutputKernel_sum_mean: block: [0,0,0], thread: [0,0,0] Assertion `0 <= input_idx && input_idx < numRows` failed.
/home/zong/code/pytorch/aten/src/ATen/native/cuda/EmbeddingBag.cu:141: EmbeddingBag_updateOutputKernel_sum_mean: block: [0,0,0], thread: [1,0,0] Assertion `0 <= input_idx && input_idx < numRows` failed.
/home/zong/code/pytorch/aten/src/ATen/native/cuda/EmbeddingBag.cu:141: EmbeddingBag_updateOutputKernel_sum_mean: block: [0,0,0], thread: [2,0,0] Assertion `0 <= input_idx && input_idx < numRows` failed.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/_tensor.py", line 568, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_tensor_str.py", line 708, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_tensor_str.py", line 625, in _str_intern
    tensor_str = _tensor_str(self, indent)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_tensor_str.py", line 357, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_tensor_str.py", line 146, in __init__
    tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

```
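
The condition in the device-side assertion above, `0 <= input_idx && input_idx < numRows`, can be mirrored on the host to catch bad indices before the kernel launches. This is only an illustrative sketch in plain Python; `check_embedding_indices` is a hypothetical helper and not part of PyTorch:

```python
def check_embedding_indices(indices, num_rows):
    """Raise IndexError unless every index satisfies 0 <= index < num_rows."""
    for position, index in enumerate(indices):
        if not 0 <= index < num_rows:
            raise IndexError(
                f"index {index} at position {position} is out of range "
                f"for a weight with {num_rows} rows"
            )


# The weight in the example above has 2 rows, so valid indices are 0 and 1.
check_embedding_indices([0, 1], num_rows=2)  # passes silently
```

Negative indices such as those produced by `torch.randint(-5, 1, ...)` in the example would fail this check with a Python exception instead of a device-side assert.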

```bash
$ pytest test/nn/test_embedding.py
```
![image](https://github.com/user-attachments/assets/6a5ec759-a3dc-4d51-9e5e-ec79c0aac526)

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/2ce4ac24-74fb-4181-9510-18b96a2c2acb)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140791
Approved by: https://github.com/eqy
2024-12-20 05:47:26 +00:00
9713a6eeca remove allow-untyped-defs from torch/fx/experimental/refinement_types.py (#143602)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143602
Approved by: https://github.com/aorenste
2024-12-20 05:40:52 +00:00
78d294379a remove allow-untyped-defs from torch/_lazy/config.py (#143603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143603
Approved by: https://github.com/aorenste
2024-12-20 05:34:19 +00:00
cb4e9888df remove allow-untyped-defs from torch/ao/quantization/experimental/APoT_tensor.py (#143601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143601
Approved by: https://github.com/aorenste
2024-12-20 05:26:09 +00:00
dd346dbeab remove allow-untyped-defs from torch/distributed/elastic/multiprocessing/errors/handlers.py (#143605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143605
Approved by: https://github.com/aorenste
2024-12-20 05:25:01 +00:00
fd23cf5848 [Dynamo] check node class first for graph dedup (#143609)
as title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143609
Approved by: https://github.com/williamwen42
2024-12-20 04:09:46 +00:00
1c2593f035 [dynamo] guard global autocast state (#143592)
Fixes https://github.com/pytorch/pytorch/issues/112260.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143592
Approved by: https://github.com/jansel
2024-12-20 03:30:54 +00:00
d339f1506b Add cutlass version guard in prep for upgrade (#143551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143551
Approved by: https://github.com/eqy
2024-12-20 02:40:02 +00:00
75661f2036 try root fix for FP8 tensor (#143248)
Fixes #143194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143248
Approved by: https://github.com/fegin
2024-12-20 01:57:17 +00:00
4462cc6375 Revert "[Inductor] inplace padding (#140249)"
This reverts commit 297ce776363cc4802fa74d210fced2b4128960d5.

Reverted https://github.com/pytorch/pytorch/pull/140249 on behalf of https://github.com/huydhn due to This break an internal test https://fburl.com/test/ppl2we5l ([comment](https://github.com/pytorch/pytorch/pull/140249#issuecomment-2556079406))
2024-12-20 01:30:27 +00:00
e1b4635504 remove allow-untyped-defs from torch/distributed/pipelining/_debug.py (#143606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143606
Approved by: https://github.com/aorenste
2024-12-20 01:26:51 +00:00
a0cff096bc Improve cond error messaging (#143595)
Discovered by @drisspg and me while trying out a simple toy example and being way too confused :')

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143595
Approved by: https://github.com/zou3519, https://github.com/ydwu4
2024-12-20 01:19:20 +00:00
d547fae5b0 [Codemod][AddExplicitStrictExportArg] caffe2/torch/onnx/_internal/exporter (#143542)
Reviewed By: avikchaudhuri

Differential Revision: D67381244

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143542
Approved by: https://github.com/ydwu4, https://github.com/titaiwangms
2024-12-20 00:54:52 +00:00
544de4008e [Inductor] Constrain the shape of other tensor for Conv/Linear + broadcast add fusion. (#141759)
Fix https://github.com/pytorch/pytorch/issues/141671.

Summary:
The performance regression in these two timm_models is caused by the Conv/Linear + broadcast add fusion running into the oneDNN ref path. This PR constrains the shape of the other tensor for the Conv/Linear + broadcast add fusion to fix the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141759
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-12-20 00:35:58 +00:00
8136daff5a Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)"
This reverts commit 4b82251011f85f9d1395b451d61e976af844d9b1.

Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks lots of internal build ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2555953189))
2024-12-19 23:33:17 +00:00
145fd5bad0 Revert "[Dynamo] only import einops if version is lower than 0.7.0 (#142847)"
This reverts commit a96387a481633389a6b5a5ac7b8406e9216f320e.

Reverted https://github.com/pytorch/pytorch/pull/142847 on behalf of https://github.com/huydhn due to This has been reverted internally D67436053 ([comment](https://github.com/pytorch/pytorch/pull/142847#issuecomment-2555942351))
2024-12-19 23:22:44 +00:00
d2b83aa122 add grad_output shape check for fractional_max_pool2d_backward (#141666)
Fix https://github.com/pytorch/pytorch/issues/141102.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141666
Approved by: https://github.com/mingfeima, https://github.com/malfet
2024-12-19 22:47:02 +00:00
2def1f6f74 [caffe2] Move vectorized templates into a separate file for box_cox operator (#143556)
Summary: No functional changes in this diff, the code is moved into a separate file to be reused by avx512 version in the follow up diff.

Test Plan: buck build //caffe2/caffe2/perfkernels:perfkernels

Differential Revision: D67433115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143556
Approved by: https://github.com/hl475
2024-12-19 22:02:23 +00:00
429f4cd140 [AOTI] Emit a CMakeLists.txt when package_cpp_only (#143352)
Summary: Emit a CMakeLists.txt with compile and link options when package_cpp_only is specified. After unzipping AOTI generated .pt2 package file, user can manually build the generated model code in their local environment.

Differential Revision: [D67458526](https://our.internmc.facebook.com/intern/diff/D67458526)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143352
Approved by: https://github.com/malfet
2024-12-19 22:01:05 +00:00
e9bd74d763 Revert "[export] don't decompose custom triton op when exporting (#142426)"
This reverts commit 10b9c5944e8d6ff0685e1ef25277a1d3c4c9c5aa.

Reverted https://github.com/pytorch/pytorch/pull/142426 on behalf of https://github.com/huydhn due to This fails one internal MTIA test, checking with the author that we need to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/142426#issuecomment-2555793496))
2024-12-19 21:21:38 +00:00
fc03c62c56 Unbacked SymInt fixes for subclasses + data-dependent slice() bounds (#142062)
Related: #125914 (specifically see [comment](https://github.com/pytorch/pytorch/issues/125914#issuecomment-2513044125))

This PR addresses two broken things involving the usage of unbacked SymInts for calls to `slice()` with data-dependent bounds. These issues are encountered in practice for `narrow()` operating on the batch dim with an NJT input, but apply to other subclasses as well. The test in this PR uses a purpose-built subclass.

There are two different issues here, depending on whether `torch.compile()` is called with `dynamic=True`. In practice, these only occur when the unbacked SymInts are created within the torch_dispatch implementation of a subclass, because the unbacked symbols are considered "freshly created" when the output subclass instance is handled in Dynamo.

**Error 1 (dynamic=False):**
```
LoweringException: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(-Min(22, Max(0, u0)) + Min(22, Max(u0 + u1, Max(0, u0))), 0) (unhinted: Eq(-Min(s0, Max(0, u0)) + Min(s0, Max(u0 + u1, Max(0, u0))), 0)).  (Size-like symbols: u1, u0)
```

The expression comes from the use of `clamp()` logic for `SliceView` in Inductor:
41e59754b4/torch/_inductor/ir.py (L3014)

If the (start, end) bounds for the `slice()` are statically known to be in range for the given dim (e.g. provided via `torch._check()` calls), we can avoid this `clamp()` logic and the error. This PR implements this fix.

**Error 2 (dynamic=True):**
```
torch._dynamo.exc.InternalTorchDynamoError: PendingUnbackedSymbolNotFound: Pending unbacked symbols {u0} not in returned outputs NestedTensor(size=(2, s16, s1), offsets=FakeTensor(..., device='cuda:0', size=(3,), dtype=torch.int64), grad_fn=<NarrowBackwardAutogradNestedTensor0 object at 0x7f1f8603cfd0>, contiguous=True) ((s1*s16, s1, 1), s1*u0)
```

The storage offset of the values component of the returned NJT is `s1*u0` where `s1` is known to be an integer. This PR expands the special logic handling the `constant * u0` case to handle SymInts as well:
314e08eb52/torch/fx/experimental/symbolic_shapes.py (L1013-L1031)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142062
Approved by: https://github.com/ezyang
ghstack dependencies: #143526
2024-12-19 21:08:04 +00:00
0b2c47962c Add support for differentiable LR in SGD + test v2.0 (#143510)
Second PR in a larger project to broaden support for differentiable optimizers with @janeyx99! The first one had an issue near the end, so this is the second PR on that subject. See #143122 for the development up until this point.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143510
Approved by: https://github.com/janeyx99
2024-12-19 21:04:44 +00:00
629de4da60 [dynamo] Add a lint rule to restrict what 3P library one can import (#143312)
As title, this patch prevents developers from importing third party
libraries to patch things in Dynamo, unless there's no other easy
workaround (in which case one would add the library to the allowlist in
`import_linter.py`, as instructed by the lint error).

For instance, if we remove `einops` from the allowlist, we'd get this
```verbatim
>>> Lint for torch/_dynamo/decorators.py:

  Error (IMPORT) Disallowed import

    importing from einops is not allowed, if you believe there's a valid
    reason, please add it to import_linter.py

        608  |# Note: this carefully avoids eagerly import einops.
        609  |# TODO: we should delete this whole _allow_in_graph_einops logic by approximately 2024 Q2
        610  |def _allow_in_graph_einops():
    >>> 611  |    import einops
        612  |
        613  |    try:
        614  |        # requires einops > 0.6.1, torch >= 2.0

  Error (IMPORT) Disallowed import

    importing from einops is not allowed, if you believe there's a valid
    reason, please add it to import_linter.py

        612  |
        613  |    try:
        614  |        # requires einops > 0.6.1, torch >= 2.0
    >>> 615  |        from einops._torch_specific import (  # type: ignore[attr-defined]  # noqa: F401
        616  |            _ops_were_registered_in_torchdynamo,
        617  |        )
        618  |
```
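
The kind of check described above can be sketched with the standard `ast` module. This is only an illustration under stated assumptions: the helper name `find_disallowed_imports` and the allowlist contents are hypothetical and are not taken from `import_linter.py`.

```python
import ast


def find_disallowed_imports(source, allowed):
    """Return sorted (lineno, top_level_module) pairs for imports not in `allowed`.

    Hypothetical helper for illustration; not the actual lint implementation.
    """
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Relative imports (level > 0) stay inside the package and are skipped.
            modules = [node.module]
        else:
            continue
        for module in modules:
            top_level = module.split(".")[0]
            if top_level not in allowed:
                violations.append((node.lineno, top_level))
    return sorted(violations)


example = (
    "import torch\n"
    "\n"
    "def _allow_in_graph_einops():\n"
    "    import einops\n"
    "    from einops._torch_specific import _ops_were_registered_in_torchdynamo\n"
)

# Assumed allowlist with einops removed, mirroring the scenario above.
print(find_disallowed_imports(example, allowed={"torch"}))
# [(4, 'einops'), (5, 'einops')]
```

Imports nested inside functions are found as well, because `ast.walk` visits every node of the module.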
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143312
Approved by: https://github.com/zou3519
2024-12-19 20:59:16 +00:00
8e78345d69 remove allow-untyped-defs from distributed/tensor/experimental/__init__.py (#143583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143583
Approved by: https://github.com/awgu
2024-12-19 20:25:28 +00:00
0a7dba4978 [cond] Change Autograd for cond (#142518)
Instead of returning None for unused variables, a tensor with all-zeros is returned.
Fixes [141301](https://github.com/pytorch/pytorch/issues/141301)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142518
Approved by: https://github.com/ydwu4
2024-12-19 20:09:42 +00:00
8850a7b62c add some logging for tensorify (#143391)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143391
Approved by: https://github.com/jamesjwu
2024-12-19 20:06:26 +00:00
25172dc075 remove allow-untyped-defs from torch/ao/quantization/experimental/fake_quantize_function.py (#143582)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143582
Approved by: https://github.com/XuehaiPan, https://github.com/laithsakka
2024-12-19 20:06:22 +00:00
2d150ad29f [ROCm] Fix unit test: matmul_offline_mgpu_tunableop (#143507)
Fixes #141652

This PR contains:

- Fix for `matmul_offline_mgpu_tunableop`
- Modifications to _checking_tuning_assertions to enable TunableOp if it is disabled. Also moved it into the concurrent futures initializer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143507
Approved by: https://github.com/jeffdaily
2024-12-19 19:48:20 +00:00
66172578f9 [ROCm] Guard triton backend call around cuda.is_available (#143570)
To resolve: https://github.com/pytorch/test-infra/issues/6082

Calling into Triton's get_backend_options will initialise CUDA and break CPU-only environments that may have hip installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143570
Approved by: https://github.com/atalman, https://github.com/jeffdaily
2024-12-19 19:46:13 +00:00
c46cfc245f [Dynamo] Support dict_keys from nested dict object (#143557)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143557
Approved by: https://github.com/williamwen42
ghstack dependencies: #143374, #143547
2024-12-19 19:02:55 +00:00
5fa287aa82 [Dynamo] Rename Dict{View/Keys/Values} to Dict{View/Keys/Values}Variable (#143547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143547
Approved by: https://github.com/williamwen42
ghstack dependencies: #143374
2024-12-19 19:02:55 +00:00
4b82251011 [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,  groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights , groupsize, and the in_features and out_features.
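
Step 1 (symmetric 4-bit quantization and packing two values per byte) can be sketched in plain Python. This is a minimal sketch under assumptions that the description does not confirm: the scale is taken as `max|w| / 7`, values are stored with an offset of 8, and the low nibble holds the first value of each pair. The function names are hypothetical; the real packing is done by `torch.ops.aten._dyn_quant_pack_4bit_weight`.

```python
def quantize_symmetric_4bit(weights):
    """Quantize floats to signed 4-bit integers in [-8, 7] with a single scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0  # assumed scale choice
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def pack_nibbles(quantized):
    """Pack pairs of signed 4-bit values into bytes (assumed: low nibble first)."""
    if len(quantized) % 2:
        raise ValueError("need an even number of 4-bit values")
    packed = bytearray()
    for low, high in zip(quantized[0::2], quantized[1::2]):
        # Shift [-8, 7] to [0, 15] so each value fits an unsigned nibble.
        packed.append(((high + 8) << 4) | (low + 8))
    return bytes(packed)


def unpack_nibbles(packed):
    """Inverse of pack_nibbles."""
    values = []
    for byte in packed:
        values.append((byte & 0xF) - 8)
        values.append((byte >> 4) - 8)
    return values


q, scale = quantize_symmetric_4bit([3.5, -1.0, 0.0, 0.5])
packed = pack_nibbles(q)
print(q, scale, packed.hex())  # [7, -2, 0, 1] 0.5 6f98
```

A group-wise scheme would apply the same idea with one scale per group of weights instead of one scale for the whole list.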

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf :
7B Transformer model:
Prefill : 340 t/s
Decode  : 40  t/s
2B Transformer model
Prefill : 747 t/s
Decode  : 80  t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-19 18:51:26 +00:00
c5ddf5dd90 Unbacked SymInt fixes for subclasses + data-dependent slice() bounds (non-dynamic) (#143526)
Lifted non-controversial (non-dynamic) fixes from #142062. See description there for context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143526
Approved by: https://github.com/ezyang
2024-12-19 18:46:36 +00:00
2a11472f46 update expected results (#143586)
update results based on small regression added by
17b71e5d6a

the max was 1.25%, for sum_floor_div
<img width="842" alt="Screenshot 2024-12-19 at 9 04 30 AM" src="https://github.com/user-attachments/assets/6ce913cd-110d-4837-af59-08fb6a0dd12d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143586
Approved by: https://github.com/bobrenjc93
2024-12-19 18:43:27 +00:00
e1e83015d2 [dynamo, 3.13t] raise error if torch.compile is attempted in 3.13t (nogil) (#143404)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143404
Approved by: https://github.com/colesbury, https://github.com/atalman
2024-12-19 18:10:01 +00:00
33c27be017 Workaround for gather_out in MPS backend (#135543)
Avoids an underlying issue in reshape op in MPS that gets triggered when the input has multiple dimensions but the shape can be squeezed into 1D. The underlying issue is going to get fixed eventually.

Fixes https://github.com/pytorch/pytorch/issues/135240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135543
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-19 18:01:01 +00:00
1433bad0e4 torch export programming model (#143546)
Differential Revision: [D67429743](https://our.internmc.facebook.com/intern/diff/D67429743/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143546
Approved by: https://github.com/ydwu4
2024-12-19 16:56:13 +00:00
61a835ec53 Corrected description of AMSGrad algorithm (#142351)
Fixes #142323

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142351
Approved by: https://github.com/janeyx99
2024-12-19 16:24:19 +00:00
171e6a934f Don't 1 specialize if stride is contiguous (#143365)
Fixes: https://github.com/pytorch/pytorch/issues/142024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143365
Approved by: https://github.com/ezyang
2024-12-19 15:22:47 +00:00
465f282a24 [reland][dynamo][guards] Consider tensors as immutable for dict tag matches (#141085)
Reland - https://github.com/pytorch/pytorch/pull/139560

As mentioned in https://github.com/pytorch/pytorch/pull/130341, using `static py::object` can lead to segfaults. I suspect this is the reason for the import system error seen internally (https://www.internalfb.com/sevmanager/view/469592). In this PR, I am removing the `static` part. This is fine, and also the right thing to do, because it catches the case where the user changes the flag in the same process when compiling two different functions.

Unfortunately, there is no easy way to trigger this segfault, so I can't write a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141085
Approved by: https://github.com/jansel

Co-authored-by: William Wen <williamwen@meta.com>
2024-12-19 15:16:10 +00:00
288aa87383 [Inductor][CPU] disable bernoulli_p decomposition (#143460)
Fix https://github.com/pytorch/pytorch/issues/142853
`fallback_random=True` should cause RNG to match between compile/eager (by having compile fall back to eager for RNG ops), but the `bernoulli_p` decompose function is not fully consistent with the eager CPU implementation.
We remove the decomp and keep the version for `fallback_random=False`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143460
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-12-19 11:21:35 +00:00
fd8b217fcd Pass allow_rhs_unbacked to the stride test in metadata test too (#143040)
Fixes https://github.com/pytorch/pytorch/issues/142410

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143040
Approved by: https://github.com/bobrenjc93
2024-12-19 09:37:50 +00:00
451c233936 leaking c++ singleton specifically (#143509)
Summary:
fix forward for S477887

leaking c++ singleton specifically

When C++ shuts down, it tries to destruct the singleton and acquire the GIL; at this moment the Python runtime has already exited, causing undefined behavior.
Leaking here specifically so that we won't try to destroy the singleton during the shutdown phase

Test Plan: n/a

Differential Revision: D67400633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143509
Approved by: https://github.com/c-p-i-o
2024-12-19 09:27:07 +00:00
da06d47bdb dynamo tracing perf: slight improvement on __instancecheck__: 47.77 -> 47.62 (#143064)
See #143056 for overall docs.

This PR: Switch out an `isinstance()` for an `is` in the very hot
`VariableTrackerMeta.__instancecheck__`.
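
The general idea can be sketched as follows. The classes here (`LazyMeta`, `Base`, `Lazy`, `Concrete`) are hypothetical stand-ins and not the Dynamo code; the sketch only shows an exact-type `is` test used inside a metaclass `__instancecheck__` where a nested `isinstance()` call could otherwise appear.

```python
class LazyMeta(type):
    def __instancecheck__(cls, instance):
        # Exact-type identity test; avoids a nested isinstance() call on a hot path.
        # Trade-off: subclasses of Lazy would not match this test.
        if type(instance) is Lazy and cls not in (Base, Lazy):
            instance = instance.realize()
        return type.__instancecheck__(cls, instance)


class Base(metaclass=LazyMeta):
    pass


class Concrete(Base):
    pass


class Lazy(Base):
    """Wrapper that builds the real object only when it is needed."""

    def __init__(self, factory):
        self._factory = factory

    def realize(self):
        return self._factory()


print(isinstance(Lazy(Concrete), Concrete))  # True: the wrapper is realized first
print(isinstance(Lazy(Base), Concrete))      # False: realizes to a plain Base
print(isinstance(Lazy(Concrete), Base))      # True: no realization needed
```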

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143064
Approved by: https://github.com/ezyang, https://github.com/jansel
2024-12-19 09:19:35 +00:00
a97c6a78a8 Upgrade submodule ideep for bf16f32 matmul changes (#143508)
This change will enable this PR #140159  to pick proper kernels in bf16 mode for SDPA layer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143508
Approved by: https://github.com/yanbing-j, https://github.com/jgong5
2024-12-19 06:49:16 +00:00
2ffdcab04c [Dynamo] Add DictKeySetVariable to capture dict_keys passed outside of compiled region (#143374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143374
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-12-19 06:39:27 +00:00
fa1a4a91e9 add batch_size check for max_pool2d_backward (#141657)
Fix https://github.com/pytorch/pytorch/issues/140923.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141657
Approved by: https://github.com/mingfeima, https://github.com/malfet
2024-12-19 06:01:41 +00:00
a7ba562ec8 [state dict] Change _load_model_state_dict to enable cpu_offload, accept 2 device type and optimize memory (#142845)
For the distributed state dict API [migration](https://github.com/pytorch/torchtune/pull/2138), make the following changes:
1. `load_from_full_model_state_dict` in TorchTune calls `set_model_state_dict` with an option controlling cpu_offload. Add cpu_offload to _load_model_state_dict to offload to CPU if the config is True.
2. Change the device check: lora_finetune might have 2 device types, so accept that as valid.
3. Some changes to optimize memory performance:
3.1 use `.detach().clone()` instead of a view directly
3.2 if local_state is not meta, copy `full_tensor[slices]` to `ret.to_local()`
4. add related unit tests

Memory performance calling from TorchTune with llama2/7B_full:
1. cpu_offload = True
<img width="555" alt="Screenshot 2024-12-18 at 1 36 47 PM" src="https://github.com/user-attachments/assets/429261f5-1107-4592-b295-de3944a2614b" />

2. cpu_offload = False
<img width="555" alt="Screenshot 2024-12-18 at 1 36 52 PM" src="https://github.com/user-attachments/assets/40bf281a-236a-4218-826b-b1192a10c806" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142845
Approved by: https://github.com/fegin
2024-12-19 05:06:41 +00:00
e4301aeaa5 [ODML] Make the ML feature provider thread safe (#143418)
Summary:
This PR is generated from a meta internal Diff, aiming to resolve a crash from a race condition on the dictionary.

Test Plan:

Build and run

Print out the count/name/value of the dictionary and see if the values are get/set/removed correctly.

Observe the print statement on app start within IG

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143418
Approved by: https://github.com/shoumikhin
2024-12-19 04:47:56 +00:00
bf44d5bfb5 [Inductor] move custom pre pass (#143458)
Fixes #143363.

Move `joint_custom_pre` pass after `remove_noop_ops`/`constant_folding`, in order to get the same behavior as `pattern_matcher`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143458
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-12-19 04:41:20 +00:00
deb1da15cc [foreach_map] Add foreach_map Adam impl to compiled optimizer tests (#143454)
Adds a foreach_map backed Adam to compiled optimizer tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143454
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-12-19 03:16:47 +00:00
19d8bbafb2 Update release matrix for 2.6 (#143538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143538
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2024-12-19 02:02:04 +00:00
14fe1f7190 Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)"
This reverts commit d3ff2d42c28a2c187cbedfd8f60b84a4dfa2d6bf.

Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/malfet due to This broke S390 builds, includes cpuinfo unconditionally ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2552560208))
2024-12-19 01:05:11 +00:00
2c48af568a [CUDA][64-bit indexing] Fix some existing problematic int64_t _ = blockIdx.* * blockDim.* code (#142010)
`grep` didn't surface any `blockIdx.z * blockDim.z` cases
```
git grep -l "int64_t.*=.*blockIdx.x \* blockDim.x.*" | xargs sed -i 's/int64_t \(.*\) = blockIdx.x \* blockDim.x + threadIdx.x;.*/int64_t \1 = ((int64_t) blockIdx.x) * blockDim.x + threadIdx.x;/g'
git grep -l "int64_t.*=.*blockIdx.x \* blockDim.x.*" | xargs sed -i 's/int64_t \(.*\) = threadIdx.x + blockIdx.x \* blockDim.x;.*/int64_t \1 = threadIdx.x + ((int64_t) blockIdx.x) * blockDim.x;/g'
git grep -l "int64_t.*=.*blockIdx.y \* blockDim.y.*" | xargs sed -i 's/int64_t \(.*\) = blockIdx.y \* blockDim.y + threadIdx.y;.*/int64_t \1 = ((int64_t) blockIdx.y) * blockDim.y + threadIdx.y;/g'
git grep -l "int64_t.*=.*blockIdx.y \* blockDim.y.*" | xargs sed -i 's/int64_t \(.*\) = threadIdx.y + blockIdx.y \* blockDim.y;.*/int64_t \1 = threadIdx.y + ((int64_t) blockIdx.y) * blockDim.y;/g'
git grep -l "int64_t.*=.*blockDim.x \* blockIdx.x.*" | xargs sed -i 's/int64_t \(.*\) = blockDim.x \* blockIdx.x + threadIdx.x;.*/int64_t \1 = ((int64_t) blockIdx.x) * blockDim.x + threadIdx.x;/g'
```
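
Why the cast matters: `blockIdx`, `blockDim` and `threadIdx` components are 32-bit unsigned values in CUDA, so the product is computed in 32 bits and can wrap before it is widened to `int64_t` on assignment. The following Python sketch models both forms; the function names are hypothetical and the masking only imitates unsigned 32-bit arithmetic.

```python
UINT32_MASK = 0xFFFFFFFF


def index_without_cast(block_idx, block_dim, thread_idx):
    """Model `int64_t i = blockIdx.x * blockDim.x + threadIdx.x;` (32-bit wraparound)."""
    return (block_idx * block_dim + thread_idx) & UINT32_MASK


def index_with_cast(block_idx, block_dim, thread_idx):
    """Model `int64_t i = ((int64_t) blockIdx.x) * blockDim.x + threadIdx.x;`."""
    return block_idx * block_dim + thread_idx


# 2**22 blocks of 1024 threads: the product is exactly 2**32 and wraps to 0.
print(index_without_cast(2**22, 1024, 5))  # 5
print(index_with_cast(2**22, 1024, 5))     # 4294967301
```

For small grids both forms agree, which is why the problem only shows up with very large launches.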

See also https://github.com/pytorch/pytorch/pull/141922/files#r1868262823 in #141999 141922

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142010
Approved by: https://github.com/ngimel
2024-12-19 00:55:11 +00:00
b4e0e3bfa3 Backout D66648013 (#143433)
Summary:
backing out https://www.internalfb.com/diff/D66648013 (see comments there for justification)

I will reland and disallow the bfloat16 atomics behavior on A100 because it causes a pretty significant performance regression.

Test Plan: This is a revert

Differential Revision: D67357485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143433
Approved by: https://github.com/davidberard98
2024-12-19 00:53:49 +00:00
5c3996cab2 [Dynamo] topologically sort duplicated graph regions (#143523)
Ensure regions are topologically sorted
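
As a generic illustration of topological ordering (the message does not say whether the ordering applies across regions or to nodes within a region), here is a minimal sketch using the standard `graphlib` module. The dependency data and the helper name `sort_topologically` are hypothetical.

```python
from graphlib import TopologicalSorter


def sort_topologically(depends_on):
    """Return names ordered so that each item follows everything it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical dependencies: each key depends on the items in its set.
deps = {
    "region_c": {"region_a", "region_b"},
    "region_b": {"region_a"},
    "region_a": set(),
}
order = sort_topologically(deps)
print(order)  # ['region_a', 'region_b', 'region_c']
```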

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143523
Approved by: https://github.com/williamwen42
2024-12-19 00:43:48 +00:00
55092e1ec5 [BE] Delete install sccache step from MacBB (#143512)
To the best of my knowledge, this step never executed, and there have been no macOS binary builds running on trunk for a while
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143512
Approved by: https://github.com/kit1980, https://github.com/atalman, https://github.com/seemethere
ghstack dependencies: #143395, #143511
2024-12-19 00:41:28 +00:00
5e172ea004 [BE] Get rid of malfet/checkout@silent-checkout (#143516)
Instead use `actions/checkout@v4` with `show-progress: false`. It's more verbose than the quiet option, but our logs are long anyway...

Partially addresses https://github.com/pytorch/pytorch/issues/143079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143516
Approved by: https://github.com/atalman, https://github.com/ZainRizvi, https://github.com/huydhn
2024-12-19 00:36:36 +00:00
f9da639950 [codemod] Fix a few unused-variable issues in pytorch (#143517)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143517
Approved by: https://github.com/mhorowitz
2024-12-19 00:18:08 +00:00
b23f11c529 [ONNX] Automatically convert dynamic_axes to dynamic_shapes with torch.export.Dim.AUTO (#143158)
With https://github.com/pytorch/pytorch/pull/133620 introducing Dim.AUTO, we can now automatically convert dynamic_axes to dynamic_shapes without specifying min and max. However, export can still crash when the same specs are shared between inputs, and there is no guarantee that the axes will be dynamic (see PR description).

~~Therefore, a~~ follow-up PR should create a post-processing ONNX side pass to ~~enable the missed dynamic axes~~ rename the dynamic shapes (s0,  s1, ...) to dynamic_axes (user setting names).

This PR does:
(1) Apply torch.export.Dim.AUTO to dynamic_axes when dynamic_shapes is not provided.
(2) Convert args/kwargs to tuple inputs, which follows the generated dynamic_shapes format to avoid errors during torch.export.export.
(3) Avoid KeyError in _rename_dynamic_shapes_with_model_inputs function.
(4) Add real world case of a HF model with kv_cache to test on ONNX exporter.
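
Item (1) can be sketched roughly as below. This is an assumption-laden illustration, not the exporter's code: `AUTO` is a string stand-in for `torch.export.Dim.AUTO` so the sketch runs without torch, and `dynamic_axes_to_dynamic_shapes` is a hypothetical helper name.

```python
AUTO = "Dim.AUTO"  # stand-in for torch.export.Dim.AUTO (assumption for this sketch)


def dynamic_axes_to_dynamic_shapes(dynamic_axes):
    """Map every axis listed in dynamic_axes to AUTO, keyed by input name."""
    dynamic_shapes = {}
    for name, axes in dynamic_axes.items():
        # dynamic_axes accepts either {axis: "label"} or a plain list of axes.
        axis_indices = axes.keys() if isinstance(axes, dict) else axes
        dynamic_shapes[name] = {axis: AUTO for axis in axis_indices}
    return dynamic_shapes


shapes = dynamic_axes_to_dynamic_shapes(
    {"input_ids": {0: "batch", 1: "sequence"}, "attention_mask": [0]}
)
print(shapes)
```

Note that the user-chosen axis labels are dropped in this conversion, which is consistent with the follow-up mentioned above for renaming the generated symbols back to the user's names.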
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143158
Approved by: https://github.com/xadupre, https://github.com/shubhambhokare1
2024-12-18 23:49:01 +00:00
15a7a0c37e Remove deprecated branch after capture_pre_autograd_graph fully migrate to training IR (#143228)
Summary:
as title

#buildall

Test Plan: CI

Differential Revision: D67222286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143228
Approved by: https://github.com/andrewor14
2024-12-18 23:30:45 +00:00
58627fb6bf [BE] Integrate 5 line build script into template (#143511)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143511
Approved by: https://github.com/kit1980, https://github.com/atalman, https://github.com/seemethere
ghstack dependencies: #143395
2024-12-18 23:27:09 +00:00
4eafbe5288 [Dynamo] Flatten slices during graph deduplication (#143522)
I encountered this issue while debugging torchtune - overall we need to make sure to not miss nodes that are slice arguments.
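
A minimal sketch of the idea, assuming a hypothetical `flatten_args` helper (the actual Dynamo change may rely on different utilities): when walking a node's arguments, a `slice` is expanded into its `start`, `stop` and `step` so that anything stored there is visited too.

```python
def flatten_args(args):
    """Yield leaf values from nested args, expanding slices into their parts."""
    for arg in args:
        if isinstance(arg, slice):
            # A slice can hold graph nodes as bounds, so look inside it.
            yield from flatten_args((arg.start, arg.stop, arg.step))
        elif isinstance(arg, (list, tuple)):
            yield from flatten_args(arg)
        else:
            yield arg


# Strings stand in for graph nodes here.
leaves = list(flatten_args((1, slice("node0", None, 2), [3, ("node1",)])))
print(leaves)  # [1, 'node0', None, 2, 3, 'node1']
```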

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143522
Approved by: https://github.com/williamwen42
2024-12-18 23:12:34 +00:00
5380407af5 [dynamo] Properly model root frame globals during inlining (#143447)
This patch updates `InliningInstructionTranslator.STORE_GLOBAL` to
properly check whether `self.f_globals` is the same as root frame
`f_globals`. See added comments for why this is important.
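
For background, the plain-Python behavior involved can be shown in a small sketch: a `global` store writes into the globals of the function that executes it, which need not be the globals of the outermost (root) caller. The two dictionaries below are hypothetical stand-ins for two modules; this illustrates Python semantics only, not Dynamo internals.

```python
callee_module_globals = {}
root_frame_globals = {}

# A function defined in one "module" that performs a global store.
exec(
    "def set_flag():\n"
    "    global FLAG\n"
    "    FLAG = True\n",
    callee_module_globals,
)

# A root function defined in a different "module" that calls it.
exec(
    "def root(fn):\n"
    "    fn()\n",
    root_frame_globals,
)

root_frame_globals["root"](callee_module_globals["set_flag"])

print("FLAG" in callee_module_globals)  # True
print("FLAG" in root_frame_globals)     # False
```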

Fixes #143425.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143447
Approved by: https://github.com/zou3519
2024-12-18 23:04:02 +00:00
d8c8ba2440 Fix unused Python variables in test/[e-z]* (#136964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964
Approved by: https://github.com/justinchuby, https://github.com/albanD
2024-12-18 23:02:30 +00:00
d298bd840f [dynamo] add two-point iter test (#143500)
Implements the last checkbox for https://github.com/pytorch/pytorch/issues/112532.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143500
Approved by: https://github.com/StrongerXi
2024-12-18 22:55:46 +00:00
d3ff2d42c2 [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,  groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights , groupsize, and the in_features and out_features.

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf :
7B Transformer model:
Prefill : 340 t/s
Decode  : 40  t/s
2B Transformer model
Prefill : 747 t/s
Decode  : 80  t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-18 22:30:07 +00:00
4717cd1ce9 Skip test_conv2d_linear_add_broadcast_shapes_cpu on fbcode (#143530)
Summary: The test is added by D67376995 and it is failing on fbcode

Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:mkldnn_pattern_matcher_cpu -- --exact 'caffe2/test/inductor:mkldnn_pattern_matcher_cpu - test_conv2d_linear_add_broadcast_shapes_cpu (caffe2.test.inductor.test_mkldnn_pattern_matcher.TestPatternMatcher)'`

Differential Revision: D67413687

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143530
Approved by: https://github.com/jansel
2024-12-18 22:08:08 +00:00
d4ed5941db Fix floating point literals in IRPrinter (#142119)
Fixes #114035
This is a recreation of #140002 with approval from its author. Original description:
>when v larger than 1e16, the format will be error. example: v is 1.2e17, the output is 1.2e17.f, it have two point '.'
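
The formatting issue can be illustrated with a small sketch. It assumes `%g`-style formatting as a rough stand-in for the printer's number formatting, handles finite values only, and uses a hypothetical `float_literal` helper; it is not the IRPrinter code.

```python
def float_literal(value):
    """Format a finite float as a C-style float literal (sketch, finite values only)."""
    text = "%g" % value
    if "." in text or "e" in text:
        # Already has a decimal point or exponent: appending ".f" would be malformed.
        return text + "f"
    return text + ".f"


print(float_literal(2.0))     # 2.f
print(float_literal(0.5))     # 0.5f
print(float_literal(1.2e17))  # 1.2e+17f
```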

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142119
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-12-18 21:59:48 +00:00
10b9c5944e [export] don't decompose custom triton op when exporting (#142426)
For torch.export (strict and non-strict), we don't do functional decomposition. Instead, we preserve the custom triton ops as custom ops. This is because we want the exported program to be high-level and serializable.

#### The alternative:
If we decompose the custom op to a functional hop and make it a node in the exported program, we need to figure out ways of serializing the hop and its arguments, which can be triton.jited python functions and triton dtypes. This is undesirable because:
- it can be tedious to maintain a layer that serializes the jited function (e.g. with a string) and dtypes.
- changes to triton or the serialization logic for triton arguments can be BC breaking
- exported program will expose the implementation detail (i.e. triton source code) for a specific backend (GPU) to users, which mixes levels of abstraction.

#### Future plans:
After this PR, in the short term, we expect users to have a separate aot_compile stage that compiles the exported program into a Cubin file **on the same machine where users call export**; this stage does autotuning, removes the triton dependency, and serves the model with the Cubin. This guarantees that triton changes won't break BC.

In the long term, we may export multiple cubins for the triton op directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142426
Approved by: https://github.com/zou3519
ghstack dependencies: #142425
2024-12-18 21:36:28 +00:00
1e201422ed [export] add is_exporting flag (#142425)
We added an is_exporting flag, available as torch.compiler.is_exporting. This comes in handy when we need special-case logic at the user level or the system level (e.g. higher up the stack).

In increasing-scope:
- `_is_fx_tracing` is set to True when we run under symbolic_trace or make_fx.
- `is_exporting` is set to True when we're doing strict or non-strict export, which internally has a step that calls make_fx and sets _is_fx_tracing to True.
- `is_compiling` is set to True when we're doing strict export, non-strict export, or torch.compile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142425
Approved by: https://github.com/avikchaudhuri
2024-12-18 21:36:28 +00:00
894d47b91b [ROCm] Fix unit test: matmul_offline_tunableop (#143322)
Fixes #137936

The PR contains:
* Fix for `matmul_offline_tunableop`
* Clean-up try-finally blocks in UTs that don't use environment variables (`test_validator_tunableop_rocm`, `test_minimum_tuning_iteration_tunableop`, `test_disable_tuning_tunableop`)
* Avoid the use of environment variables in `minimum_tuning_iteration_tunableop`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143322
Approved by: https://github.com/jeffdaily
2024-12-18 20:14:44 +00:00
cyy
255a977494 [1/N] Avoid const_cast (#143169)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143169
Approved by: https://github.com/albanD
2024-12-18 19:48:01 +00:00
f129bcb5a5 [BE] Refactor argument parsing into its own function (#143395)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143395
Approved by: https://github.com/kit1980, https://github.com/atalman, https://github.com/seemethere
2024-12-18 19:42:49 +00:00
8d4926e30a Fix unused variables in test/torch.py (#143399)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143399
Approved by: https://github.com/albanD
2024-12-18 17:57:24 +00:00
863e6e4567 Improve input dimensions check for reflection_pad1d, reflection_pad2d and reflection_pad3d (#141670)
Fix https://github.com/pytorch/pytorch/issues/141447.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141670
Approved by: https://github.com/mingfeima, https://github.com/malfet
2024-12-18 17:46:26 +00:00
b588a78ca3 add grad_output shape check for adaptive_max_pool2d_backward and adaptive_max_pool3d_backward (#141663)
Fix https://github.com/pytorch/pytorch/issues/141099, https://github.com/pytorch/pytorch/issues/141100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141663
Approved by: https://github.com/mingfeima, https://github.com/malfet
2024-12-18 17:44:27 +00:00
93e8e32708 Remove iOS folder (#143398)
This folder contains a tutorial that is not packaged with PyTorch; it is an example of how to use the now-deprecated Lite Interpreter.

People should be using ExecuTorch instead, and there's already good documentation on it all over our tutorials and main homepage.

Testing to see what breaks in CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143398
Approved by: https://github.com/albanD
2024-12-18 17:25:52 +00:00
ed9931e6ee Add tests for non divisible inputs for flex decoding (#143214)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143214
Approved by: https://github.com/drisspg
2024-12-18 16:32:45 +00:00
0e8013fc1c [AOTI] Fix a typo in cpp_builder.py (#143351)
Summary: passthough -> passthrough

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143351
Approved by: https://github.com/yushangdi, https://github.com/chenyang78
ghstack dependencies: #143350
2024-12-18 16:28:37 +00:00
a2092665a9 [AOTI] Refactor path operations in AotCodeCompiler (#143350)
Summary: Use safer pathlib operation instead of direct string manipulation; Update some path naming to make them more meaningful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143350
Approved by: https://github.com/yushangdi, https://github.com/chenyang78
2024-12-18 16:28:37 +00:00
24a18d76c8 [MPS] Use metal shaders for all view ops (#143375)
Before this PR, Metal shaders were used to scatter/gather 1-5 dimensional tensors.
This PR introduces generalized ones that can be used for any dimensionality and, as a result, gets rid of 700+ lines of complex and untested code that might not even work as expected.
The generalized gather shader looks as follows:
```metal
kernel void gather_kernel_n(uint linear_index           [[thread_position_in_grid]],
                            constant void * src_        [[buffer(0)]],
                            device void * dst_          [[buffer(1)]],
                            constant uint32_t * size    [[buffer(2)]],
                            constant uint32_t * stride  [[buffer(3)]],
                            constant uint32_t & numel   [[buffer(4)]],
                            constant int32_t & ndim     [[buffer(5)]]) {{
    if (linear_index >= numel) return;

    constant {0} * src = (constant {0} *)src_;
    device {1} * dst = (device {1} *)dst_;

    uint64_t src_offs = 0;
    auto src_idx = linear_index;
    for(int dim = ndim - 1; dim >= 0; --dim) {{
      src_offs += stride[dim] * (src_idx % size[dim]);
      src_idx /= size[dim];
    }}

    dst[linear_index] = cast<{1}>(src[src_offs]);
}}
```

Which, according to the following benchmark
```python
from timeit import default_timer

import torch
import torch.utils.cpp_extension
from torch.utils.benchmark import Measurement, Timer

t = Timer(
    stmt=f"y.copy_(x);torch.mps.synchronize()",
    setup=f"x=torch.rand(4, 5, 16, 64, 33, 24, dtype=torch.float32, device='mps')[:,:,:,:24,:24,];y=torch.empty(x.shape, device=x.device, dtype=x.dtype)",
    language="python", timer=default_timer
)
print(t.blocked_autorange())
```
is almost twice as fast as the previous implementation (i.e. on a MacBook M2 Pro it returns 2.9ms for the MPS version vs 1.5ms for the shader one).
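
The index arithmetic in the gather shader above can be mirrored in plain Python. This is a minimal sketch for illustration only: `gather_offset` and `gather` are hypothetical helper names, and a flat Python list stands in for the source buffer.

```python
def gather_offset(linear_index, size, stride):
    """Map a linear index over the view's shape to an element offset in the source."""
    offset = 0
    idx = linear_index
    # Walk dimensions from innermost to outermost, as the shader loop does.
    for dim in range(len(size) - 1, -1, -1):
        offset += stride[dim] * (idx % size[dim])
        idx //= size[dim]
    return offset


def gather(src, size, stride):
    """Materialize a strided view of `src` into a dense, row-major list."""
    numel = 1
    for s in size:
        numel *= s
    return [src[gather_offset(i, size, stride)] for i in range(numel)]


# A 3x4 row-major buffer; the view keeps only the first two columns.
src = list(range(12))
sliced = gather(src, size=(3, 2), stride=(4, 1))

# A 3x2 row-major buffer viewed as its 2x3 transpose.
transposed = gather(list(range(6)), size=(2, 3), stride=(1, 2))
```

Because the loop runs over `len(size)` dimensions, the same function covers any dimensionality, which is the point of the generalized kernel.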

On macOS Sequoia, [`gatherWithUpdatesTensor: indicesTensor:...`](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/gather(withupdatestensor:indicestensor:axis:batchdimensions:name:)?language=objc) crashes if invoked with a complex data type, as one can see by running the code below:
```swift
import Metal
import MetalPerformanceShadersGraph

func gatherComplexMPS(device: MTLDevice,
                inp_buf: MTLBuffer, idx_buf: MTLBuffer,
                out_buf: MTLBuffer,
                inp_elem: Int, upd_elem: Int) {
  let graph = MPSGraph()
  let inputPlaceholder = graph.placeholder(shape: [inp_elem as NSNumber], dataType: .complexFloat32, name: nil)
  let indicesPlaceholder = graph.placeholder(shape: [upd_elem as NSNumber], dataType: .int64, name: nil)
  let outNode = graph.gather(withUpdatesTensor: inputPlaceholder, indicesTensor: indicesPlaceholder, axis: 0, batchDimensions: 0, name: nil)
  let mpsInputBuffer = MPSGraphTensorData(inp_buf, shape: [inp_elem as NSNumber], dataType: .complexFloat32)
  let mpsIndicesBuffer = MPSGraphTensorData(idx_buf, shape: [upd_elem as NSNumber], dataType: .int64)
  let mpsOutputBuffer = MPSGraphTensorData(out_buf, shape: [inp_elem as NSNumber], dataType: .complexFloat32)
  guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") }
  graph.run(with: queue, feeds: [inputPlaceholder: mpsInputBuffer,
                               indicesPlaceholder: mpsIndicesBuffer ],
            targetOperations: nil, resultsDictionary: [outNode: mpsOutputBuffer])
}

func makeBufferWithValues<T>(device: MTLDevice, values: [T]) -> MTLBuffer {
  guard let buf = device.makeBuffer(length: values.count * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let buf_data = buf.contents().assumingMemoryBound(to: T.self)
  for i in 0..<values.count {
    buf_data[i] = values[i]
  }
  return buf
}

guard let device = MTLCopyAllDevices().first else { fatalError("No Metal device found") }
print("Using device \(device.name)")

let inp_buf = makeBufferWithValues(device: device, values: [1.0, 2.0 , 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
let idx_buf = makeBufferWithValues(device: device, values: [0, 1, 2, 3])
guard let out_buf = device.makeBuffer(length:8 * MemoryLayout<Float>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }

gatherComplexMPS(device: device, inp_buf: inp_buf, idx_buf: idx_buf, out_buf: out_buf, inp_elem: 4, upd_elem: 4)
```

Fixes https://github.com/pytorch/pytorch/issues/143140
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143375
Approved by: https://github.com/albanD
2024-12-18 16:15:46 +00:00
f47aac6bc2 Make Context to be Device-agnostic Step by Step (3/N) (#137578)
Detailed Descriptions:
- Using unified Device-agnostic API to create new generator for accelerator.
- Add deprecated info for GeneratorForPrivateuseone

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137578
Approved by: https://github.com/cyyever, https://github.com/ezyang
2024-12-18 15:12:19 +00:00
80a42399bb Various fix for memory leak in test autograd and dataloader (#143323)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143323
Approved by: https://github.com/andrewkho, https://github.com/soulitzer
ghstack dependencies: #143225
2024-12-18 13:56:59 +00:00
84b91ce4a1 remove allow-untyped-defs for torch/_inductor/test_operators.py (#143436)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143436
Approved by: https://github.com/aorenste
2024-12-18 12:54:25 +00:00
d8ea4ce631 [reland] Kill capture_pre_autograd_graph API (#143426)
Summary:
Delete the following API:

- capture_pre_autograd_graph()
- capture_pre_autograd_graph_using_training_ir()
- gm_using_training_ir()

Update XLA pin to include https://github.com/pytorch/xla/pull/8398

There are no more call sites of `capture_pre_autograd_graph`, except:
1) two test cases in coreml, guarded by version guard, PR to remove: https://github.com/apple/coremltools/pull/2400
2) a few call sites guarded by version guard (< 2.5.0)

Test Plan: CI

Differential Revision: D67354440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143426
Approved by: https://github.com/gmagogsfm
2024-12-18 12:07:09 +00:00
eb67dd3e2d [3/N][Memory Profiling] Add memory profiling function for MTIA hooks (#142149)
Design Doc: https://fburl.com/gdoc/47zpuweb
Prototyping:  D66469341

In this diff, we implement two new MTIA hooks to start/stop the profiler and export the memory snapshot.

In the next diff, we will integrate the MTIA backend with the profiler Python API.

Differential Revision: [D66823583](https://our.internmc.facebook.com/intern/diff/D66823583/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142149
Approved by: https://github.com/nautsimon
2024-12-18 11:58:23 +00:00
993b2f0ee0 Fix unused variables in test/test_transformers.py (#143407)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143407
Approved by: https://github.com/drisspg
2024-12-18 09:59:24 +00:00
8dd380803c remove allow-untyped-defs for torch/_functorch/batch_norm_replacement.py (#143438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143438
Approved by: https://github.com/oulgen
2024-12-18 09:01:06 +00:00
75fe5a3ef7 remove allow-untyped-defs for torch/fx/experimental/debug.py (#143439)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143439
Approved by: https://github.com/oulgen
2024-12-18 08:55:46 +00:00
03991798ca remove allow-untyped-defs for torch/nn/parallel/__init__.py (#143437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143437
Approved by: https://github.com/oulgen
2024-12-18 08:50:37 +00:00
a99536480d [ATen][Native][Special] Hermite polynomial prematurely return NaN if n is high (#141955)
Hermite polynomials diverge to NaN at high orders due to numerical overflow. The proposal is to return NaN early if it is known that the result at this order will be NaN.

According to my short test
```Python
import torch
device = "cuda"
dtype = torch.float32

x = torch.linspace(-1000, 1000, 100000, device=device, dtype=dtype)

for n in range(1024):
    if torch.special.hermite_polynomial_h(x, n).isnan().sum().item() == x.shape[0]:
        print(f"hermite_polynomial_h: all outputs are nans! n = {n}")
        break

for n in range(1024):
    if torch.special.hermite_polynomial_he(x, n).isnan().sum().item() == x.shape[0]:
        print(f"hermite_polynomial_he: all outputs are nans! n = {n}")
        break
```

The output values become NaNs at these orders:
```
hermite_polynomial_h: all outputs are nans! n = 53, dtype=torch.float32
hermite_polynomial_he: all outputs are nans! n = 61, dtype=torch.float32
hermite_polynomial_h: all outputs are nans! n = 272, dtype=torch.float64
hermite_polynomial_he: all outputs are nans! n = 304, dtype=torch.float64
```

Surely, it makes sense to increase the limit as a safety margin.
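
The overflow mechanism itself can be reproduced without PyTorch. The following is a minimal sketch in plain Python (float64), not the ATen kernel: `hermite_h` and `first_nan_order` are hypothetical helper names, and the orders at which NaN appears here depend on `x` and will not match the float32/float64 thresholds listed above.

```python
import math


def hermite_h(x, n):
    """Physicists' Hermite polynomial H_n(x) via H_{k+1} = 2x*H_k - 2k*H_{k-1}."""
    if n == 0:
        return 1.0
    prev, cur = 1.0, 2.0 * x
    for k in range(1, n):
        # Once both terms overflow to +/-inf, inf - inf produces NaN.
        prev, cur = cur, 2.0 * x * cur - 2.0 * k * prev
    return cur


def first_nan_order(x, max_n=1024):
    """Smallest order n < max_n at which H_n(x) evaluates to NaN, or None."""
    for n in range(max_n):
        if math.isnan(hermite_h(x, n)):
            return n
    return None


small = hermite_h(1.0, 3)            # H_3(x) = 8x^3 - 12x, so H_3(1) = -4
nan_order = first_nan_order(1000.0)  # some finite order for a large |x|
```

Once the recurrence produces NaN it stays NaN for all higher orders, which is why returning NaN directly above a known order does not change results.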

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141955
Approved by: https://github.com/malfet, https://github.com/eqy
2024-12-18 08:30:08 +00:00
2ea4b56ec8 Record min/max of integral tensor in ET (#143088)
Summary:
In et-replay, random data is used to run the operators. However, this does not work well for ops that use indices to access a tensor, for example embedding ops, which use the indices to look up the embedding table. If random data is used for these index ops, et-replay usually runs into invalid memory accesses.

To fix this, ET provides an environment variable "ENABLE_PYTORCH_EXECUTION_TRACE_INTEGRAL_TENSOR_RANGE". If it is set, ET captures the min/max value of the flattened integral tensor. et_replay then uses the min/max to generate a random tensor within that range, which fixes the invalid memory access issue.
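
The idea of replaying with range-constrained random indices can be sketched in plain Python. This is only an illustration under assumed names: `record_range` and `replay_indices` are hypothetical helpers, and a list of rows stands in for an embedding table; the actual ET and et_replay code is not shown in this commit message.

```python
import random


def record_range(values):
    """What the trace would keep for an integral tensor: (min, max) of its values."""
    return min(values), max(values)


def replay_indices(count, value_range, seed=0):
    """Generate random indices that stay inside the recorded [min, max] range."""
    lo, hi = value_range
    rng = random.Random(seed)
    return [rng.randint(lo, hi) for _ in range(count)]


# A 10-row "embedding table" and the indices seen during the original run.
table = [[float(i)] * 4 for i in range(10)]
recorded = record_range([3, 0, 9, 7, 7])

# Replay: the indices are random, but every lookup stays in bounds.
indices = replay_indices(8, recorded)
rows = [table[i] for i in indices]
```

With unconstrained random integers, the final lookup could index past the end of `table`, which is the analogue of the invalid memory access described above.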

Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA.test_execution_trace_record_integral_tensor_range_cuda

Differential Revision: D66666931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143088
Approved by: https://github.com/sanrise
2024-12-18 08:20:35 +00:00
bceedeec2b fix checking non-trivial input constraints (#143442)
A bunch of auto dynamic shape tests would fail non-strict retraceability because when checking input constraints, we'd compare non-trivial expressions, which would require / affect shape env.
```
... is not tracked with proxy for <torch.fx.experimental.proxy_tensor._ModuleStackTracer object ...
```

I've also observed this bug internally.

This PR does an early check on whether args passed have concrete shapes, and only then proceeds: as before, we
1. try to unify / solve with the arg dim when the corresponding placeholder node dim is symbolic in one symbol
2. check directly if the placeholder node dim is concrete
3. otherwise defer to run time.

Differential Revision: [D67359596](https://our.internmc.facebook.com/intern/diff/D67359596/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143442
Approved by: https://github.com/tugsbayasgalan
2024-12-18 07:29:08 +00:00
90cc43f270 Support garbage collection after pt2 compilation (#143364)
Summary:
Support garbage collection after pt2 compilation.
Add jk to control the global rollout / rollback of this functionality
Add env var to control individual job's rollout

Test Plan:
Test the model training job with / without these changes

Reviewers:
@yuxihu @ezyang , @Yuzhen11 ,

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143364
Approved by: https://github.com/ezyang
2024-12-18 07:25:11 +00:00
9275091d6e [provenance_tracking] Dump inductor_triton_kernel_to_post_grad_nodes.json info in debug_trace (#143055)
Summary:
This diff mainly adds code changes to dump `inductor_triton_kernel_to_post_grad_nodes.json` artifact which contains mapping info from post_grad -> inductor kernel code:
`{"inductor_triton_kernel_name": [post_grad_node_0, post_grad_node_1, ..., ], "..."}.`
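
As an illustration of how a consumer might use an artifact of this shape, the sketch below inverts the kernel-to-nodes mapping so that each post_grad node lists the kernels it ended up in. The kernel and node names are made up for the example, and `invert_mapping` is a hypothetical helper, not part of this change.

```python
import json
from collections import defaultdict


def invert_mapping(kernel_to_nodes):
    """Turn {kernel: [nodes]} into {node: [kernels]}."""
    node_to_kernels = defaultdict(list)
    for kernel, nodes in kernel_to_nodes.items():
        for node in nodes:
            node_to_kernels[node].append(kernel)
    return dict(node_to_kernels)


# Stand-in for the contents of the dumped JSON file (names are hypothetical).
artifact_text = json.dumps({
    "triton_poi_fused_add_0": ["add", "mul"],
    "triton_red_fused_sum_1": ["mul", "sum_1"],
})

kernel_to_nodes = json.loads(artifact_text)
node_to_kernels = invert_mapping(kernel_to_nodes)
```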

Example paste: P1695235000 verified on the test model.  See "Test Plan":

We use this artifact to demonstrate provenance tracking in the frontend 3-tab highlighter tool:
https://github.com/YUNQIUGUO/compiler_explorer (copy/pasted the input files for demo purpose for now and will integrate with Shangdi's tool to 4-tab)

https://pxl.cl/66BzK

Note: Currently this only supports mapping for inductor's `TritonKernel` type. TODO: extend support to `ExternKernel` and other inductor-generated kernel types.

Test Plan:
test_model_coverage.sh:
```
#!/bin/sh
MODEL_ENTITY_ID=644688112
SNAPSHOT_ID=32
MODULE=merge

# buck2 build --show-output mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true -c fbcode.nvcc_arch=a100,h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark

TORCH_COMPILE_DEBUG=1 CUDA_VISIBLE_DEVICES=0 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCH_LOGS="+inductor, schedule, fusion, output_code" TORCH_TRACE="tmp/guorachel_tt" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 ../buck-out/v2/gen/fbcode/d29ee94b913014f1/caffe2/torch/fb/model_transform/experimental/benchmark/__mts_gpu_benchmark__/mts_gpu_benchmark.par --model-path manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR_EP --gpu-trace --aot-inductor-config="{'max_autotune': True}" 2>&1 | tee output.txt
```

```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:provenance_tracing -- --exact 'caffe2/test/inductor:provenance_tracing - test_triton_kernel_post_grad_mapping_aot_inductor (caffe2.test.inductor.test_provenance_tracing.TestProvenanceTracingArtifact)'
```

```
TORCH_LOGS="+inductor, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_triton_kernel_post_grad_mapping_aot_inductor
```

Differential Revision: D66967510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143055
Approved by: https://github.com/chenyang78
2024-12-18 06:51:50 +00:00
6829897682 Remove assert from partitioner.py (#143376)
Remove erroneous assert assuming a dependent (user) node to be in the partition. This partially reverts #136616 by removing the assert.

Tested locally with a failing ExecuTorch Arm test using
```
$ python -m examples.arm.aot_arm_compiler --model_name mv2 --target ethos-u55-128 --delegate --quantize
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143376
Approved by: https://github.com/tarun292
2024-12-18 06:08:19 +00:00
6715a8858a Triton bump for 3.2 cherry-picks (device context) (#143409)
Summary:
* https://github.com/triton-lang/triton/pull/3731
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143409
Approved by: https://github.com/atalman
2024-12-18 05:17:29 +00:00
c17a07ade3 Add float8 support in serde schema (#143343)
Summary:
Fix https://github.com/pytorch/pytorch/issues/141316

Bump up schema minor version.

as title, add float8 support in serde schema

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r  test_serialize_float8
```

Differential Revision: D67307670

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143343
Approved by: https://github.com/yiming0416
2024-12-18 05:07:21 +00:00
576789197a Add support for CPU scalar in addcmul (#143264)
Step required for performance in #143122

Adds support for CPU scalar for tensor_2 in addcmul. For example:
```
import torch
a = torch.rand(2, 2, device="cuda")
b = torch.tensor(1e-3)

torch.add(a, b)
torch.addcmul(a, a, b)  # used to fail, now works
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143264
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-12-18 04:43:29 +00:00
859be14c4e fix a few int64_t index computations, fix complex128 scan that had to… (#143401)
…o few threads
per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143401
Approved by: https://github.com/eqy
2024-12-18 04:27:27 +00:00
c947a7d38e Fix unused Python variables in test/nn (#143396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143396
Approved by: https://github.com/mikaylagawarecki
2024-12-18 03:30:54 +00:00
17a6d4b882 remove allow-untyped-defs for torch/_export/passes/remove_runtime_assertions.py (#143435)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143435
Approved by: https://github.com/oulgen
2024-12-18 03:05:20 +00:00
a9de6a68f4 [CD] Test that all PyTorch wheels support OpenMP (#143394)
Together with https://github.com/pytorch/pytorch/pull/143393 fixes https://github.com/pytorch/pytorch/issues/123225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143394
Approved by: https://github.com/atalman
ghstack dependencies: #143393
2024-12-18 02:27:55 +00:00
2400db115c Use Manylinux 2.28 for nightly build and cxx11-abi (#143423)
As per: https://dev-discuss.pytorch.org/t/pytorch-linux-wheels-switching-to-new-wheel-build-platform-manylinux-2-28-on-november-12-2024/2581

Linux Builds: CPU, CUDA 11.8, CUDA 12.4 switched to Manylinux 2.28 and D_GLIBCXX_USE_CXX11_ABI=1 on the week of Dec 16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143423
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere
2024-12-18 02:02:58 +00:00
e890d67543 Use process pool for precompilation of triton templates (#142450)
Perf results: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2003%20Dec%202024%2022%3A57%3A51%20GMT&stopTime=Tue%2C%2010%20Dec%202024%2022%3A57%3A51%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/eellison/740/head&lCommit=b925256c29ec43e1933e4ede94b16d1f404b595f&rBranch=gh/eellison/740/base&rCommit=a161d6362f7d9db773322d2ce2a3a70aabbecf4b

Training:
<img width="793" alt="image" src="https://github.com/user-attachments/assets/75f5bc0d-8005-4213-ae88-0b94fb187dfc" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142450
Approved by: https://github.com/jansel
2024-12-18 01:48:04 +00:00
c06b5048ba [Inductor] Fix _can_be_inplace function (#143279)
Summary:
Modify _can_be_inplace function: return False if `_other.data` is an instance of `ir.BaseView`.

Fix https://github.com/pytorch/pytorch/issues/143280.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143279
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2024-12-18 00:26:05 +00:00
6cd96f069b Add warning to torch.jit.load (#143403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143403
Approved by: https://github.com/albanD
ghstack dependencies: #143326
2024-12-18 00:17:41 +00:00
ac8342f881 Prevent torch.jit.load path in torch.load when weights_only=True (#143326)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143326
Approved by: https://github.com/albanD
2024-12-18 00:17:41 +00:00
13a5c15ef5 Fix sample inputs leaked from subtest (#143415)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143415
Approved by: https://github.com/jbschlosser
ghstack dependencies: #143333
2024-12-18 00:15:18 +00:00
3f99682fbd NJT linear_backward should not return inner tensor as-is (#143333)
Fixes debug=1 use-count checks https://github.com/pytorch/pytorch/actions/runs/12187808902/job/34002323481#step:22:2521

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143333
Approved by: https://github.com/jbschlosser
2024-12-18 00:15:18 +00:00
feb4818bc9 [SJD] adding kill logic for current process when killing a worker (#141060)
Summary:
We have seen cases where some workers don't receive stop signals, meaning the watchdog isn't stopped accordingly. This diff introduces logic to kill the current pid alongside the worker pid.

Note that there is a case where the worker pid to be killed either doesn't exist or cannot be killed for some reason, which will result in the current pid also not being killed. This seems okay, since the watchdog loop will just attempt to kill the worker pid on the next iteration, but it is worth pointing out.

Test Plan: experiment in next diff shows this works

Differential Revision: D65837085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141060
Approved by: https://github.com/gag1jain
2024-12-18 00:13:02 +00:00
efe21ee59d [MTIA] (3/n) Implement PyTorch APIs to query/reset device peak memory usage (#143347)
Summary: This diff implements the "max_memory_allocated" PyTorch API for MTIA devices, which returns the peak device DRAM usage

Test Plan:
Passed the local unit test
```
buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_max_memory_allocated
```

https://www.internalfb.com/intern/testinfra/testrun/8444249544807192

Reviewed By: yuhc, egienvalue

Differential Revision: D67118173

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143347
Approved by: https://github.com/nautsimon
2024-12-17 23:37:03 +00:00
a040006da7 Force symlink creation when building python on s390x (#143195)
Sometimes it exists already when building on s390x

This change should fix docker image build on s390x.
Example of error can be found here:
https://github.com/pytorch/pytorch/actions/runs/12282230596/job/34365267303
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143195
Approved by: https://github.com/ezyang
2024-12-17 23:01:47 +00:00
2642bbc6dc [CD] Run smoke tests on MacOS wheel (#143393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143393
Approved by: https://github.com/atalman, https://github.com/seemethere
2024-12-17 22:47:07 +00:00
b247f87845 tools: Add a tool to build wheels for multiple python versions (#143361)
Adds a tool to build bdist_wheels sequentially for multiple different
python versions (if specified).

The goal of this tool is to eventually be able to utilize this in our
binary build runs to significantly reduce the amount of time we take to
build packages by utilizing a local ccache from the first build.

Tested locally using the following:
```
$ ccache -C # clear cache
# -p could actually reference any python interpreter
$ python tools/packaging/build_wheel.py \
	-p /home/eliuriegas/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/bin/python3.12 \
	-p /home/eliuriegas/.local/share/uv/python/cpython-3.13.0-linux-x86_64-gnu/bin/python3.13 \
	-d dist-multi/
...
2024-12-17 10:48:11,365 - INFO - Build time (3.12.7): 571.440689s
2024-12-17 10:48:11,365 - INFO - Build time (3.13.0): 191.147503s
```

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143361
Approved by: https://github.com/malfet, https://github.com/atalman
2024-12-17 21:56:06 +00:00
1e058a8f38 FileTimerClient: add retry logic on connect (#143318)
Fixes #143188

The fifo server binds from a thread -- under rare cases the client connects before the server thread starts. This adds a retry when opening the fifo socket in non-blocking mode. This will wait up to 1s for the server to start which balances fast error messages while still providing some wiggle room on the server side.
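
The retry-until-deadline pattern described above can be sketched as follows. This is a minimal illustration, not the `FileTimerClient` code: `open_with_retry` is a hypothetical helper, and `open_fn` is assumed to be something like a non-blocking `os.open` of the FIFO for writing, which raises `OSError` while no reader is attached yet.

```python
import time


def open_with_retry(open_fn, timeout=1.0, interval=0.01):
    """Call open_fn until it succeeds, retrying on OSError for up to `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return open_fn()
        except OSError:
            # Give up once the deadline has passed; otherwise wait and retry.
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)


# Simulated server that only becomes ready on the third attempt.
attempts = []


def fake_open():
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError("no reader on the FIFO yet")
    return "fd"


result = open_with_retry(fake_open)
```

A short bounded wait keeps the error fast when the server never comes up, while tolerating the race in which the client connects slightly before the server thread has started.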

Test plan:

```
pytest --minutes 10 test/distributed/elastic/timer/file_based_local_timer_test.py -k test_watchdog_call_count -x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143318
Approved by: https://github.com/fegin
2024-12-17 21:48:30 +00:00
aabe285aaf Add 2 more APIs to the exposed public torch python APIs (#143380)
These two APIs are being used internally for some projects and need to be exposed as the build for this is done using OSS toolchain.

af8789c056 - this change hid most APIs in torch python, barring the ones explicitly specified, which broke the build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143380
Approved by: https://github.com/suo
2024-12-17 21:16:51 +00:00
0bdc173ab6 [fr] recognize all_reduce_barrier as a valid op (#143354)
Summary:
D67068632 introduced a better profiling name for barrier operations to be able to distinguish various ops.

Unfortunately, this broke Flight Recorder Analysis with the following error as reported by dmwu
```
fr_trace -m torchx-param_bench_16g_mi300x-all_to_all -a 0 --mast_job_version 98 -w 16
Traceback (most recent call last):
  File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 86, in _run_code
```

Test Plan: Test manually.

Differential Revision: D67305997

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143354
Approved by: https://github.com/wconstab
2024-12-17 21:09:18 +00:00
a96387a481 [Dynamo] only import einops if version is lower than 0.7.0 (#142847)
Fixes internal xref (https://fb.workplace.com/groups/257735836456307/posts/804793021750583/?comment_id=805229281706957&reply_comment_id=805232695039949)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142847
Approved by: https://github.com/zou3519
2024-12-17 20:50:25 +00:00
9283c40ba8 [codemod] Decorate unused variables with [[maybe_unused]] (#143381)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143381
Approved by: https://github.com/malfet
2024-12-17 20:36:03 +00:00
7c25a55c65 clean up type nits on torch/jit/_ir_utils.py (#143371)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143371
Approved by: https://github.com/laithsakka
2024-12-17 20:28:07 +00:00
de4a555c82 Run inductor-rocm workflow on ciflow/inductor (#143205)
The paths are almost the same as ciflow/inductor. The only differences I could spot were that ciflow/inductor also has `test/dynamo/**` and `torch/csrc/dynamo/**`

This is to prevent failures like https://github.com/pytorch/pytorch/actions/runs/12304985383/job/34345585535 which fails due to running on a fork, which cannot set the id token.

The other option to prevent this is to stop the job from running when on a fork.

If someone adds both labels, one will be cancelled because they have the same concurrency group

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143205
Approved by: https://github.com/huydhn
2024-12-17 20:09:48 +00:00
b16f020edd Add flex attention kernel parameter tuning options (#139639)
1. Add `num_warps` and `num_stages` to the kernel parameters of `flex_attention`. This allows performance tuning when the default parameters of `flex_attention` are suboptimal, for example for `document_masks`.
2. Update how flex decoding splits are assigned to threadblocks. The first split of full blocks is assigned to the first threadblock, and the first split of partial blocks is assigned to the last threadblock.
3. Update `get_split_k` to assign 2 splits per SM before we have runtime workload balancing based on BlockMask.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139639
Approved by: https://github.com/drisspg
2024-12-17 19:31:40 +00:00
e3c53fb1bc Increase sharding for debug build (#143327)
It started timing out consistently and takes 3+ hours per shard

I assume it's just that we slowly add tests over time, since I cannot find a dramatic jump recently
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143327
Approved by: https://github.com/wdvr, https://github.com/huydhn
2024-12-17 19:27:51 +00:00
5b5d7016c8 Remove stable_partition for ARM AOTI Runtimes (#142394)
Summary: This function call causes OOM issues on ARM machines with multi-threaded predictors (the reason is still being investigated), so we replace it with the standard partition instead.
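
For readers unfamiliar with the two algorithms, the sketch below shows only their semantic difference: a stable partition keeps the relative order inside each group, while a plain partition does not promise that. It is written in Python with hypothetical helper names, it is not the AOTI runtime code, and it does not attempt to explain the OOM.

```python
def stable_partition(items, pred):
    """Matching elements first; relative order is preserved within both groups."""
    return [x for x in items if pred(x)] + [x for x in items if not pred(x)]


def partition(items, pred):
    """Two-pointer swap partition on a copy; order within groups is not guaranteed."""
    items = list(items)
    i, j = 0, len(items) - 1
    while i <= j:
        if pred(items[i]):
            i += 1
        else:
            items[i], items[j] = items[j], items[i]
            j -= 1
    return items, i  # i is the index of the first non-matching element


def is_even(x):
    return x % 2 == 0


data = [1, 2, 3, 4, 5, 6]
stable = stable_partition(data, is_even)
unstable, split = partition(data, is_even)
```

Swapping one for the other is safe only when the caller does not rely on the original ordering within each group.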

Differential Revision: D66904296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142394
Approved by: https://github.com/frank-wei
2024-12-17 19:19:04 +00:00
e7704f41ca Simplify _compute_symbolic_stride() (#138844)
Rewrite _compute_symbolic_stride() to make it simpler and faster.

The existing code involves several inner loops in an attempt to process the common case faster - but in reality this effort is actually slower than the simpler code.

Testing:
The initial version of this PR (which passed all tests) ran both the old algorithm and new algorithm and compared the results to make sure that results were substantially the same (they weren't the same simply because the algorithm allocates new dynamic symbols as part of it).

I also measured the timing of both methods and from the cases I checked the simpler algorithm was generally about 30% faster (which was usually the "fast path" of the old algorithm).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138844
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #138843
2024-12-17 19:16:53 +00:00
63cb5e4ade Move inner loop of _create_symbolic_sizes_strides_storage_offset into its own method (#138843)
Making the next PR easier to review:
- move the inner loop of  _create_symbolic_sizes_strides_storage_offset() into a separate function
- fix lintrunner lints

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138843
Approved by: https://github.com/ezyang
2024-12-17 19:16:53 +00:00
f3ec59d44c Fix non-dense inductor effn attn bias (#141905)
Didn't have any luck making a local repro, partially because of https://github.com/pytorch/pytorch/issues/141888, which will be fixed when we update to triton 3.2. But I verified locally that it fixes https://github.com/pytorch/pytorch/issues/139424 with the triton pin update that is landing soon

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141905
Approved by: https://github.com/drisspg
ghstack dependencies: #143315
2024-12-17 18:55:50 +00:00
1e9ec51431 Fix unused variables in test_serialize_sym_float (#143389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143389
Approved by: https://github.com/Skylion007
2024-12-17 18:55:14 +00:00
18261e9f39 [dynamo] implement framelocals mapping as c++ object (#140063)
Implements https://github.com/pytorch/pytorch/issues/93753 - move frame local guard accessors to C++.

Before, we used dict accessors on a Python dict representing the frame's fastlocals that we manually build. We move this accessor to C++ and additionally use the fastlocal index whenever possible.

Some implementation notes:
- `FrameLocalsMapping` is now initialized as a C++ vector of `PyObject`s. We do not just use the frame's localsplus/fastlocals buffer because we also unbox cells.
- `FrameLocalsMapping` can still be converted into a Python dict representing the frame's fastlocals, but it is done lazily.
- We update `LeafGuard`, `GuardAccessor`, and `GuardManager`'s `check_nopybind` methods to accept `FrameLocalsMapping`. By default, we convert the `FrameLocalsMapping` to a Python dict and run the original `check_nopybind` on it, but in some cases, conversion is not needed.
- We add a new guard accessor `FrameLocalsGuardAccessor`, which is similar to `DictGetItemGuardAccessor` but has special handling for `FrameLocalsMapping`. We create a separate class to emphasize different use cases, but we could probably combine these two (can do in a follow up)
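
A rough Python sketch of the "index access first, dict only on demand" idea from the notes above. The class and method names here are hypothetical; the real `FrameLocalsMapping` is a C++ object and also unboxes cells, which this sketch does not model.

```python
class LazyFrameLocals:
    """Hypothetical stand-in: values stored by fastlocal index, dict built lazily."""

    def __init__(self, names, values):
        self._names = list(names)
        self._values = list(values)
        self._dict = None  # only materialized if a guard really needs a dict

    def get(self, index):
        # Fast path: look a local up by its index, no dict involved.
        return self._values[index]

    def to_dict(self):
        # Slow path: build the name -> value dict once and cache it.
        if self._dict is None:
            self._dict = dict(zip(self._names, self._values))
        return self._dict
```

Guards that can work from an index call `get()` and never pay for the dict; guards that still expect a dict call `to_dict()` and share the cached result.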

dynamo_guard_eval.py microbenchmark update:
- 713.2us -> 630.0us (3.10)
- 598.8us -> 530.7us (3.12)

Other followups:
- Add `FrameLocalsMapping` version for `check_verbose_nopybind` in order to match behavior between `check_nopybind` and `check_verbose_nopybind`. This can prevent difficult debugging situations where guards fail (`check_nopybind` returns false) but no guard error message is generated (`check_verbose_nopybind` succeeds).
- Rewrite the `SHAPE_ENV` guard into C++ - it is a fairly common guard that results in `FrameLocalsMapping` needing to convert to a dict

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140063
Approved by: https://github.com/jansel
ghstack dependencies: #142117, #142430
2024-12-17 18:54:27 +00:00
c04f0bb7b9 [dynamo] add benchmark for guard eval (#142430)
Benchmarks:
- 713.2us (3.10)
- 598.8us (3.12)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142430
Approved by: https://github.com/jansel
ghstack dependencies: #142117
2024-12-17 18:54:27 +00:00
97ca09f692 [dynamo] format eval_frame.c (#142117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142117
Approved by: https://github.com/jansel
2024-12-17 18:54:27 +00:00
53e4d7b6a2 remove allow-untyped-defs for torch/_lazy/device_context.py (#143367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143367
Approved by: https://github.com/aorenste
ghstack dependencies: #143366
2024-12-17 18:54:03 +00:00
bcc93a1e8e remove nonowninglayout special case in require strides (#143315)
NonOwningLayout is always constructed to a FixedLayout. We should handle it the same way as FixedLayout. Note - this case is very rare, I added an assertion here and no test/model failed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143315
Approved by: https://github.com/zou3519
2024-12-17 18:47:38 +00:00
a3688ead4b [AOTI][doc] Update tutorial (#143390)
Summary: Update the cpp inference part to call AOTIModelPackageLoader.run directly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143390
Approved by: https://github.com/yushangdi
2024-12-17 18:35:40 +00:00
fa4db62968 [CI] Unify the XPU Windows CICD installtion scripts (#143185)
Follow https://github.com/pytorch/pytorch/pull/142156
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143185
Approved by: https://github.com/atalman
2024-12-17 18:26:19 +00:00
74e66a21b4 remove allow-untyped-defs for torch/_C/_distributed_autograd.pyi (#143369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143369
Approved by: https://github.com/aorenste
2024-12-17 18:09:28 +00:00
37a1b9efcc [export] Serialize all dataclass fields (#142286)
Reverts a change in #121337. All dataclass members must be serialized, even default-valued members, because downstream code often implicitly assumes their presence.

This PR fixes a segfault when running `test_custom_op_all_inputs` from `test/inductor/test_aot_inductor_custom_ops.py`. This segfault was caused by querying for an "index" field for the `Device` type (see `torch/csrc/inductor/aoti_torch/oss_proxy_executor.cpp:136`), which was previously skipped when serializing if the device index was unspecified. A number of other structs which are deserialized in this file also contain optional fields, and presumably could experience the same bug.
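
To make the failure mode concrete, a small sketch assuming a simplified, hypothetical `Device` dataclass (not the actual export schema) and two hypothetical serializers. Skipping default-valued members drops the `index` key that a downstream reader may look up unconditionally.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Device:
    # Hypothetical, simplified stand-in for a serialized device record.
    type: str
    index: Optional[int] = None


def serialize_skip_defaults(obj):
    """Drop any field whose value equals its declared default."""
    return {
        f.name: getattr(obj, f.name)
        for f in fields(obj)
        if getattr(obj, f.name) != f.default
    }


def serialize_all(obj):
    """Emit every field, including default-valued ones."""
    return {f.name: getattr(obj, f.name) for f in fields(obj)}
```

With `Device("cuda")`, the first serializer produces no `index` key at all, while the second always emits it (as `None`), so a consumer that queries `index` finds the field present.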

Fixes #138955

Fixes #134793
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142286
Approved by: https://github.com/zhxchen17
ghstack dependencies: #142175
2024-12-17 17:21:27 +00:00
bb06fc79fb cpp_builder: handle CUDA lib paths involving "stubs" in more circumstances (#142175)
conda packages for `cuda-driver-dev=12.4.127` use a "stubs" subdirectory to contain `libcuda.so`.  This was previously only handled by cpp_builder in some cases, but now needs to be potentially handled more generally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142175
Approved by: https://github.com/desertfire
2024-12-17 17:21:27 +00:00
e3d754419f Revert "[reland][dynamo][guards] Consider tensors as immutable for dict tag matches (#141085)"
This reverts commit 1bf983077f9f9c19e20dac178aa764b4620d78e7.

Reverted https://github.com/pytorch/pytorch/pull/141085 on behalf of https://github.com/huydhn due to The diff D66211131 has been commandeered internally and it is not part of the train anymore.  If codev is needed, pls reland this accordingly ([comment](https://github.com/pytorch/pytorch/pull/141085#issuecomment-2549092225))
2024-12-17 17:21:14 +00:00
ec02ae4345 remove allow-untyped-defs for torch/utils/benchmark/examples/simple_timeit.py (#143368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143368
Approved by: https://github.com/aorenste
2024-12-17 17:19:11 +00:00
313b9964ae remove allow-untyped-defs for torch/_C/_lazy.pyi (#143370)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143370
Approved by: https://github.com/aorenste, https://github.com/desertfire
ghstack dependencies: #143366
2024-12-17 17:18:10 +00:00
487343346e Prevent users from seeing hardcoded print stmt when hypothesis is not installed (#142398)
Fixes: #142357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142398
Approved by: https://github.com/zou3519
2024-12-17 16:59:05 +00:00
969b07b96f Revert "[ROCm] CK Flash Attention Backend (#138947)"
This reverts commit 500d02921bcf1619e268196866ddf099a4b94080.

Reverted https://github.com/pytorch/pytorch/pull/138947 on behalf of https://github.com/atalman due to Breaks default windows checkout ([comment](https://github.com/pytorch/pytorch/pull/138947#issuecomment-2548998359))
2024-12-17 16:46:57 +00:00
cd7de1f4fa remove allow-untyped-defs for torch/masked/maskedtensor/creation.py (#143321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143321
Approved by: https://github.com/laithsakka
2024-12-17 16:44:50 +00:00
4d90c487d8 [AOTI] Add is_big_gpu checking to test_conv3d (#143339)
Summary: test_conv3d tests max-autotune, which is only supported for big_gpu.

Differential Revision: D67306331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143339
Approved by: https://github.com/BoyuanFeng
2024-12-17 16:18:45 +00:00
792f1c47e9 No actual change, just remove variable contain Tensors from global scope (#143225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143225
Approved by: https://github.com/ezyang
2024-12-17 16:14:25 +00:00
afa313e669 Extend bmm tiling to work up to 2^32 elem in any single output dim (#143095)
The previous tiling implementation worked for up to 2^32 total elements per single batch entry. This extends the functionality to support the dimensions encountered in ComfyUI (output shape: 1,72250,72250).
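
As a quick sanity check on the numbers: 72250 * 72250 = 5,220,062,500, which is above 2^32 = 4,294,967,296, so a single batch entry of that output no longer fits the old limit. The sketch below is only an illustration of row-wise splitting under that limit; `row_tiles` is a hypothetical helper and the actual MPS tiling strategy is not described here.

```python
LIMIT = 2**32  # element limit per tile assumed for this sketch


def row_tiles(rows, cols, limit=LIMIT):
    """Split `rows` into (start, end) ranges of at most `limit` elements each."""
    if cols > limit:
        raise ValueError("a single row already exceeds the limit")
    max_rows = max(1, limit // cols)
    return [
        (start, min(start + max_rows, rows))
        for start in range(0, rows, max_rows)
    ]


tiles = row_tiles(72250, 72250)
```

For the ComfyUI shape this yields two row ranges, each holding fewer than 2^32 elements.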

Fixes #141909
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143095
Approved by: https://github.com/kulinseth
2024-12-17 16:03:46 +00:00
340f02c49b make it clearer (in docs) one can double decorate with torch.library.impl_* APIs (#137608)
Fixes #120503. The fix was originally attempted by @soxand16 in PR https://github.com/pytorch/pytorch/pull/121469. That PR was almost ready to merge, but then went stale (over 6 months old). This PR implements the original fix with refactoring for clarity.

CC: @zou3519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137608
Approved by: https://github.com/zou3519
2024-12-17 15:13:58 +00:00
6bbbb08458 [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [10/N] (#142451)
> This is the last one

related commits:

- #139706
- #140238
- #140247
- #140253
- #140663
- #140688
- #140922
- #140924
- #140933
- #142451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142451
Approved by: https://github.com/bdhirsh
2024-12-17 12:18:29 +00:00
34a0d8b62e [inductor] invalidate pointwise dep cache for LOAF (#141160)
Fixes https://github.com/pytorch/pytorch/issues/141134

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141160
Approved by: https://github.com/vkuzo
2024-12-17 09:51:29 +00:00
5160a725c8 [FlexAttention] Fix broken eager tracing (#143344)
Fixes #143331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143344
Approved by: https://github.com/Chillee
ghstack dependencies: #143299
2024-12-17 09:42:36 +00:00
cf46eb3bf5 [inductor] Include types and size hints in MultiKernel cache key (#142349)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142349
Approved by: https://github.com/eellison, https://github.com/shunting314
2024-12-17 09:26:38 +00:00
e2d47a133b Disable c10::optional macros (#138912)
Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138912
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-12-17 09:22:47 +00:00
c3f3a6e4d2 Back out "Fix undesired specialization on slice after split. (#142372)" (#143356)
Summary:
Original commit changeset: e54ffcc9fd48

Original Phabricator Diff: D67113058

Reviewed By: ezyang

Differential Revision: D67311579

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143356
Approved by: https://github.com/oulgen
2024-12-17 09:17:18 +00:00
2531543c5f [user triton cache] Dedup user-defined Triton kernels by config in codecache (#143353)
Previously, the same kernel source with different autotuning configs would generate the same cache key, which can lead to a wrong cache hit and silent incorrectness. Here we add the configs to the cache key in `FxGraphHashDetails`.
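
A minimal sketch of the collision, assuming a hypothetical `cache_key` helper built on `hashlib`; this is not how `FxGraphHashDetails` is implemented, it only shows why the configs have to be part of the hashed payload.

```python
import hashlib
import json


def cache_key(kernel_source, configs=None):
    """Hash the kernel source, plus the autotuning configs when provided."""
    payload = {"source": kernel_source}
    if configs is not None:
        payload["configs"] = configs
    blob = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


src = "def kernel(x_ptr, BLOCK: tl.constexpr): ..."  # placeholder source text
cfg_a = [{"BLOCK": 64, "num_warps": 4}]
cfg_b = [{"BLOCK": 128, "num_warps": 8}]

same_without_configs = cache_key(src) == cache_key(src)
distinct_with_configs = cache_key(src, cfg_a) != cache_key(src, cfg_b)
```

Keying on source alone makes the two variants indistinguishable; including the configs gives each variant its own entry.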

Test Plan:

```
python3 test/inductor/test_codecache.py -k test_triton_higher_order_op_different_configs
...
----------------------------------------------------------------------
Ran 2 tests in 3.590s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143353
Approved by: https://github.com/oulgen
2024-12-17 08:41:22 +00:00
6056efc5ff non strict sequential slicing (#143298)
Differential Revision: [D67284841](https://our.internmc.facebook.com/intern/diff/D67284841/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143298
Approved by: https://github.com/zhxchen17
2024-12-17 08:35:20 +00:00
297ce77636 [Inductor] inplace padding (#140249)
https://github.com/pytorch/pytorch/issues/139865

This PR may change the semantics of constant_pad_nd from 'clone' to 'view'. I tried a few tests to do inplace update. Looks like thanks to functionalization, this works fine.

Perf for `test_linear_and_cel`:
```
# TORCHINDUCTOR_INPLACE_PADDING=0 DO_PERF_TEST=1 python test/inductor/test_inplace_padding.py -k test_linear_and_cel
inductor_config.inplace_padding=False ms=83.311

# TORCHINDUCTOR_INPLACE_PADDING=1 DO_PERF_TEST=1 python test/inductor/test_inplace_padding.py -k test_linear_and_cel
inductor_config.inplace_padding=True ms=79.827
```

The saving is about 4ms (slightly less since we need to fill 0 for the padding area). Similar savings for llm.c.
- Without the feature: 182.151ms per batch, 180.9K tokens/s
- With the feature:  178.278ms per batch, 183.9K tokens/s. There are 3K tokens/s increase.

Perf test shows a compilation time regression. I'm not sure if that's real. Will debug more. But a good thing is, there is no accuracy failure: [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Nov%202024%2020%3A23%3A22%20GMT&stopTime=Mon%2C%2011%20Nov%202024%2020%3A23%3A22%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/shunting314/186/head&lCommit=03fd924ff382958daf5055dc8425d279e4e10a1e&rBranch=main&rCommit=c03324de2dfbbf0006818c86b88c92a3378f46b7).

UPDATE: Perf test regression seems to be not real. Here is a rerun [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2007%20Nov%202024%2001%3A29%3A55%20GMT&stopTime=Thu%2C%2021%20Nov%202024%2001%3A29%3A55%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/shunting314/186/head&lCommit=7e2c8e5d9256ac06205e7cd5e740c9e20ce804d0&rBranch=main&rCommit=565a7942eee1ddc23067cdbae597443d0f2290a0). Our dashboard is not that reliable recently due to AWS migration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140249
Approved by: https://github.com/jansel
2024-12-17 06:15:48 +00:00
a42ca5a45b remove allow-untyped-defs for _inductor/codegen/rocm/rocm_template_buffer.py (#143272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143272
Approved by: https://github.com/aorenste
2024-12-17 05:34:22 +00:00
d2ec7f0756 [FlexAttention] Allow num_warps 8 since when block size >=128 (#143299)
# Summary
Fixes #143290

We already strip bad configs here: e0e763e331/torch/_inductor/kernel/flex_attention.py (L2299)
So this shouldn't be needed. Confirming that the 64 x 128 case is valid; otherwise we can just change the default config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143299
Approved by: https://github.com/yanboliang
2024-12-17 05:32:41 +00:00
e7ec92331e remove allow-untyped-defs for torch/jit/_ir_utils.py (#143366)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143366
Approved by: https://github.com/aorenste
2024-12-17 05:15:15 +00:00
bcd3692132 [Inductor][Easy] Fix a test failure in loop_ordering_after_fusion (#142474)
Summary:
**Re-land the pr**. The previous one was reverted because of a test failure on SM89. The fix is just removing `xfailIfSM89`.

```
_____________________ LoopOrderingTest.test_fp8_pattern_2 ______________________
Unexpected success
```
------
(Since I am trying the other solution for https://github.com/pytorch/pytorch/pull/141082, I moved out the test case fixes from that pr to a separate pr to land first.)

-----
Testing float8 dynamic scaling case with `TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1` didn't make any difference.

The test case for fp8 (https://github.com/pytorch/pytorch/blob/main/test/inductor/test_loop_ordering.py#L425) is also failing, https://www.internalfb.com/intern/test/844425111960859?ref_report_id=0

-------

The main change here is to modify the condition of calling `loop_reordering` from `shared_data_score == 0` to `shared_data_score < config.score_fusion_memory_threshold`.

Before the change:
`shared_data_score > 0 -> won't loop_reorder -> can't fused because of shared_data_score < config.score_fusion_memory_threshold`
After the change:
`shared_data_score > 0 -> loop_reorder (shared_data_score < config.score_fusion_memory_threshold) -> get a larger shared_data_score -> fused`
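
The before/after condition can be sketched as a tiny predicate. The function name, the `new_condition` switch, and the threshold value of 10 are hypothetical and used only for illustration; the real check lives in the inductor scheduler.

```python
def should_try_loop_reordering(shared_data_score, threshold, new_condition=True):
    """Return True if loop reordering should be attempted before fusion."""
    if new_condition:
        # After the change: any score below the fusion threshold qualifies.
        return shared_data_score < threshold
    # Before the change: only a score of exactly zero qualified.
    return shared_data_score == 0


# A small positive score that is still below the (made-up) threshold.
before = should_try_loop_reordering(5, threshold=10, new_condition=False)
after = should_try_loop_reordering(5, threshold=10, new_condition=True)
```

With the old condition such a node pair is neither reordered nor fused; with the new one it gets a chance to be reordered and reach a fusable score.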

----
It's the same issue as fixed in https://github.com/pytorch/pytorch/pull/136782. But the condition to call loop_reorder might be changed later, causing the test case to fail again.

Test Plan:
```
buck2 test 'fbcode//mode/opt' caffe2/test/inductor:loop_ordering
```
-----
Ran a float8 dynamic scaling training script to verify it e2e

Differential Revision: D67012816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142474
Approved by: https://github.com/eellison, https://github.com/sijiac, https://github.com/shunting314
2024-12-17 04:14:28 +00:00
500d02921b [ROCm] CK Flash Attention Backend (#138947)
Replaces https://github.com/ROCm/pytorch/pull/1592

This PR contains the initial implementation of SDPA with composable_kernel backend. The CK path can be forced by simply calling `torch.backends.cuda.preferred_rocm_fa_library("ck")`. Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option results in aotriton being used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math etc etc). It only gets called when flash attention is both enabled (via `USE_FLASH_ATTENTION`) and is selected at runtime by the existing heuristics.

Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention, courtesy of the hard work of @tridao, who is the co-author

NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138947
Approved by: https://github.com/pruthvistony, https://github.com/xw285cornell, https://github.com/leitian

Co-authored-by: Xiaodong Wang <xw285@cornell.edu>
2024-12-17 02:18:07 +00:00
c15638d803 Enable swap on all Linux jobs (#143316)
A swapfile on Linux runner has been prepared by https://github.com/pytorch/test-infra/pull/6058.  So this PR does 2 things:

* Start using the swapfile on all Linux build and test jobs
* Testing the rollout https://github.com/pytorch-labs/pytorch-gha-infra/pull/582

### Testing

Run `swapon` inside the container and the swapfile shows up correctly:

```
jenkins@259dfb0a314c:~/workspace$ swapon
NAME      TYPE SIZE USED PRIO
/swapfile file   3G 256K   -2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143316
Approved by: https://github.com/ZainRizvi, https://github.com/atalman
2024-12-17 02:12:24 +00:00
cb4c614ed6 [foreach-map] Add tests for backward (#143282)
Adds tests for unary and binary foreach_map w/ backwards

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143282
Approved by: https://github.com/eellison
2024-12-17 02:08:12 +00:00
533d63f83b Revert "FileTimerClient: add retry logic on connect (#143318)"
This reverts commit b3fb8f8a3a2fe07ca61852b09271382c988629fc.

Reverted https://github.com/pytorch/pytorch/pull/143318 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lint jobs in trunk ([comment](https://github.com/pytorch/pytorch/pull/143318#issuecomment-2547342910))
2024-12-17 02:06:52 +00:00
cyy
201cb8834f Enable more C++ warnings (#143099)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143099
Approved by: https://github.com/albanD
2024-12-17 02:03:39 +00:00
af190479c8 [fused_all_gather_matmul] use _multimem_all_gather_matmul for small global Ms (#143160)
## Benchmark
M=2048, N=3584, K=8192

baseline (nccl + cublas): 301us
decomp-based async-tp: 354us
comm-aware async-tp: 295us
**multimem_all_gather matmul: 277us**

As M further decreases, the multimem_all_gather approach consistently outperforms the baseline and other approaches (omitted other approaches in the chart as they start to be slower than the baseline):
![image](https://github.com/user-attachments/assets/5811455a-68c9-43fe-9d82-ca488dd77bc1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143160
Approved by: https://github.com/weifengpy
ghstack dependencies: #142283, #142810, #143159
2024-12-17 01:07:27 +00:00
286921b39e [fused_all_gather_matmul] introduce an argument to specify whether the all-gather result needs to be returned (#143159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143159
Approved by: https://github.com/weifengpy
ghstack dependencies: #142283, #142810
2024-12-17 01:07:27 +00:00
6fae60a34a [SymmetricMemory] introduce multimem_all_gather (#142810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142810
Approved by: https://github.com/weifengpy
ghstack dependencies: #142283
2024-12-17 01:07:27 +00:00
519d858c31 Revert "Kill capture_pre_autograd_graph API (#143224)"
This reverts commit 4c62275325afe21052f3fd49ed4135e3db3c47eb.

Reverted https://github.com/pytorch/pytorch/pull/143224 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the XLA failure is legit ([comment](https://github.com/pytorch/pytorch/pull/143224#issuecomment-2547264675))
2024-12-17 00:47:24 +00:00
9d57a39541 [C10D] Update docs for wait() (#143305)
Clarify that currently active stream, not default stream, is the one
that will be blocked by a call to wait(), and also point out that the
CPU is not blocked by the call for CUDA/nccl collectives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143305
Approved by: https://github.com/LucasLLC, https://github.com/ngimel
2024-12-17 00:41:11 +00:00
b3fb8f8a3a FileTimerClient: add retry logic on connect (#143318)
Fixes #143188

The fifo server binds from a thread -- under rare cases the client connects before the server thread starts. This adds a retry when opening the fifo socket in non-blocking mode. This will wait up to 1s for the server to start which balances fast error messages while still providing some wiggle room on the server side.
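
A hedged sketch of the retry idea, assuming a POSIX system. `open_fifo_with_retry` and its parameters are hypothetical names, not the actual `FileTimerClient` code: opening a FIFO write-only in non-blocking mode fails with `ENXIO` while no reader has it open, so the open is retried until a deadline.

```python
import errno
import os
import time


def open_fifo_with_retry(path, timeout=1.0, interval=0.05):
    """Open `path` for non-blocking writes, retrying until `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return os.open(path, os.O_WRONLY | os.O_NONBLOCK)
        except OSError as exc:
            # ENXIO: no reader yet; ENOENT: the FIFO has not been created yet.
            retryable = exc.errno in (errno.ENXIO, errno.ENOENT)
            if not retryable or time.monotonic() >= deadline:
                raise
            time.sleep(interval)
```

The bounded wait keeps the error fast when the server is really gone, while tolerating a server thread that starts slightly late.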

Test plan:

```
pytest --minutes 10 test/distributed/elastic/timer/file_based_local_timer_test.py -k test_watchdog_call_count -x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143318
Approved by: https://github.com/fegin
2024-12-17 00:36:10 +00:00
90fb7c36ab [FSDP2] Clamp reduce_dtype in lazy init (#143297)
fixes https://github.com/pytorch/pytorch/issues/143277 by moving the clamp of `reduce_dtype` to `None` to lazy init (same place as where `param_dtype` can be clamped to `None`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143297
Approved by: https://github.com/weifengpy
2024-12-17 00:25:08 +00:00
dd2cd4279e Create build_directory if it does not exist when generating ninja build file (#143328)
Fixes: https://github.com/pytorch/vision/issues/8816
I am observing this failure on Windows, Python 3.13 vision builds:
```
Emitting ninja build file C:\actions-runner\_work\vision\vision\pytorch\vision\build\temp.win-amd64-cpython-313\Release\build.ninja...
error: [Errno 2] No such file or directory: 'C:\\actions-runner\\_work\\vision\\vision\\pytorch\\vision\\build\\temp.win-amd64-cpython-313\\Release\\build.ninja'
ERROR conda.cli.main_run:execute(49): `conda run packaging/windows/internal/vc_env_helper.bat python setup.py bdist_wheel` failed. (See above for error)
```

Adding the code above fixes it, confirmed by running `` python setup.py bdist_wheel`` :
```
building 'torchvision._C' extension
Emitting ninja build file C:\actions-runner\_work\vision\vision\pytorch\vision\build\temp.win-amd64-cpython-313\Release\build.ninja...
Creating build directory C:\actions-runner\_work\vision\vision\pytorch\vision\build\temp.win-amd64-cpython-313\Release
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/26] cl /showIncludes /nologo /O2 /W3 /GL /DNDEBUG /MD /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc -Dtorchvision_EXPORTS -IC:\actions-runner\_work\vision\vision\pytorch\vision\torchvision\csrc -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Lib\site-packages\torch\include -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Lib\site-packages\torch\include\torch\csrc\api\include -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Lib\site-packages\torch\include\TH -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Lib\site-packages\torch\include\THC -IC:\actions-runner\_work\_temp\conda_environment_12361066769\include -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Include "-IC:\Pr
```
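
The fix boils down to creating the target directory before writing the file. A minimal sketch under that assumption follows; `write_build_file` is a hypothetical helper, not the actual code in the extension build utilities.

```python
import os
import tempfile


def write_build_file(path, content):
    """Write `content` to `path`, creating missing parent directories first."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "w") as f:
        f.write(content)


with tempfile.TemporaryDirectory() as root:
    # The nested build directory does not exist yet, as in the failing run.
    target = os.path.join(root, "build", "temp", "Release", "build.ninja")
    write_build_file(target, "# placeholder ninja file\n")
    created = os.path.isfile(target)
```

`exist_ok=True` makes the call safe when the directory is already there, so the same path works for fresh and incremental builds.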

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143328
Approved by: https://github.com/kit1980, https://github.com/albanD
2024-12-17 00:20:43 +00:00
467970d683 [AOTI] Relax input alignment assertion (#143236)
Summary: https://github.com/pytorch/pytorch/pull/142136 added a runtime alignment assertion. But the assumption is probably too strict for more flexible use cases of AOTI, e.g. python deployment, see a recent error torchchat ran into for more details, https://github.com/pytorch/torchchat/actions/runs/12322072267/job/34394851280 . This PR relaxes the runtime check and implements copy_misaligned_inputs in cpp instead.

Differential Revision: [D67287922](https://our.internmc.facebook.com/intern/diff/D67287922)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143236
Approved by: https://github.com/malfet, https://github.com/chenyang78
2024-12-17 00:17:39 +00:00
c4ab3e6ceb remove allow-untyped-defs for torch/__config__.py (#143320)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143320
Approved by: https://github.com/aorenste
ghstack dependencies: #143319
2024-12-17 00:16:09 +00:00
0178e43949 remove allow-untyped-defs for torch/utils/_stats.py (#143319)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143319
Approved by: https://github.com/aorenste
2024-12-17 00:16:09 +00:00
ff373171d0 [Profiler] Add Optional Flag to turn off external correlations v2 (#143314)
Summary: The original diff got reverted because its base commit was on a broken version of pytorch that was failing rocm tests. There is no indication that this diff had any effect on rocm. Had trouble rebasing the GH PR after the revert and accidentally closed the PR, so submitting again.

Test Plan: See original PR with same name

Differential Revision: D67293040

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143314
Approved by: https://github.com/leitian, https://github.com/aaronenyeshi
2024-12-16 23:49:13 +00:00
10df370a77 Add missing IValue overloads for SymInt lists (#143167)
We should be able to convert Int lists into SymInt lists.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143167
Approved by: https://github.com/ezyang
ghstack dependencies: #143166
2024-12-16 23:18:55 +00:00
557da8014d [gen_autograd_functions] rename some variables (#143166)
This is a follow-up from https://github.com/pytorch/pytorch/pull/141278.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143166
Approved by: https://github.com/soulitzer
2024-12-16 23:18:55 +00:00
4c62275325 Kill capture_pre_autograd_graph API (#143224)
Summary:
Delete the following API:

- capture_pre_autograd_graph()
- capture_pre_autograd_graph_using_training_ir()
- gm_using_training_ir()

There are no more call sites to `capture_pre_autograd_graph`.

Except
1) two test cases in coreml, PR to remove: https://github.com/apple/coremltools/pull/2400
2) XLA: one test case in pytorch/xla, PR to remove: https://github.com/pytorch/xla/pull/8398
3) a few call sites guarded by version guard (< 2.5.0)

Test Plan: CI

Reviewed By: tugsbayasgalan

Differential Revision: D64056353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143224
Approved by: https://github.com/tugsbayasgalan
2024-12-16 23:06:22 +00:00
6356690b3d Revert "[BE] Revert "Add conda to Manylinux Docker images (#139903)" (#143300)"
This reverts commit c86383f956ee86f34d0ffb94bc229c51c6f11dd9.

Reverted https://github.com/pytorch/pytorch/pull/143300 on behalf of https://github.com/atalman due to failing nova workflows with conda: command not found ([comment](https://github.com/pytorch/pytorch/pull/143300#issuecomment-2547030664))
2024-12-16 22:50:08 +00:00
135a2d4483 Update low prec codegen for div/mod (#142350)
Div/mod in fp16/bf16 requires a downcast to preserve its inputs' dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142350
Approved by: https://github.com/blaine-rister
2024-12-16 21:46:08 +00:00
15aee8e090 update aten bmm CK heuristic (#143294)
Summary: updates heuristic to use new instances based on ck profiling of LLM shapes

Differential Revision: D67280269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143294
Approved by: https://github.com/mxz297, https://github.com/xw285cornell
2024-12-16 21:44:59 +00:00
c86383f956 [BE] Revert "Add conda to Manylinux Docker images (#139903)" (#143300)
This reverts commit 56a40d4ebb0bcf733f1ea5f6efde805326a7a565.

Having conda in manylinux builder images is not required. This was added to have manylinux-builder images as the only images for CD builds after conda-builder is deprecated. However we decided to start using ``almalinux-builder``.

We are using almalinux-builder for linux_job_v2 which contains conda: https://github.com/pytorch/test-infra/blob/main/.github/workflows/linux_job_v2.yml#L114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143300
Approved by: https://github.com/seemethere
2024-12-16 21:40:08 +00:00
4e594f4d12 Triton bump for 3.2 cherry-picks (mmav3 segfault fix, gfx950 support) (#143302)
* https://github.com/triton-lang/triton/pull/5277
* https://github.com/triton-lang/triton/pull/5084
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143302
Approved by: https://github.com/atalman, https://github.com/pruthvistony
2024-12-16 21:22:29 +00:00
401b1498d2 [BE] typing for decorators - distributed/_tensor/ops/utils (#142139)
Test Plan: unit tests

Differential Revision: D62302679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142139
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2024-12-16 21:19:33 +00:00
159b7ad8aa Improve async workers to handle forking for async compile (#142072)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142072
Approved by: https://github.com/masnesral
2024-12-16 21:16:42 +00:00
678f74988d Fix a misspelling [ONNX] (#143301)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143301
Approved by: https://github.com/titaiwangms
2024-12-16 20:19:41 +00:00
8ad842cda4 remove allow-untyped-defs for utils/data/datapipes/dataframe/structures.py (#143273)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143273
Approved by: https://github.com/aorenste
ghstack dependencies: #143271
2024-12-16 20:07:36 +00:00
54ed13cdce Revert "Update low prec codegen for div/mod (#142350)"
This reverts commit ca973069ed9a08782695d9407605e219008821e2.

Reverted https://github.com/pytorch/pytorch/pull/142350 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it breaks an internal test ([comment](https://github.com/pytorch/pytorch/pull/142350#issuecomment-2546615951))
2024-12-16 20:05:14 +00:00
e885225eda Add persistent+TMA version of Triton mm and addmm (#142101)
This PR adds persistent+TMA versions (Triton template + the corresponding infra) for the `tuned_mm` and `tuned_addmm` lowerings. The persistent+TMA choices are added to the GEMM autotuning if (checked by the `use_triton_tma_template` helper):

1. The min. hardware and Triton version requirements are met for the TMA support.

2. The GEMM inputs are compatible with the Triton TMA API (i.e., 16-byte aligned and contiguous).

3. The `config.triton.enable_persistent_tma_matmul` is set to `True`.
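
The three conditions above can be sketched as a single predicate. This is a hypothetical illustration, not the actual `use_triton_tma_template` helper: the `TensorInfo` type, its fields, and the `hw_and_triton_ok` parameter are assumptions made for the example, and the real check may inspect more than the base address.

```python
from dataclasses import dataclass


@dataclass
class TensorInfo:
    """Hypothetical description of a GEMM input (not an Inductor type)."""
    data_ptr: int        # base address in bytes
    is_contiguous: bool  # dense layout, no strides needed


def can_use_persistent_tma(inputs, hw_and_triton_ok, enable_flag):
    """Return True only if all three conditions from the list hold."""
    if not hw_and_triton_ok:   # 1. hardware / Triton version requirements
        return False
    if not enable_flag:        # 3. the config switch must be on
        return False
    # 2. every input must be 16-byte aligned and contiguous
    return all(t.data_ptr % 16 == 0 and t.is_contiguous for t in inputs)
```

Checking the cheap global conditions first keeps the per-tensor loop off the common path where the feature is disabled.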

Additional notes:

1. As added in this PR, the TMA uses are not compatible with prolog / epilogue fusion. To this end, in the new Triton template we currently support: TMA-based loads of A/B, but no prologue fusion; epilogue fusion, but no TMA-based stores of C. TMA + fusion compatibility can be added as a follow-up.

2. The current Triton TMA API (`experimental_device_tensormap_create2d`) does not support strides. Due to this, we limit the applicability of the new Triton template to the cases where the inputs are contiguous.

3. The transposed layouts of A and / or B are supported by passing the constexpr flags to the kernel and adjusting the ordering of the block sizes accordingly in the kernel code (this should have no effect on the kernel perf, as decided at the Triton compilation time).

4. After the next Triton pin update, we can switch to the tensor descriptor API (landed recently in https://github.com/triton-lang/triton/pull/5290) in the new Triton template, which should allow lifting 2 and 3 above.

5. The configs for the new Triton template in `persistent_mm_kernel_configs` are preliminary. We should do more perf exploration and possibly augment the config in a follow-up.

6. This PR is rebased onto and unifies with two related PRs landed previously: https://github.com/pytorch/pytorch/pull/142045 (some infra unification with the persistent+TMA template for _scaled_mm) and https://github.com/pytorch/pytorch/pull/134532 (add possibility to disable prolog fusion for selected choices).

7. The current Triton TMA API only supports 1D and 2D descriptors (even after https://github.com/triton-lang/triton/pull/5290, see [here](9829ce87cc/python/triton/language/core.py (L1957))). For now, this blocks adding persistent+TMA template for `torch.bmm`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142101
Approved by: https://github.com/drisspg, https://github.com/eellison
2024-12-16 19:12:12 +00:00
17b71e5d6a Add config alias (#142088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142088
Approved by: https://github.com/c00w
2024-12-16 18:51:17 +00:00
1b6b86fad7 [dynamo] disable eval frame callback around most of _TorchDynamoContext wrapper function (#143211)
Internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/1559636954674510/

If the `_fn` returned by `_TorchDynamoContext.__call__` makes an external function call, dynamo is recursively invoked. This can cause issues if there are added calls that are not skipped by Dynamo. So we should disable the eval frame callback as much as possible.

Differential Revision: [D67211749](https://our.internmc.facebook.com/intern/diff/D67211749)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143211
Approved by: https://github.com/jansel
2024-12-16 18:38:58 +00:00
1bf983077f [reland][dynamo][guards] Consider tensors as immutable for dict tag matches (#141085)
Reland - https://github.com/pytorch/pytorch/pull/139560

As mentioned in https://github.com/pytorch/pytorch/pull/130341, using `static py::object` can lead to segfaults. I suspect this is the reason for the import system error seen internally (https://www.internalfb.com/sevmanager/view/469592). In this PR, I am removing the `static` part. This is fine and also the right thing to do because this will catch if user changes the flag in the same process for compiling two different functions.

Unfortunately, there is no easy way to trigger this segfault, so I can't write a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141085
Approved by: https://github.com/jansel

Co-authored-by: William Wen <williamwen@meta.com>
2024-12-16 18:38:32 +00:00
338835d0d2 Add support for other backends in get_preferred_device (#132118)
Currently get_preferred_device supports only cuda and cpu. Add support for other backends using backend config.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132118
Approved by: https://github.com/kwen2501
2024-12-16 18:30:41 +00:00
ccf35af142 [Inductor] Fix the Index Put lowering with same input of self and values (#139366)
**Summary**
Fix the issue: https://github.com/pytorch/pytorch/issues/138908, the root-cause is in https://github.com/pytorch/pytorch/issues/138908#issuecomment-2449192447

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_index_put
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_index_add
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139366
Approved by: https://github.com/jgong5, https://github.com/eellison
2024-12-16 17:07:14 +00:00
7ab3177776 Revert "[AMD] Turn on TF32 for aten::mm (#139869)"
This reverts commit e0bdae7884aed09d9e3f1a3f7a53c095e74a9aff.

Reverted https://github.com/pytorch/pytorch/pull/139869 on behalf of https://github.com/jeffdaily due to causing ROCm CI failures, need to investigate, revert for now ([comment](https://github.com/pytorch/pytorch/pull/139869#issuecomment-2546127069))
2024-12-16 16:46:48 +00:00
a8cc19bb51 [CD] Fix XPU linux CD whl test failure (#143268)
Follow-up to https://github.com/pytorch/pytorch/pull/142482; see the original fix PR https://github.com/pytorch/pytorch/pull/130742 and the new issue in https://github.com/pytorch/pytorch/actions/runs/12323126436/job/34403681230
Works for https://github.com/pytorch/pytorch/issues/114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143268
Approved by: https://github.com/atalman
2024-12-16 15:00:03 +00:00
e4d2e81086 Update slow tests (#143278)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143278
Approved by: https://github.com/pytorchbot
2024-12-16 12:40:40 +00:00
d745b2b516 remove allow-untyped-defs for distributed/rpc/_testing/__init__.py (#143271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143271
Approved by: https://github.com/aorenste
2024-12-16 02:35:37 +00:00
9706ada369 [RELAND] Add device-agnostic runtime Device/Stream C++ API (#138677)
# Motivation
This PR intends to add C++ accelerator device-agnostic APIs.

# Additional Context
This PR is a reland. It was reverted because `torch.Event` didn't support the mps backend. We have fixed that in https://github.com/pytorch/pytorch/pull/142468. The previous commit is f84e533a2c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138677
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #143171, #133572
2024-12-16 02:18:41 +00:00
45ac4ebf15 [RELAND] Add UTs for accelerator device-agnostic runtime APIs (#133572)
# Motivation
This PR intends to add UTs for accelerator device-agnostic APIs.

# Additional Context
This PR is a reland. It was reverted because `torch.Event` didn't support the mps backend. We have fixed that in https://github.com/pytorch/pytorch/pull/142468. The previous commit is 952514f0c8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133572
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #143171
2024-12-16 02:18:41 +00:00
c1d4d9d3cf [MPS] Support torch.accelerator.synchronize() on mps (#143171)
# Motivation
Support `torch.accelerator.synchronize()` on mps. The root cause is that MPS doesn't support lazy initialization. So we must check if the current accelerator supports device lazy initialization rather than early return.

# Additional Context
Add a mps UT to test code change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143171
Approved by: https://github.com/albanD
2024-12-16 02:18:32 +00:00
cyy
af8789c056 Hide torch_python symbols (#142214)
Change symbols in torch_python to invisible by default on platforms other than Apple.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142214
Approved by: https://github.com/ezyang
2024-12-16 00:59:26 +00:00
744a303dee [FlexAttention] Optimizing learned bias perf to dq calc (#142281)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142281
Approved by: https://github.com/Chillee
2024-12-15 21:44:32 +00:00
e0bdae7884 [AMD] Turn on TF32 for aten::mm (#139869)
Summary: hipblaslt supports TF32, so adding the support.

Test Plan: CI

Differential Revision: D65435392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139869
Approved by: https://github.com/leitian
2024-12-15 10:02:29 +00:00
5273d8fd2a [audio hash update] update the pinned audio hash (#143265)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143265
Approved by: https://github.com/pytorchbot
2024-12-15 03:41:14 +00:00
9ed045eae9 Revert "[Profiler] Add Optional Flag to turn off external correlations (#142516)"
This reverts commit b29fc52f827cc4b4336ecd24cc0a019ec9cf24b6.

Reverted https://github.com/pytorch/pytorch/pull/142516 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the test is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/142516#issuecomment-2543431758))
2024-12-15 03:34:37 +00:00
dd2d360b7d [ca] re-enable disabled tests (#143247)
FIXES https://github.com/pytorch/pytorch/issues/133197

The unspecified floats PR landed while this test was disabled, and it added an analysis restart which counts towards the backend call counter the test is using

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143247
Approved by: https://github.com/zou3519
2024-12-15 02:11:39 +00:00
cyy
4273e1a059 [5/N] Apply bugprone-unchecked-optional-access (#143111)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143111
Approved by: https://github.com/Skylion007
2024-12-15 01:07:28 +00:00
91bf2e16de [distributed] Remove unused variable in test_composability/test_pp_composability.py (#143191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143191
Approved by: https://github.com/mori360
2024-12-14 12:23:44 +00:00
de484134e4 support slicing with symints in non-strict (#143217)
Differential Revision: [D67215745](https://our.internmc.facebook.com/intern/diff/D67215745/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143217
Approved by: https://github.com/tugsbayasgalan
2024-12-14 10:27:45 +00:00
9933e59c2b [torch][cuda] fix race condition in cuda initialization (#143238)
The access to lazy init callbacks (`_lazy_seed_tracker` and `_queued_calls`) is not synchronized with the initialization lock.

This exposes us to the following race:
1. start `_lazy_init`
2. take `_initialization_lock`
3. flush `_queued_calls` and run them all
4. another thread comes in and uses `_lazy_call` to put something on the queue (in our case, the `manual_seed`)
5. original thread finishes initializing, but never runs that call
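
The race above can be illustrated with a toy model. This is a minimal sketch of one way to close the window, not the code from the PR: the `LazyInit` class and its method names are hypothetical, and only the idea of making the queue access and the initialization share one lock comes from the description.

```python
import threading


class LazyInit:
    """Toy model of a lazily initialized runtime with queued callbacks."""

    def __init__(self):
        # Re-entrant so a queued callback may call back into this object.
        self._lock = threading.RLock()
        self._initialized = False
        self._queued_calls = []

    def lazy_call(self, fn):
        # Decide between "queue" and "run now" under the same lock that
        # guards initialization, so a call cannot slip in after the flush.
        with self._lock:
            if not self._initialized:
                self._queued_calls.append(fn)
                return
        fn()

    def lazy_init(self):
        with self._lock:
            if self._initialized:
                return
            # Flush while holding the lock; nothing can be appended
            # between the flush and setting the flag.
            for fn in self._queued_calls:
                fn()
            self._queued_calls.clear()
            self._initialized = True
```

With this structure a callback registered before initialization runs during the flush, and one registered afterwards runs immediately, so step 5 can no longer drop it.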

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143238
Approved by: https://github.com/ngimel
2024-12-14 07:41:24 +00:00
28d8297712 Migrate compiler config to Config (#143152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143152
Approved by: https://github.com/ezyang
ghstack dependencies: #143229
2024-12-14 07:38:25 +00:00
7c4d29485e Add typechecking indirection for Config (#143229)
When we create a Config[T], we actually dynamically unbox this in the module, so let's have the type checker believe that Config[T] creates a T. This enables proper typechecking support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143229
Approved by: https://github.com/aorenste
2024-12-14 07:38:25 +00:00
be5b342332 [Inductor] Move peak memory pass and overlap pass to be run at the right place (#142822)
This PR moves `decide_global_ordering_of_comms` to run first before all other Inductor scheduler passes, so that downstream passes have the correct dependency tracking info. It also moves peak memory pass and overlap pass to the end of all passes, because they need to be the final decision maker on the node order to achieve the desired peak memory and overlap.

This PR fixes hard-to-debug peak memory pass errors caused by incorrect tracking in `.unmet_dependencies` during the enablement of SimpleFSDP on internal models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142822
Approved by: https://github.com/eellison
2024-12-14 06:53:02 +00:00
3cc617b6a7 __cuda_array_interface__: Use "<V2" for bfloat16. (#143042)
Rationale: While Numpy doesn't support `bfloat16` and therefore there's no official typestr for `bfloat16` in `__array_interface__` (https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.interface.html#__array_interface__), JAX/ml_dtypes uses "<V2":

```
>>> from jax import numpy as jnp
>>> jnp.bfloat16.dtype.str
'<V2'
```

Using the same in PyTorch has the upside of making the typestrs returned by `__cuda_array_interface__` identify the torch dtype uniquely.

### Misc notes

(1) JAX itself just refuses to do `__cuda_array_interface__` for `bfloat16`:

```
>>> from jax import numpy as jnp
>>> jnp.arange(10, dtype=jnp.bfloat16).__cuda_array_interface__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
jaxlib.xla_extension.XlaRuntimeError: INVALID_ARGUMENT: __cuda_array_interface__ is not supported for bfloat16 buffers.
```

(2) The "official" description of `__cuda_array_interface__` doesn't mention bfloat16, it just references `__array_interface__`: https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html

(3) Ongoing issue for numpy to support bfloat16: https://github.com/numpy/numpy/issues/19808

(4) Tweet that triggered this: https://x.com/HeinrichKuttler/status/1866761979349844211, with @ezyang responding.

(5) "<V2" is kinda weird, as it's a "little-endian void" type. When given to Numpy, it gets turned into endian-agnostic:

```
>>> import numpy as np
>>> import ml_dtypes
>>> np.dtype("bfloat16").str
'<V2'
>>> np.dtype("<V2").str
'|V2'
```

Still, it makes sense to have a unique string for `bfloat16` and since Google chose "<V2" we might as well use that.
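
To make the typestr layout concrete, here is a small sketch that splits a typestr into its three parts (byte order, kind character, item size in bytes). The `parse_typestr` helper is hypothetical and written only for this illustration; it is not part of PyTorch or NumPy.

```python
def parse_typestr(typestr):
    """Split an array-interface typestr into (byteorder, kind, itemsize)."""
    if len(typestr) < 3:
        raise ValueError(f"malformed typestr: {typestr!r}")
    byteorder, kind, size = typestr[0], typestr[1], typestr[2:]
    if byteorder not in "<>|" or not size.isdigit():
        raise ValueError(f"malformed typestr: {typestr!r}")
    return byteorder, kind, int(size)


bf16 = parse_typestr("<V2")  # the string chosen for bfloat16
fp16 = parse_typestr("<f2")  # the usual string for float16
```

Both entries are two bytes wide, but the kind character differs (`V` versus `f`), which is what lets a consumer tell the two dtypes apart.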

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143042
Approved by: https://github.com/ezyang
2024-12-14 06:27:52 +00:00
c0a39ad35a [ROCm] Fix TunableOp UTs: Rotating Buffer (#143172)
TunableOp's rotating buffer feature cannot be properly tested because the environment variable that controls this feature is sticky. A Python API is introduced to modify this value.

Additional items in this PR:
* UT for rotating buffer API
* Clean up UTs that were setting the rotating buffer via the environment variable
* Align behavior of environment variable and Python API when a negative value (< 0) is set.
* Update documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143172
Approved by: https://github.com/jeffdaily
2024-12-14 06:18:11 +00:00
96c3b2c388 Expose remaining sharedMem cudaDeviceProps to python (#143226)
Was a bit too fast with my earlier PR, `sharedMemPerMultiprocessor` includes some memory that is reserved for the system. The amount a kernel can actually use is limited by `sharedMemPerBlockOptin`.

I also expose `sharedMemPerBlock` for completeness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143226
Approved by: https://github.com/ezyang
2024-12-14 06:13:28 +00:00
cyy
4764303cc6 Use static initialization to avoid once_flag in getCUDAHooks (#143198)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143198
Approved by: https://github.com/albanD
2024-12-14 06:05:41 +00:00
23379e8933 Add torch._compile to uninteresting files (#143209)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143209
Approved by: https://github.com/albanD
2024-12-14 05:40:21 +00:00
ca973069ed Update low prec codegen for div/mod (#142350)
Div/mod in fp16/bf16 requires a downcast to preserve its inputs' dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142350
Approved by: https://github.com/blaine-rister
2024-12-14 03:53:28 +00:00
24f24eebde Get rid of _lazy_import hack (#143213)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143213
Approved by: https://github.com/aorenste, https://github.com/albanD
2024-12-14 03:46:21 +00:00
698eefaddd [audio hash update] update the pinned audio hash (#143245)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143245
Approved by: https://github.com/pytorchbot
2024-12-14 03:37:56 +00:00
cyy
e9f6045e80 [15/N] Fix extra warnings brought by clang-tidy-17 (#143100)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143100
Approved by: https://github.com/Skylion007
2024-12-14 03:24:10 +00:00
33dee721ae Reraise worker errors as runtime errors in more cases when the original exception can't be constructed (#140911)
related to https://github.com/pytorch/pytorch/issues/34130

when pytorch attempts to re-raise an exception from a worker process (e.g. multiprocessing dataloader), if it can't reconstruct the original exception message due to a type error, it instead raises it as a runtime error. However, if it can't reconstruct the exception for some other reason, it throws an error with a stacktrace pointing to the `ExceptionWrapper` code rather than the original underlying issue.

One case in which I run into this is with boto3's [HTTPClientError](66dc1f8d52/botocore/exceptions.py (L94))s. They must be constructed with a keyword argument `error`, but if `error` isn't passed, a `KeyError` is thrown instead of a `TypeError`, due to the particular way it is implemented:

* [HTTPClientError](66dc1f8d52/botocore/exceptions.py (L94))'s constructor accepts variable keyword arguments it passes to `super` (BotoCoreError)
* [it also defines a field `fmt` with `error`](66dc1f8d52/botocore/exceptions.py (L95))
* BotoCoreError [expects to be able to format that string with the kwargs](66dc1f8d52/botocore/exceptions.py (L41))

So in this case, if a HTTPClientError occurs on a worker process, you simply get a `KeyError: error` with a stacktrace pointing to [this line](3e2f276a14/torch/_utils.py (L710)) which is unhelpful.

Instead, I propose to reraise the error as a `RuntimeError` unconditionally.
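
The failure mode and the proposed fallback can be sketched as follows. `NeedsKeyword` is a stand-in written for this example (it is not botocore code), and `reraise` is a simplified, hypothetical version of the re-raise step rather than the actual `ExceptionWrapper` implementation.

```python
class NeedsKeyword(Exception):
    """Stand-in for an exception whose constructor formats a template."""

    fmt = "request failed: {error}"

    def __init__(self, request=None, **kwargs):
        self.request = request
        # Raises KeyError (not TypeError) when 'error' is not supplied.
        super().__init__(self.fmt.format(**kwargs))


def reraise(exc_type, message):
    """Rebuild exc_type from a message, falling back to RuntimeError."""
    try:
        exception = exc_type(message)
    except Exception:
        # Any construction failure, not only TypeError, takes the
        # fallback, so the original message is not lost.
        raise RuntimeError(message) from None
    raise exception
```

Calling `reraise(NeedsKeyword, "worker failed")` surfaces a `RuntimeError` carrying the worker's message instead of a bare `KeyError: 'error'`.
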
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140911
Approved by: https://github.com/vmoens
2024-12-14 03:11:36 +00:00
cdc03f99b7 [ca] add graph id (#141906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141906
Approved by: https://github.com/jansel
ghstack dependencies: #141919
2024-12-14 03:02:06 +00:00
19f3570000 [EZ] Remove --pre from numpy installation command (#143237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143237
Approved by: https://github.com/janeyx99, https://github.com/kit1980
2024-12-14 02:55:21 +00:00
bf8d4f5b7a [Inductor UT] Generalize device-bias code in test_triton_syntax.py. (#143178)
Fix #143177

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143178
Approved by: https://github.com/eellison
2024-12-14 02:08:32 +00:00
86c3370bc3 operator benchmark: write output to a JSON (#142809)
This pull request adds the functionality of writing the output of operator benchmark to an optional JSON file specified. The output is still printed in the terminal like before, but the user has the option of saving it in a JSON file as well.

Main part of the functionality is implemented using the function _perf_result_to_dict which outputs a dictionary to be put inside a JSON file. Each dictionary corresponds to a single test.
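
A rough sketch of the described flow, with one dictionary per test and an optional output file. The field names and the `dump_results` function are assumptions for illustration; the real `_perf_result_to_dict` may record different fields.

```python
import json


def perf_result_to_dict(test_name, latency_us):
    """Hypothetical per-test record; the real field names may differ."""
    return {"test_name": test_name, "latency_us": latency_us}


def dump_results(results, path=None):
    """Print each result and, if a path is given, also save them as JSON."""
    records = [perf_result_to_dict(name, lat) for name, lat in results]
    for record in records:
        print(record)  # terminal output is kept, as before
    if path is not None:
        with open(path, "w") as f:
            json.dump(records, f, indent=2)
    return records
```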

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142809
Approved by: https://github.com/albanD
2024-12-14 01:42:00 +00:00
12098ad242 Add torch.cat tensors type promotion description (#141339)
Fixes #126964

Add note description about type promotion of `torch.cat`

**Test Result**

**Before**
![image](https://github.com/user-attachments/assets/2449f11b-48ed-406e-b73e-6d00f8eadb00)

**After**
![image](https://github.com/user-attachments/assets/cba99572-e8b1-4b9c-ba95-a963b54859ba)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141339
Approved by: https://github.com/albanD
2024-12-14 01:36:41 +00:00
13233e062d Fix Apple Clang ICE when building with -march=armv8.6a (#142879)
When investigating #142703, I found that the build with -march=armv8.6 on my M1 mac was hitting a clang ICE. When looking at the blame code, I finally noticed that this constructor was nonsense, apparently in a way that the compiler frontend accepted but the backend choked on.

example ICE error message:
```
fatal error: error in backend: Cannot select: 0x12689c260: bf16 = uint_to_fp 0x1258324a0
  0x1258324a0: i32 = AssertZext 0x125822d90, ValueType:ch:i16
    0x125822d90: i32,ch = CopyFromReg 0x1238dddc0, Register:i32 %22
      0x12689c6c0: i32 = Register %22
In function: _ZN2at6native7DEFAULTL12logit_kernelERNS_18TensorIteratorBaseERKN3c106ScalarE
c++: error: clang frontend command failed with exit code 70 (use -v to see invocation)
Apple clang version 16.0.0 (clang-1600.0.26.3)
Target: arm64-apple-darwin24.1.0
Thread model: posix
```

Unbreaks `env CFLAGS=-march=armv8.6-a CXXFLAGS=-march=armv8.6-a python setup.py develop --cmake` on M1 Mac.

Differential Revision: [D67102953](https://our.internmc.facebook.com/intern/diff/D67102953/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142879
Approved by: https://github.com/malfet
2024-12-14 01:07:01 +00:00
063194aa32 add additional CK BMM Instances (2) (#142874)
Summary: stacked changes to keep new codegen-ed instances below 2000 LOC

Reviewed By: zjing14

Differential Revision: D66985408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142874
Approved by: https://github.com/mxz297
2024-12-14 01:04:34 +00:00
00b0210139 [Inductor] Use sleef implementation for CPP backend asinh codegen (#142360)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/142345. Previously, we used `asinh(x) = log(x + sqrt(1 + x**2))` to compute `asinh`. The issue occurs for inputs such as `-10000.1`, where `x + sqrt(1 + x**2)` is close to 0 and `log(0)` is invalid. This PR uses the `sleef` implementation to fix the issue.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_asinh_with_corner_inputs
```
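
The cancellation can be reproduced in plain Python. This sketch is not the sleef algorithm: it only contrasts the direct formula with a sign-symmetric variant. Because Python floats are double precision, the example uses `-1e9` rather than `-10000.1` to trigger the same effect; that choice of input is an assumption made for the demonstration.

```python
import math


def asinh_naive(x):
    # Direct formula: x + sqrt(1 + x*x) cancels for large negative x.
    return math.log(x + math.sqrt(1.0 + x * x))


def asinh_stable(x):
    # asinh is odd, so evaluate on |x| (no cancellation) and restore the sign.
    return math.copysign(math.log(abs(x) + math.sqrt(1.0 + x * x)), x)


x = -1e9
try:
    naive = asinh_naive(x)
except ValueError:
    naive = None  # log(0.0) is a domain error in Python's math module
stable = asinh_stable(x)
```

Here `1.0 + x * x` rounds to `x * x`, the square root returns exactly `1e9`, and the sum inside the logarithm becomes `0.0`.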

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142360
Approved by: https://github.com/jgong5
2024-12-14 00:27:55 +00:00
d53164880f dont attempt to fuse in unaligned accesses to mm (#142435)
This isn't profitable - we were trying to fuse in a padding of unaligned mm, which defeats padding's purpose.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142435
Approved by: https://github.com/jansel
ghstack dependencies: #142401, #142402
2024-12-14 00:22:31 +00:00
70be7900bb Fix Tensor clear to properly clear slots (#143203)
Fixes a bug introduced in https://github.com/pytorch/pytorch/pull/137267

While the test ensures the finalizer did run to make sure things are cleared, the objects are not properly collected by the gc due to the faulty tp_clear implementation. So, while the finalizer did run, the object was still alive.
Fixing this by giving tp_clear the same treatment as tp_traverse and tp_dealloc on Tensor: make it a unique function that handles the full subclass hierarchy in one place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143203
Approved by: https://github.com/ezyang, https://github.com/colesbury
ghstack dependencies: #143202
2024-12-14 00:17:07 +00:00
8741d72e3c move function before modifying it (#143202)
This is a no-op. Just to make the diff in the next PR easier to read

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143202
Approved by: https://github.com/ezyang, https://github.com/janeyx99
2024-12-14 00:17:07 +00:00
3bfdf6f063 Exclude py 3.13t triton package from PyTorch 3.13t wheel (#143218)
Follow up after https://github.com/pytorch/pytorch/pull/143162
Include triton only for 3.13 packages not 3.13t
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143218
Approved by: https://github.com/kit1980
2024-12-14 00:12:45 +00:00
515abb7744 [CI] Add Triton 3.13t build (#143212)
By just extending the matrix and invoking script with appropriate cpython runtime
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143212
Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/seemethere
2024-12-13 23:45:47 +00:00
8621b9ff0c Infer whether prologues can be computed without upcasting to fp32 without changing numerics (#142402)
For prologues which only do either loads like gathers or dtype conversions, and no actual arithmetic on lower-precision types, we can codegen them without upcasting to fp32 without changing numerics.

Prologues that actually do arithmetic will need to use invoke quant. But I would like to support upcasts/gathers out of the box.

We could potentially extend this in the future to avoid upcasting max pooling operations as well, if there were perf benefits to be had (less likely).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142402
Approved by: https://github.com/jansel
ghstack dependencies: #142401
2024-12-13 23:25:15 +00:00
4e0de50eb5 Revert "[CI] Add Triton 3.13t build (#143212)"
This reverts commit 571cd92d7c4c7bd2d5f068b5a285e0e70b8d0a40.

Reverted https://github.com/pytorch/pytorch/pull/143212 on behalf of https://github.com/janeyx99 due to lint is failing, the other failures don't seem relevant but ci has turned red after this change haha ([comment](https://github.com/pytorch/pytorch/pull/143212#issuecomment-2542521875))
2024-12-13 23:03:45 +00:00
f406207af2 Revert "[ROCm] Prune old gfx archs gfx900/gfx906 from binaries (#142827)"
This reverts commit 1e2b841675e50a6abd8dab9a95b33fda64b12e2b.

Reverted https://github.com/pytorch/pytorch/pull/142827 on behalf of https://github.com/jeffdaily due to prematurely dropped support for gfx900/gfx906 ([comment](https://github.com/pytorch/pytorch/pull/142827#issuecomment-2542507857))
2024-12-13 22:48:44 +00:00
ad2faec8bb Add a pass which analyzes whether a prologue preserves zero mask (#142401)
We load inputs to prologue fusion with a mask. That mask must still be zero before we run `tl.dot`. Previously, we would always apply the mask:
```
        tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last')
        tmp1 = tmp0.to(tl.float32)
        a = tl.where(a_mask, tmp1, 0.0)
```
now we do not need to ->
```
        tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last')
        tmp1 = tmp0.to(tl.float32)
        a = tmp1
```
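
The idea behind the analysis can be approximated in a few lines: if every step of an elementwise prologue maps zero to zero, the masked-out lanes (assumed here to hold zero after the load) are still zero before the dot product, and the extra `tl.where` is redundant. The `preserves_zero` helper below is a hypothetical simplification that evaluates the chain at zero; the actual pass presumably reasons about the ops symbolically.

```python
import math


def preserves_zero(ops):
    """True if applying the unary ops in order maps 0.0 to 0.0."""
    value = 0.0
    for op in ops:
        value = op(value)
    return value == 0.0


cast_only = [float]                      # dtype conversion, as in the example
scaled = [float, lambda v: v * 2.0]      # multiplication keeps zero at zero
shifted = [float, lambda v: v + 1.0]     # addition of a constant does not
exponent = [math.exp]                    # exp(0) == 1, mask must be reapplied
```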

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142401
Approved by: https://github.com/jansel
2024-12-13 22:37:33 +00:00
b29fc52f82 [Profiler] Add Optional Flag to turn off external correlations (#142516)
Summary: External Correlations are super spammy and oftentimes not even useful. Add flag during init to remove them entirely

Test Plan: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Dec_10_12_33_31.531106.pt.trace.json.gz&bucket=gpu_traces

Differential Revision: D67048206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142516
Approved by: https://github.com/ngimel
2024-12-13 22:32:09 +00:00
bb574abe73 [BC-Breaking]Remove capture_pre_autograd_graph references in quantization (#139505)
Summary:
As title

This is a BC-breaking change because a graph produced by "capture_pre_autograd_graph" can no longer be used as input to quantization. But this is ok, since this API has been deprecated for a while and is going to be deleted. We have removed all call sites of it.

We remove the deprecated API references in code, docs, and tests.

We also removed two tests that are specific to the capture_pre_autograd_graph API.

Test Plan: CI

Differential Revision: D65351887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139505
Approved by: https://github.com/tugsbayasgalan, https://github.com/andrewor14, https://github.com/jerryzh168
2024-12-13 22:26:22 +00:00
d25e6e623f Fix unused Python variables in test/[a-d]* (#134665)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134665
Approved by: https://github.com/albanD
2024-12-13 22:13:12 +00:00
e19f493f02 add private config to temporarily preserve old FSDP guard behavior (#142871)
Summary: https://github.com/pytorch/pytorch/pull/138819 wobbled dynamo guards in a way that caused some performance regression, so this PR temporarily adds a config to get the old behavior back while we investigate.

Test Plan: CI

Differential Revision: D67096751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142871
Approved by: https://github.com/yf225
2024-12-13 22:06:48 +00:00
8fae4397b4 Add "inductor_pre_grad_graph" logging (#142717) (#143126)
Summary:

Add new structured logging "inductor_pre_grad_graph"

This is for inductor provenance tracking front-end to load this graph from tlparse.
ghstack-source-id: 257581974
exported-using-ghexport

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' //caffe2/test/dynamo:test_dynamo -- -r StructuredTraceTest
```

Differential Revision: D67150288

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143126
Approved by: https://github.com/desertfire
2024-12-13 21:48:25 +00:00
8a04018329 [MPS] Fix conv backward for channels last (cont) (#143196)
This is a continuation of https://github.com/pytorch/pytorch/issues/140902 but extends the same logic to input.

Looks like the existing channels-last logic just produced incorrect results on pre-MacOS-15 versions and fails on MacOS-15, so removing it feels like the right idea

Fixes https://github.com/pytorch/pytorch/issues/142344
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143196
Approved by: https://github.com/manuelcandales
2024-12-13 21:32:42 +00:00
571cd92d7c [CI] Add Triton 3.13t build (#143212)
By just extending the matrix and invoking script with appropriate cpython runtime
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143212
Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/seemethere
2024-12-13 21:28:52 +00:00
60c54467db [logging] Log runtime autotuning timing to scuba (#141919)
See test plan in internal diff [D66679369](https://our.internmc.facebook.com/intern/diff/D66679369)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141919
Approved by: https://github.com/jamesjwu, https://github.com/ezyang
2024-12-13 21:22:13 +00:00
0d6d29af38 [CUDA] Follow up to clean up some set_per_process_memory_fraction usage in tests (#142811)
follow-up to #140852 now that #140620 has landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142811
Approved by: https://github.com/Skylion007
2024-12-13 21:09:05 +00:00
65d0a25289 [associative_scan] patch inductor tests to always run with static shape (#143161)
fixes #143053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143161
Approved by: https://github.com/eellison
2024-12-13 21:06:12 +00:00
52f31cc238 dynamo tracing perf: Guard slots: 51.76 -> 51.34 (#143060)
See #143056 for overall docs.

This PR: Add slots to Guard
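
As a rough illustration of what adding slots changes, here is a generic sketch. It is not the actual `Guard` definition; `GuardWithDict` and `GuardWithSlots` are hypothetical names used only for this example.

```python
class GuardWithDict:
    # Regular class: every instance carries its own per-instance __dict__.
    def __init__(self, name, source):
        self.name = name
        self.source = source


class GuardWithSlots:
    # Slotted class: attributes live in fixed slots, so no __dict__ is created.
    __slots__ = ("name", "source")

    def __init__(self, name, source):
        self.name = name
        self.source = source


plain = GuardWithDict("x", "local")
slotted = GuardWithSlots("x", "local")

print(hasattr(plain, "__dict__"))    # True
print(hasattr(slotted, "__dict__"))  # False
```

Skipping the per-instance dictionary tends to make object creation and attribute access slightly cheaper, which matters when many such objects are created while tracing.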
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143060
Approved by: https://github.com/jansel
ghstack dependencies: #143066, #143056, #143058, #143059
2024-12-13 21:02:50 +00:00
e87f07d3b8 Revert "Migrate compiler config to Config (#143152)"
This reverts commit 1ebdfd56053dafa8880a0dedf535fff70aa92e09.

Reverted https://github.com/pytorch/pytorch/pull/143152 on behalf of https://github.com/oulgen due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/143152#issuecomment-2542342073))
2024-12-13 20:55:14 +00:00
625b4edb97 [CD] Test torch.compile on 3.13 (#143207)
Follow up after https://github.com/pytorch/pytorch/pull/143162
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143207
Approved by: https://github.com/atalman, https://github.com/ZainRizvi
2024-12-13 20:01:36 +00:00
fe9365f3f5 Add check_binary workflow to pytorch/pytorch (#143201)
Migrated from pytorch/builder
Related to: https://github.com/pytorch/builder/issues/2054

Copying from : 3468139e81
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143201
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-12-13 19:30:10 +00:00
8f40446770 Fix precedence of bitwise and/or printing (#143197)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143197
Approved by: https://github.com/albanD, https://github.com/williamwen42
2024-12-13 19:29:42 +00:00
1ebdfd5605 Migrate compiler config to Config (#143152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143152
Approved by: https://github.com/ezyang
ghstack dependencies: #143150, #143151
2024-12-13 19:29:07 +00:00
f1ff8bc1c5 Add type to Config (#143151)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143151
Approved by: https://github.com/ezyang
ghstack dependencies: #143150
2024-12-13 19:29:07 +00:00
9d05c8110d Require Config to have a default (#143150)
With aliases coming soon, we want to reject alias + default combo, so we need defaults to be passed in. On top of this, this simplifies statically type checking config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143150
Approved by: https://github.com/ezyang
2024-12-13 19:28:59 +00:00
bf711a9cce [ROCm] Improve performance of reduce sum for 3D shapes (#143137)
Improve performance of reduce sum for 3D shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143137
Approved by: https://github.com/jeffdaily, https://github.com/eqy
2024-12-13 19:02:00 +00:00
6178be822d dynamo tracing perf: direct Guard: 52.58 -> 51.76 (#143059)
See #143056 for overall docs.

This PR: Remove explicit constant check from `VariableBuilder.install_guards()`
the args calling convention.  Also remove a lambda binding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143059
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #143066, #143056, #143058
2024-12-13 18:20:48 +00:00
6bcda3a21a dynamo tracing perf: cache on import_source: 52.9 -> 52.58 (#143058)
See #143056 for overall docs.

This PR: add cache to `InstructionTranslatorBase.import_source()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143058
Approved by: https://github.com/jansel
ghstack dependencies: #143066, #143056
2024-12-13 18:20:48 +00:00
b472d82c96 dynamo tracing perf: import in build: 60.48 -> 59.92 (#143056)
A series of directed perf improvements to drive down the dynamo tracing cost of
the given test. Before this PR stack the compile took about 60s, and after takes
30s. Individual improvements are listed below along with the approximate
improvement of that change.

Tested with this model:
```
@torch.compile(backend="eager")
def model_add(x, y):
    out = x
    for i in range(5000):
        out = torch.add(out, y)
    return out
```

This PR: Stop importing builder in the inner loop of `VariableTracker.build()`
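
A generic sketch of the idea, assuming nothing about the real `VariableTracker.build()` code: an `import` statement inside a hot function is re-executed on every call, while a reference resolved once can simply be reused. The function names and the use of `json` are placeholders for illustration.

```python
import timeit


def build_with_inner_import(value):
    # The import statement runs on every call; even when the module is already
    # cached in sys.modules, the lookup is repeated each time.
    import json

    return json.dumps(value)


import json as _json  # resolved once, when this module is loaded


def build_with_hoisted_import(value):
    # Reuses the module reference that was resolved once above.
    return _json.dumps(value)


if __name__ == "__main__":
    for fn in (build_with_inner_import, build_with_hoisted_import):
        elapsed = timeit.timeit(lambda: fn(1), number=100_000)
        print(f"{fn.__name__}: {elapsed:.3f}s")
```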

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143056
Approved by: https://github.com/jansel
ghstack dependencies: #143066
2024-12-13 18:20:48 +00:00
63e1f97f4b dynamo tracing perf: don't unnecessarily call getframeinfo on the hot path: 47.26 -> 37.66 (#143066)
See #143056 for overall docs.

This PR: Stop using `getframeinfo()` when we only care about the function name
and throw the rest away.
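
A minimal standalone sketch of the difference (not the Dynamo code itself; the helper names are made up for this example): `inspect.getframeinfo()` also resolves the filename, line number and source context, whereas the function name alone is already available on the frame's code object.

```python
import inspect
import sys


def name_via_getframeinfo():
    # getframeinfo() builds a full record (file, line, source context),
    # of which only the function name is used here.
    frame = sys._getframe(0)
    return inspect.getframeinfo(frame).function


def name_via_code_object():
    # The function name can be read directly from the code object.
    return sys._getframe(0).f_code.co_name


print(name_via_getframeinfo())
print(name_via_code_object())
```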

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143066
Approved by: https://github.com/jansel
2024-12-13 18:20:48 +00:00
e0c8abda76 Fix potentially undefined behaviour in index_put sample input (#143116)
From the [docs](https://pytorch.org/docs/stable/generated/torch.Tensor.index_put_.html) for index_put_:

> If accumulate is True, the elements in values are added to self. If accumulate is False, the behavior is undefined if indices contain duplicate elements.

Currently the sample inputs for `index_put` generate 2 indices. Because they are generated randomly, they could be the same, leading to undefined behaviour if `accumulate=False`.

This PR changes the input generation to only generate a single index if `accumulate=False` preventing duplicate indices and undefined behaviour.
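
A small sketch of the two well-defined cases described above (the tensors here are made up for illustration and are not the actual sample inputs): duplicate indices with `accumulate=True`, and a single index with `accumulate=False`.

```python
import torch

target = torch.zeros(3)
values = torch.tensor([1.0, 2.0])

# accumulate=True: duplicate indices are well defined, both values are added.
duplicate_idx = (torch.tensor([0, 0]),)
accumulated = target.clone().index_put_(duplicate_idx, values, accumulate=True)
print(accumulated)  # tensor([3., 0., 0.])

# accumulate=False: only well defined when indices are unique, so a single
# index is used here, mirroring the approach taken for the sample inputs.
single_idx = (torch.tensor([0]),)
written = target.clone().index_put_(single_idx, torch.tensor([5.0]))
print(written)  # tensor([5., 0., 0.])
```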

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143116
Approved by: https://github.com/albanD
2024-12-13 17:59:01 +00:00
23b8ea3094 Allow disabling int specialization on nn.Modules (#142829)
Resolves issue #140464 by adding an option to not specialize int from nn.Modules (False by default to maintain existing behavior).

Test Plan: `buck2 test mode/opt caffe2/test/dynamo:test_dynamo -- test_modules.py::NNModuleTests::test_nn_module_unspec_int_attr`

Differential Revision: D66837042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142829
Approved by: https://github.com/ezyang, https://github.com/yanboliang
2024-12-13 17:26:11 +00:00
82a45d19b4 Expose sharedMemPerMultiprocessor device property to python (#143119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143119
Approved by: https://github.com/ezyang
2024-12-13 16:53:57 +00:00
3f62054de1 [ROCm] upgrade nightly wheels to rocm6.3 - 1 of 2 (docker images) (#142151)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142151
Approved by: https://github.com/jeffdaily
2024-12-13 16:21:17 +00:00
7968732f5b Fix int8 mm V.ops.mul dispatching (#143127)
This is sort of subtle: because we were doing `V.ops.mul` at binding time, we don't redispatch later when we invoke the epilogue, and we then run into the assertion check in the PR above.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143127
Approved by: https://github.com/drisspg
ghstack dependencies: #143048
2024-12-13 16:17:23 +00:00
da67a6a7bb [inductor] Replace set by OrderedSet (#138466)
Uses the set_linter from https://github.com/pytorch/pytorch/pull/138454
and considerable manual editing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138466
Approved by: https://github.com/eellison
2024-12-13 16:08:45 +00:00
fbfc530442 [export][ez] Fix forward D67044185 (#143193)
Summary: Fixing forward D67044185 and T210459833 by adding the missing build file.

Test Plan: buck2 build --flagfile fbcode//mode/opt fbcode//admarket/training_data/augmentation/processors/tests:model_manager_test

Differential Revision: D67200056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143193
Approved by: https://github.com/tugsbayasgalan
2024-12-13 16:06:42 +00:00
04bb82f097 Linux Wheels: Remove triton dependency python < 3.13 constraint (#143162)
We do build pytorch-triton package for python 3.13 : https://github.com/pytorch/pytorch/actions/runs/12304476674/job/34344764271
Hence constraint is no longer needed.
This stack enabled torch.compile for Python 3.13 : https://github.com/pytorch/pytorch/pull/141264
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143162
Approved by: https://github.com/kit1980
2024-12-13 15:08:44 +00:00
810808d97d Enable cutlass-based all-gather matmul when TORCH_SYMM_MEM_ENABLE_NATIVE_ASYNC_TP is set (#142283)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142283
Approved by: https://github.com/weifengpy, https://github.com/Chillee
2024-12-13 10:29:14 +00:00
3e1f587514 [AOTI] Fix an autotune block grid computation issue (#143098)
Summary: There is a grid computation issue after switching to one-pass codegen in https://github.com/pytorch/pytorch/pull/141980. When max-autotune is turned on, there is an incorrect grid codegen in some cases.

Reviewed By: henrylhtsang

Differential Revision: D67120987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143098
Approved by: https://github.com/henrylhtsang
2024-12-13 07:52:30 +00:00
9f90583ca2 [CI] Run aarch64 tests on Graviton3 (#143129)
Which is armv8.6 that has SVE and BF16 capability

mkldnn_pattern_matcher skips are tracked in https://github.com/pytorch/pytorch/issues/143146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143129
Approved by: https://github.com/digantdesai
2024-12-13 07:39:22 +00:00
c37185c76a [BE] Stop using deprecated APIs in mkldnn_pattern_matcher (#143156)
This should fix
```
/var/lib/jenkins/workspace/test/inductor/test_mkldnn_pattern_matcher.py:157: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
```
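
For reference, a minimal sketch of the replacement spelling named in the warning. The tensors and the choice of `torch.bfloat16` are assumptions for the example and are not taken from the test file.

```python
import torch

a = torch.randn(4, 4)
b = torch.randn(4, 4)

# Non-deprecated form: the device type is passed as the first argument.
with torch.amp.autocast("cpu", dtype=torch.bfloat16):
    out = torch.mm(a, b)

print(out.dtype)
```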

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143156
Approved by: https://github.com/kit1980
2024-12-13 06:37:20 +00:00
cyy
075905b7bd [14/N] Fix extra warnings brought by clang-tidy-17 (#141644)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141644
Approved by: https://github.com/ezyang

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2024-12-13 06:22:13 +00:00
72fd7abb35 [ca] fix flex attention backward HOP capture in initial graph (#143155)
FIXES https://github.com/pytorch/pytorch/issues/142313

So with previous HOPs, compiled autograd could just inline into their body and get their post-dispatch aten representation. You can't do that with this flex attention HOP, which just wants any proxy tracing mechanism to insert it into its graph. Okay, compiled autograd does use proxy tracing, so we can do that.

This is safe because other than the reenter_make_fx call, there were no other make_fx internals usage in the HOP. And compiled autograd specializes on the AOT backward's saved symints which should cover any changes in shapes to the inputs of the HOP.

However, there's still an issue: Dynamo doesn't know how to handle `FlexAttentionBackwardHOP` and will graph break, so the flex attention backward is running in eager as of this PR. The tlparse looks really scuffed after the compiled autograd capture: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpMMHBEH/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143155
Approved by: https://github.com/drisspg
2024-12-13 06:04:39 +00:00
b4f4c75e19 [dynamo] Support multiple inheritance for custom dict construction (#142416)
This patch applies a local and practical workaround for custom dict
construction when multiple inheritance is involved.

Handling multiple inheritance in general could be a lot more involved,
so I created #142414 to track that.

Fixes #141118.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142416
Approved by: https://github.com/jansel
2024-12-13 05:13:05 +00:00
b5d8d2444a add README.md for compile time benchmarks (#143145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143145
Approved by: https://github.com/laithsakka
ghstack dependencies: #141517, #143143
2024-12-13 05:12:26 +00:00
b7ad52abb0 Use new group instead of split group on non-CUDA device (#141469)
Motivation:

Currently, `split_group` only works for the NCCL backend (https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L4745), so we need to use `new_group` on non-CUDA devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141469
Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD
2024-12-13 05:11:33 +00:00
57c46af47a [Inductor][CPU] Add torchao da8w8 pattern with sym quantized act & wgt (#142110)
### Summary

Extends #142036 for Inductor pattern-matching pattern covered for torchao API `int8_dynamic_activation_int8_weight` in the following scenario (inference-only, freezing enabled) -

- int8 quantized (symmetrically) activation (per token quantized).
- Statically (so, scales are also constant. But then they would have been constant even in case of dynamic quantization due to constant weights, anyway) per-channel int8 quantized (symmetrically) weights (which are also constant because freezing is enabled).

The pattern that's matched is `torch._intmm` -> convert to FP32/BF16 -> [optional expand for activation scale] ->`mul` -> `mul`.
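
The arithmetic behind that pattern can be sketched in plain Python. This is a reference computation for illustration only, not the Inductor or oneDNN implementation; `int_mm` and `da8w8_reference` are hypothetical helper names, and the order of the two multiplications is an assumption (mathematically it does not matter).

```python
def int_mm(a, b):
    """Integer matmul with exact accumulation: (M, K) x (K, N) -> (M, N)."""
    k_dim, n_dim = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][n] for k in range(k_dim)) for n in range(n_dim)]
        for row in a
    ]


def da8w8_reference(act_q, act_scale, wgt_q, wgt_scale):
    """act_scale has one entry per token (row); wgt_scale one per output channel."""
    acc = int_mm(act_q, wgt_q)  # int8 x int8 -> int32-style accumulator
    return [
        # Convert to float, then apply the weight scale and the activation scale.
        [float(v) * wgt_scale[n] * act_scale[m] for n, v in enumerate(row)]
        for m, row in enumerate(acc)
    ]


act_q = [[1, -2], [3, 4]]   # quantized activations, 2 tokens
wgt_q = [[5, 6], [7, -8]]   # quantized weights, 2 output channels
result = da8w8_reference(act_q, [0.5, 0.25], wgt_q, [2.0, 0.5])
print(result)  # [[-9.0, 5.5], [21.5, -1.75]]
```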

We don't check if the activation is dynamically quantized or whether the weights are statically quantized, though (since the implementation won't have any side-effects even if that wouldn't be true).

In practice, it also matches the smooth-quant int8 quantized linear pattern if its output is not reshaped (if activation is 2D).

### More details

oneDNN int8 matmul supports application of per-channel weight scale but not a vector activation scale, which could be applied as a post op, but is currently unsupported in ATen. Bias addition (which could be supported with an add post-op) is also unfused.

The fusion pattern used in this PR is `torch._intmm` -> convert to FP32/BF16 ->`mul`, which will be replaced by oneDNN qlinear op.

The speedup over eager-mode is due to 2 reasons -
1. fusion of int8xint8 -> int32 GEMM, conversion to FP32/BF16 & application of weight scale. (In case of BF16, many intermediate conversions are also avoided).
2. weight is pre-packed & cached by Inductor, so a reorder is avoided at run-time.

But, in the future, the whole pattern (including application of activation scale, which would be a mul post-op) + bias could be fused if corresponding support would be enabled in ATen.

### Verification

Added UT in this PR
```
python test/inductor/test_mkldnn_pattern_matcher.py -v -k test_da8w8_sym_act_sym_wgt_with_int_mm
```

#### Corresponding torchao UTs

1. int8 Smoothquant legacy API - `TORCHINDUCTOR_FREEZING=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" python test/integration/test_integration.py -v -k test_non_dynamically_quantizable_linear`.
The difference from #139595 is that there are no reshapes of the linear output in this pattern.

2. int8 da8w8 - symmetrically quantized activation (dynamically) & statically quantized weights -  ` TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" TORCHINDUCTOR_FREEZING=1 python test/integration/test_integration.py -v -k test_int8_dynamic_quant_subclass_api_0_cpu`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142110
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #142036
2024-12-13 04:59:03 +00:00
b731ced91f Prologue Fusion (#134532)
This PR extends our ability to fuse pointwise nodes onto triton templates with the ability to fuse pointwise nodes into triton templates - prologue fusion.

Similar to the store_output api:
`{{store_output(("idx_m", "idx_n"), "acc", "mask")}}`

And the modification api:

```
{{ modification(
    subgraph_number=0,
    output_name="post_mod_scores",
    score="qk",
    out="qk"
) | indent_except_first(1) }}
```

We have:

```{{load_input("B", "b", ("idx_m", "idx_n"), mask=None if EVEN_K else "b_mask", indent_width=8)}}```

Because we are now loading the input with explicit indices and mask, I needed to rewrite the mm kernel to no longer update the [pointers by BLOCK_K](bb03ef7aca/torch/_inductor/kernel/mm.py (L110-L111)) on every iteration and instead on each iteration compute indices from the k_idx of each loop. This did not have any perf difference.

There are a couple main use cases for prologue fusion:

- Fusing dequants into a matmul. particularly for more bandwidth bound scenarios.
- Fusing gather into a matmul. This is useful particularly in MOE. See https://github.com/pytorch/pytorch/issues/134535 for more details.

Prologue fusion is generally much less profitable than epilogue fusion, because it must be applied to an element of an input on each loop of the matmul, compared to only once in the epilogue (gather into matmul is a potential exception). Accordingly, we are much less aggressive in attempting prologue fusion. We only attempt fusion if it does not increase the number of memory bytes read inside the triton template, multiplied by a small factor to allow gathers. This restricts reliably unprofitable fusions like fp32->fp16 inside the kernel. In a future PR we could potentially add an API for being more aggressive if we know we are in a bandwidth-bound regime. See: https://github.com/pytorch/pytorch/pull/134532/files#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R3060-R3066
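
To make the cost argument concrete, here is a pure-Python model of a single output tile. It is a sketch for illustration, not the Triton template: `tiled_matmul`, the dequant prologue and the relu epilogue are all made-up stand-ins. The prologue runs on a freshly loaded `B` tile in every K iteration, while the epilogue runs once on the accumulator.

```python
def tiled_matmul(a, b, block_k, prologue=lambda x: x, epilogue=lambda x: x):
    m_dim, k_dim, n_dim = len(a), len(b), len(b[0])
    acc = [[0.0] * n_dim for _ in range(m_dim)]
    prologue_tiles = 0
    for k_start in range(0, k_dim, block_k):
        # Indices are derived from k_start on every iteration (no running
        # pointer); the bounds check plays the role of the load mask.
        k_indices = [k for k in range(k_start, k_start + block_k) if k < k_dim]
        # Prologue: applied to the loaded B tile inside the loop.
        b_tile = [[prologue(b[k][n]) for n in range(n_dim)] for k in k_indices]
        prologue_tiles += 1
        for m in range(m_dim):
            for n in range(n_dim):
                acc[m][n] += sum(
                    a[m][k] * b_tile[i][n] for i, k in enumerate(k_indices)
                )
    # Epilogue: applied once, after the K loop has finished.
    out = [[epilogue(v) for v in row] for row in acc]
    return out, prologue_tiles


a = [[1, 2, 3], [4, 5, 6]]
b_quantized = [[2, 0], [4, 2], [6, 4]]
out, prologue_tiles = tiled_matmul(
    a,
    b_quantized,
    block_k=2,
    prologue=lambda x: x * 0.5,     # stand-in for a dequantization
    epilogue=lambda v: max(v, 0.0),  # stand-in for a relu
)
print(out)             # [[14.0, 8.0], [32.0, 17.0]]
print(prologue_tiles)  # 2
```

With `K=3` and `block_k=2` the loop runs twice, so the prologue is evaluated on two tiles (the second one partially masked), whereas the epilogue touches the output only once.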

Other notes:

By default we will upcast to fp32 inside every kernel. This matches eager numerics. This is fine enough for epilogue because it is only done once (although it is probably unnecessary for say a relu) but tanks perf for prologue. I am currently using the `codegen_upcast_to_fp32` option to avoid it, but that will not work for libdevice calls that require fp32. We will need https://github.com/pytorch/pytorch/pull/136778/ and dtype-aware codegen to upcast fp16 ops into libdevice calls.

With prologue fusion, we now have essentially separate kernels for each input, and for the output. I had to increase the number of fields that are swapped out in `set_subgraph_body` by a large number :/ I also update the fusion logic because the inputs will have a different group than the outputs. Maybe as part of enabling multiple outputs, this could get cleaned up a bit so..

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134532
Approved by: https://github.com/jansel
2024-12-13 04:18:25 +00:00
ceb664aca6 add float_args benchmark (#143143)
71% improvement with automatic dynamic float arguments

with specialize_float=False
```
float_args,compile_time_instruction_count,346293869
```

with specialize_float=True
```
float_args,compile_time_instruction_count,1198546486
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143143
Approved by: https://github.com/laithsakka
ghstack dependencies: #141517
2024-12-13 03:35:59 +00:00
ab04f3aee1 [ca] set autograd graph task state (#143108)
GraphTask holds metadata needed for a single execution of backward(), it is 1:1 with backward calls, at least for compiled autograd. It is used for certain torch._C global autograd state APIs.

In SAC, we use torch._C._current_graph_task_id() as a dict key to store information during unpack hook execution: a5fb07af27/torch/utils/checkpoint.py (L1128)

If we don't set an active task, it will randomize the key, and will do its logic as if each unpacked tensor was from a different graph task
a5fb07af27/torch/utils/checkpoint.py (L1112-L1115)

The sketchy part of this PR is that in eager autograd, GraphTask is mutated during execution. But inspecting the struct, the mutation seems to only be used to communicate between autograd threads (created when multiple devices are involved) or for deprecated uses. We shouldn't run into the mutation case at all in compiled autograd. Also, only the graph task id is accessible from python hooks.

FIXES https://github.com/pytorch/pytorch/issues/142862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143108
Approved by: https://github.com/jansel, https://github.com/albanD
2024-12-13 03:10:48 +00:00
dbe4b69df0 [Inductor] Fix cooperative reduction tests broken in recent refactor (#143135)
These tests were broken by https://github.com/pytorch/pytorch/pull/142020. This PR updates the fixed configs accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143135
Approved by: https://github.com/jansel, https://github.com/huydhn
2024-12-13 02:03:43 +00:00
cyy
9f5ebf3fc6 Clang-format aten/src/ATen/native/Tensor*{cpp,h} (#143089)
These files are relatively stable, so it should be safe to format them without incurring conflicts

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143089
Approved by: https://github.com/albanD
2024-12-13 00:06:48 +00:00
2533a5a843 upgrade sccache to 0.9.0 (#142854)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142854
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
2024-12-12 22:49:50 +00:00
fb93462904 [Reopen][Inductor][CPU] Fuse SmoothQuant int8 linear pattern (#142036)
Reopen of https://github.com/pytorch/pytorch/pull/139595

**About the PR**
In the implementation of SmoothQuant in Torchao, quantized linear is computed by `_int_mm(a, b)` + `mul(b_scale)` + `mul(a_scale)` (+ optional `add` for bias) with `reshape` and `convert_dtype` in between.
This PR adds a pass to fuse the corresponding patterns:
- (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
- (with bias) `pattern_no_bias -> add -> reshape -> reshape`

The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`, the latter of which is evaluated and frozen during the freezing process of Inductor. The final graph contains `onednn.qlinear_pointwise` only with packed weight constants.

Note that `onednn.qlinear_pointwise` only supports a scalar activation scale, which is a limitation of oneDNN library, so in that case we set activation scale to 1 and bias to none and apply scales and add bias after `onednn.qlinear_pointwise`.

**Validation results**
Accuracy/perplexity is not changed with or without this fusion pass.
Latency is improved by >10% with the fusion pass.
Test method:
- Model: EleutherAI/gpt-j-6b
- Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores
- Using Intel OMP and Tcmalloc
- Running [the example script of SmoothQuant in Torchao](https://github.com/pytorch/ao/blob/main/torchao/prototype/smoothquant/example.py) with `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm
```

Differential Revision: [D66796966](https://our.internmc.facebook.com/intern/diff/D66796966)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142036
Approved by: https://github.com/jerryzh168, https://github.com/jgong5

Co-authored-by: sanchitintel <sanchit.jain@intel.com>
2024-12-12 21:18:03 +00:00
602c86a420 [DSD] Fix strict=False case for DDP (#143038)
Summary:
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143038
Approved by: https://github.com/mori360
2024-12-12 21:15:21 +00:00
a7509e98c5 [pipelining] fix backward_one_chunk when the output of the model is a… (#142237)
fixes #142229

if any of ``stage_output`` is a view, it cannot be detached in place. Replacing it with ``t = t.detach()`` or similar would not free the graph for the output given to the user. Detaching the base tensor could cause a side effect.
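
A small sketch of the underlying restriction, using made-up tensors rather than the pipelining code: an in-place detach is rejected for a view, and the out-of-place `detach()` returns a new tensor while leaving the original object attached to the graph.

```python
import torch

base = torch.ones(4, requires_grad=True) * 2  # non-leaf tensor in an autograd graph
view = base[:2]                               # a view sharing storage with base

try:
    view.detach_()  # in-place detach is not allowed on a view
    in_place_ok = True
except RuntimeError as err:
    in_place_ok = False
    print(err)

# Out-of-place detach works, but only the new tensor is detached;
# the original `view` object still references the graph.
detached = view.detach()
print(detached.requires_grad, view.requires_grad)
```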

The same code is used in ``_backward.py`` (b64a537993/torch/distributed/pipelining/_backward.py (L215)) but does not seem to cause any issue in my case. Maybe needs some investigation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142237
Approved by: https://github.com/H-Huang
2024-12-12 20:59:35 +00:00
39cacc1d81 Fix missing tests on test tool lint job (#143052)
A follow-up from https://github.com/pytorch/pytorch/pull/142476#discussion_r1878888558 where some tests are not discovered correctly by pytest

### Testing

https://github.com/pytorch/pytorch/actions/runs/12287448581/job/34289531307?pr=143052#step:14:162 shows the correct number of tests now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143052
Approved by: https://github.com/ZainRizvi
2024-12-12 20:29:32 +00:00
82ce888273 c10::string_view -> std::string_view in more places (#142517)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142517
Approved by: https://github.com/malfet
2024-12-12 19:45:59 +00:00
0b75b7ff2b [Easy] factor out inductor ophandler decompositions (#142400)
Factor out inductor operator decompositions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142400
Approved by: https://github.com/Chillee, https://github.com/jansel
2024-12-12 19:03:26 +00:00
c170248b78 [Profiler] Enable Iterative Step without profiler in fbcode (#142077)
Summary: Adds post optimizer hook for fbcode so that we can run iterative on demand without having to use a frontend profiler interface. Since this is being used more frequently, it would be convenient for users to be able to trigger this on-demand feature without having to worry about being within some timing window.

Test Plan: Ran iterative tracing without profiler.profile

Differential Revision: D66734119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142077
Approved by: https://github.com/briancoutinho
2024-12-12 19:00:13 +00:00
e3fe5f62b6 Remove Checkout pytorch/builder for Linux Binary Builds (#143125)
Follow Up after: https://github.com/pytorch/pytorch/pull/142282

Remove Checkout pytorch/builder for Linux Binary Builds
I believe we were not using builder already. Hence remove this checkout.
We should be using scripts from this folder:
```
/pytorch/.ci/${{ inputs.PACKAGE_TYPE }}/build.sh
```

TODO: Will followup with removing BUILDER_ROOT everywhere from PyTorch repo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143125
Approved by: https://github.com/kit1980
2024-12-12 18:55:00 +00:00
d48b16a725 Revert "[Dynamo] only import einops if version is lower than 0.7.0 (#142847)"
This reverts commit 357e261b1eded933d98de18ddcef2b083f87259d.

Reverted https://github.com/pytorch/pytorch/pull/142847 on behalf of https://github.com/atalman due to Breaks binary builds, see the comment above ([comment](https://github.com/pytorch/pytorch/pull/142847#issuecomment-2539759580))
2024-12-12 18:44:35 +00:00
b0c3d39e0d [pipelining] Update tutorials and documentation (#143045)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143045
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-12-12 18:42:17 +00:00
ee5bceaee6 [sigmoid] Write the new export schema format to archive without breaking compatibility. (#142511)
Summary:
This diff makes it possible to migrate to PyTorch's OSS export schema from sigmoid. Basically, we add a new field called "methods" to ExportedProgram in Model definition, which contains the thrift schema generated based on schema.py from OSS. This way, we can keep writing the old fields while double writing a new format in equivalent form. Since thrift doesn't support inlining type definitions, we do it manually here and it shouldn't break on-wire compatibility. As long as every sigmoid user is using sigmoid.frontend.serialization.serialize, we always guarantee to have the new format saved as well.

Eventually we will use json deserialization from OSS so we will only keep this double writing for a couple of months. Eventually, we will migrate every serialization path to the OSS workflow.

Test Plan:
buck test mode/opt sigmoid/frontend:serialization_test
buck test mode/opt sigmoid/frontend/test_gpu:serializer_test

Differential Revision: D67044185

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142511
Approved by: https://github.com/desertfire
2024-12-12 18:41:10 +00:00
5dabe2d464 Fix NJT backward tests (#143072)
This PR fixes some issues with NJT backward / compile backward tests:
1. `requires_grad` was not being propagated appropriately during `SampleInput` generation, so a LOT of backward cases were untested before (sad times). This PR utilizes a helper function `_clone()` to clone() / detach() NJTs for SampleInputs while preserving `requires_grad` status. Note: the clone() / detach() stuff is for autograd; can't have two SampleInputs as part of the same autograd graph.
2. Per-sample skips weren't -fully- working; the op logic would still be invoked even with a skip. I found this out thanks to `split_with_sizes`, which segfaults during backwards because it tries to use an NST-specific formula. As annoying as it is, I tried a ton of things but ultimately had to split the `subtest_ctx` into that + a `skip_xfail_ctx` to run the subtests within.
    * Updated all uses of per-sample skips / xfails: 4 in `test_nestedtensor.py` and 1 in `test_vmap.py`
3. Added the appropriate skips / xfails to get everything passing. There are a shitton of bugs to fix!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143072
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2024-12-12 18:06:23 +00:00
d47a80246a [dynamo][pytree][3/N] make CXX pytree traceable: tree_map / tree_map_ (#137399)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137399
Approved by: https://github.com/jansel
ghstack dependencies: #137398
2024-12-12 18:05:25 +00:00
7edeb1005a [dynamo][pytree][2/N] make CXX pytree traceable: tree_flatten / tree_unflatten / tree_structure (#137398)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137398
Approved by: https://github.com/jansel
2024-12-12 18:05:25 +00:00
c85323c5e8 Revert "Tests Generelization for multiple accelerator devices (#139184)"
This reverts commit b576a8c318201b63269f7ff25ec5830d00662a7a.

Reverted https://github.com/pytorch/pytorch/pull/139184 on behalf of https://github.com/clee2000 due to Failing internally when trying to pickle distributed test files D67098795 ([comment](https://github.com/pytorch/pytorch/pull/139184#issuecomment-2539610187))
2024-12-12 17:48:30 +00:00
2f0fe82f6d Revert "[14/N] Fix extra warnings brought by clang-tidy-17 (#141644)"
This reverts commit 24a5a2ef258d2b482ded674cdb9555afaf081402.

Reverted https://github.com/pytorch/pytorch/pull/141644 on behalf of https://github.com/clee2000 due to failing internally D67112938 ([comment](https://github.com/pytorch/pytorch/pull/141644#issuecomment-2539602023))
2024-12-12 17:43:36 +00:00
dc23f1944a Remove unused Python variables in torch/[_-a]* (#133492)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492
Approved by: https://github.com/albanD
2024-12-12 17:39:14 +00:00
7667235a23 c10::optional -> std::optional (#142514)
Fixes issues introduced in https://github.com/pytorch/pytorch/pull/141348 and https://github.com/pytorch/pytorch/pull/139578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142514
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-12 17:23:46 +00:00
520ba556cd [Inductor] Refactor "r" reduction prefix to {"r0_", "r1_"}. (#142020)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243.

# Feature

This PR changes the `RINDEX` / `"r"` symbol type to `(R0_INDEX, R1_INDEX)` and `("r0_", "r1_")`, respectively. This allows the relevant code to support 2D (often ND) reductions. Unlike the parent PR, this one does not change the tiling algorithm, so `"r1_"` is never used. However, it prepares other parts of the system to handle `"r1_"` once we start using it. This should significantly reduce the chances of hitting merge conflicts, making the parent PR much easier to land.

The only change to the generated Triton code is to rename `"rindex"` -> `"r0_index"`, `"RBLOCK"` -> `"R0_BLOCK"`, etc. To maintain compatibility with existing codegen, this also generates aliases to the old reduction variables like `rindex = r0_index`. If we generated 2D reductions (which this PR will not do), the aliases would be more complicated and would collapse 2D multi-indices to linear indices. See some example kernels in the parent PR.

These aliases can be eliminated by the Triton compiler, and should not impact the final machine code running on the GPU. See the perf testing in the parent PR which confirms the aliases do not impact perf.
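
As a rough illustration of the aliasing described above, the following toy sketch emits alias lines of the form `rindex = r0_index`. It is not Inductor's codegen: the helper name `legacy_aliases` and the list of suffixes are hypothetical, and only the `rindex` alias is confirmed by the description; the `RBLOCK` alias is assumed by analogy with the rename.

```python
# Toy illustration only: `legacy_aliases` and the suffix list are hypothetical
# and are not part of Inductor's codegen.
OLD_PREFIX = "r"
NEW_PREFIX = "r0_"


def legacy_aliases(suffixes):
    """Return lines that bind each old reduction name to its renamed counterpart."""
    lines = []
    for suffix in suffixes:
        if suffix.isupper():
            # Upper-case names such as RBLOCK map to R0_BLOCK.
            old, new = OLD_PREFIX.upper() + suffix, NEW_PREFIX.upper() + suffix
        else:
            old, new = OLD_PREFIX + suffix, NEW_PREFIX + suffix
        lines.append(f"{old} = {new}")
    return lines


alias_lines = legacy_aliases(["index", "BLOCK"])
print("\n".join(alias_lines))
```

Because each alias is a plain rebinding of an existing value, a compiler can drop it without changing the kernel, which is consistent with the claim that the aliases do not affect the final machine code.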

# Test plan

The existing CI provides good coverage. This PR modifies the expected code in a few places, renaming reduction variables from `r.*` to `r0_.*`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142020
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@meta.com>
2024-12-12 17:22:20 +00:00
cf538efd0c Revert "Hide torch_python symbols (#142214)"
This reverts commit da76e912a4c58c649061fc84b29a42714897a0ca.

Reverted https://github.com/pytorch/pytorch/pull/142214 on behalf of https://github.com/huydhn due to The MacOS failure looks legit as it shows up in trunk ([comment](https://github.com/pytorch/pytorch/pull/142214#issuecomment-2539543504))
2024-12-12 17:15:51 +00:00
15ee2960e1 [aot] Functionalize aot backward prologue and epilogue wrappers (#142415)
For functional compiled autograd, we're having dynamo trace through the aot backward implementation. To avoid graph breaking and imposing too many restrictions, we allow_in_graph the prologue and epilogue. This adds 2 restrictions:
- code must be available in the global context
- inputs other than tensors/symnodes must be const foldable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142415
Approved by: https://github.com/bdhirsh
2024-12-12 17:14:29 +00:00
30b61e521c [logging] Populate compile_time_autotune_time_us (#143104)
See testing in attached diff

Differential Revision: [D67128210](https://our.internmc.facebook.com/intern/diff/D67128210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143104
Approved by: https://github.com/ezyang
2024-12-12 17:08:43 +00:00
e3ddc0ca33 Support remote caching requiring redis auth (#141679)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141679
Approved by: https://github.com/masnesral
2024-12-12 17:07:50 +00:00
0f78be5573 Fix search icon (#142808)
Removing:

.pytorch-left-menu-search input[type=text] {
    background-image: none;
}
so that the search icon correctly appears in the sphinx searchbox

Also, fixing scrolling

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142808
Approved by: https://github.com/albanD
2024-12-12 16:09:30 +00:00
725526abc5 Fix scan dtypes (#143048)
Fix for https://github.com/pytorch/pytorch/issues/142883. We weren't getting test coverage of scan because the tests were being skipped; see https://github.com/pytorch/pytorch/issues/143053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143048
Approved by: https://github.com/arui-meta, https://github.com/blaine-rister
2024-12-12 15:57:00 +00:00
d83a049232 [EZ] Update lintrunner in CI to 0.12.7 (#143073)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143073
Approved by: https://github.com/wdvr
2024-12-12 15:35:37 +00:00
7cc3a591c2 [FlexAttention] Fix a few more symbolic shape issues (#142816)
# Summary

See  https://github.com/pytorch/pytorch/issues/139064 for more details. This fixes a number of issues with dynamic shapes. Thanks to @alexdremov for finding most of these

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142816
Approved by: https://github.com/yanboliang, https://github.com/ezyang
2024-12-12 15:29:21 +00:00
84f791381a Python 3.13 CI add crossref test to existing linux-focal-py3_13-clang10-build (#143074)
Add  linux-jammy-py3_13-gcc11-build and test - similar to Py 3.9
Add crossref test to existing linux-focal-py3_13-clang10-build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143074
Approved by: https://github.com/malfet
2024-12-12 14:45:56 +00:00
cd1b5924d5 Revert "[Inductor] Use sleef implementation for CPP backend asinh codegen (#142360)"
This reverts commit 79cf8fa75176a8f6bb79d426c6d0f9369d03ff98.

Reverted https://github.com/pytorch/pytorch/pull/142360 on behalf of https://github.com/jeanschmidt due to seems to have broken macos tests ([comment](https://github.com/pytorch/pytorch/pull/142360#issuecomment-2539143039))
2024-12-12 14:42:55 +00:00
30e2b322a1 Add <string> to uninteresting_files (#142984)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142984
Approved by: https://github.com/albanD, https://github.com/IvanKobzarev
2024-12-12 14:35:30 +00:00
91261107e0 debug handler maintain through decomposition (#141612)
Add checks in the ao numeric debugger to guard debug handle consistency across aten op decomposition

Differential Revision: [D66517480](https://our.internmc.facebook.com/intern/diff/D66517480/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141612
Approved by: https://github.com/jerryzh168
2024-12-12 12:26:45 +00:00
18785c1af9 [BE][accelerator] formalize API name {current,set}_device_{idx => index} (#140542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140542
Approved by: https://github.com/guangyey, https://github.com/albanD
2024-12-12 10:53:48 +00:00
a5fb07af27 [Torch][#142396]Resolve Failure When Uploading To Remote Storage (#143046)
Summary: Catch the io.UnsupportedOperation exception so that streams without fileno support don't cause a failure
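
A minimal sketch of the pattern, assuming an in-memory stream as the input; the helper name `fileno_or_none` is hypothetical and is not the function touched by this PR:

```python
import io


def fileno_or_none(stream):
    """Return the stream's file descriptor, or None if it has none."""
    try:
        return stream.fileno()
    except io.UnsupportedOperation:
        # In-memory streams such as io.BytesIO have no underlying descriptor.
        return None


in_memory = io.BytesIO(b"checkpoint bytes")
print(fileno_or_none(in_memory))
```

Callers can then branch on `None` instead of letting the exception abort the upload.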

Test Plan: UT

Differential Revision: D67108487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143046
Approved by: https://github.com/saumishr
2024-12-12 08:17:15 +00:00
497f89ff83 fix dynamo nn module stack fqn (#142823)
Dynamo can produce sources that have funny patterns in their `.name()` that break `nn_module_stack` fqns. Added a test that used to have `._modules` inside nn_module_stack fqns, now doesn't. (Unfortunately couldn't repro a case mentioned in the GH issue where `.slice(...)` is claimed to appear as well.)

Fixes https://github.com/pytorch/pytorch/issues/141939

Differential Revision: [D67064189](https://our.internmc.facebook.com/intern/diff/D67064189/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142823
Approved by: https://github.com/pianpwk, https://github.com/zhxchen17
2024-12-12 07:02:13 +00:00
da76e912a4 Hide torch_python symbols (#142214)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142214
Approved by: https://github.com/ezyang
2024-12-12 07:00:54 +00:00
dcb128d495 [ROCm] TunableOp use thread-safe getenv functions (#142274)
Fixes #142403

~~PR fixes breakage due to this commit
8cd7ad8b48~~

PR is a partial reland of this https://github.com/pytorch/pytorch/pull/140594 with a unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142274
Approved by: https://github.com/jeffdaily, https://github.com/eqy
2024-12-12 06:49:26 +00:00
5ad7d5304c [DTensor][random] add HSDP+TP model init test (#143077)
**Summary**
1. Move the model init tests from `DistTensorRandomOpTest` to `DistTensorRandomInitTest`
2. Added an HSDP+TP meta init test to show the correct model init result in this use case. Note that this test requires 8 GPUs to run and our CI doesn't have that capacity, so it will be skipped in CI. A local run shows that the test passes on an 8-GPU host.

**Test**
`pytest test/distributed/_tensor/test_random_ops.py -s -k test_hsdp_tp_model_meta_init`

<details>
<summary> Test Result </summary>
<img width="3343" alt="image" src="https://github.com/user-attachments/assets/a960c5e6-37bc-49be-9e36-ecc29ed47eb0" />

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143077
Approved by: https://github.com/weifengpy
2024-12-12 06:46:16 +00:00
357e261b1e [Dynamo] only import einops if version is lower than 0.7.0 (#142847)
Fixes internal xref (https://fb.workplace.com/groups/257735836456307/posts/804793021750583/?comment_id=805229281706957&reply_comment_id=805232695039949)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142847
Approved by: https://github.com/zou3519
2024-12-12 06:38:22 +00:00
9701c50bdc [Dynamo] Add missing tensor builtins to allowed functions (#142841)
Fixes https://github.com/pytorch/pytorch/issues/141232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142841
Approved by: https://github.com/yanboliang
2024-12-12 06:38:19 +00:00
b25f64b613 Add-o pipefail for all bash scripts (#143050)
Fixes #142380
I have added -o pipefail in all bash scripts in pytorch/.ci/pytorch. Sorry I didn't double-check the submodule in my last PR. Thanks for the correction! Please contact me again if there are any problems with this fix^^. (Actually contributing to the open source community is an assignment for one of my courses and today is the deadline so I rushed to revise it when I saw an email early in the morning. Haha.)
 @ezyang @malfet @huydhn @zou3519
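
For readers unfamiliar with the option, the sketch below shows what `-o pipefail` changes. It assumes `bash` is available on the PATH; the helper name `bash_exit_code` is made up for this example.

```python
import subprocess


def bash_exit_code(script):
    """Run a snippet under bash and return its exit status."""
    return subprocess.run(["bash", "-c", script]).returncode


# Without pipefail, a pipeline reports only the status of its last command.
without_pipefail = bash_exit_code("false | true")
# With pipefail, a failure anywhere in the pipeline makes the pipeline fail.
with_pipefail = bash_exit_code("set -o pipefail; false | true")
print(without_pipefail, with_pipefail)
```

In a CI script that also uses `set -e`, the second behaviour is what stops the job when an early stage of a pipeline fails.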

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143050
Approved by: https://github.com/ezyang, https://github.com/huydhn

Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
2024-12-12 06:18:41 +00:00
79cf8fa751 [Inductor] Use sleef implementation for CPP backend asinh codegen (#142360)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/142345. Previously, we used `asinh(x) = log(x + sqrt(1 + x**2))` to calculate the result of `asinh`. The issue occurs with an input such as `-10000.1`, which makes `x + sqrt(1 + x**2)` close to 0, and `log(0)` is invalid. This PR uses the `sleef` implementation to fix the issue.
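
The cancellation can be reproduced in plain Python. This is only a sketch: it uses float64 and a larger magnitude (`-1e9`) as a stand-in for the lower-precision case in the issue, and the sign-symmetric rewrite shown here is a textbook workaround, not the `sleef` implementation this PR switches to.

```python
import math


def asinh_naive(x):
    # Direct use of the identity; cancels catastrophically for large negative x.
    return math.log(x + math.sqrt(1.0 + x * x))


def asinh_symmetric(x):
    # asinh is odd, so evaluate on |x| and restore the sign afterwards.
    return math.copysign(math.log(abs(x) + math.sqrt(1.0 + x * x)), x)


x = -1e9  # in float64, 1 + x*x rounds to x*x, so x + sqrt(...) becomes exactly 0.0
try:
    asinh_naive(x)
    naive_failed = False
except ValueError:
    # math.log(0.0) raises ValueError in Python.
    naive_failed = True

print(naive_failed, asinh_symmetric(x), math.asinh(x))
```

The symmetric form avoids the subtraction of two nearly equal numbers, although it is still less accurate than a dedicated implementation for inputs close to zero.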

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_asinh_with_corner_inputs
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142360
Approved by: https://github.com/jgong5
2024-12-12 05:40:48 +00:00
1e2b841675 [ROCm] Prune old gfx archs gfx900/gfx906 from binaries (#142827)
Remove gfx900 and gfx906 archs as they're long-in-the-tooth. Should help reduce the increasing size of ROCm binaries.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142827
Approved by: https://github.com/jeffdaily
2024-12-12 05:33:40 +00:00
cyy
fda43c98d1 Improve implementation of quantized_batch_norm (#141570)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141570
Approved by: https://github.com/albanD
2024-12-12 04:35:00 +00:00
cyy
20df80a669 Remove unneeded optional dereference (#141578)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141578
Approved by: https://github.com/swolchok
2024-12-12 04:34:43 +00:00
cyy
f7b9533c3f [4/N] Apply bugprone-unchecked-optional-access (#142832)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142832
Approved by: https://github.com/albanD
2024-12-12 04:33:32 +00:00
fbbafd0320 Turn on AOTAutogradCache by default on open source (#141981)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141981
Approved by: https://github.com/bdhirsh, https://github.com/oulgen
2024-12-12 04:21:11 +00:00
4d0775462e E2E composability testing (#141398)
Add 3D(pp+tp+fsdp) test `test_3d_with_tp_dp_pp` at test_pp_compodability
Currently provide @parametrize on
"ScheduleClass" for pp in [ScheduleGPipe, Schedule1F1B, ScheduleInterleaved1F1B, ScheduleLoopedBFS, ScheduleInterleavedZeroBubble]
"MixedPrecisionParam" for fsdp in [torch.bfloat16, torch.float32]

Future work:
1. add fp8
2. add cp(context parallelism) to enable 4D test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141398
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-12-12 04:19:29 +00:00
cyy
2903cf0ad8 Re-enable some C++ warnings (#142332)
This PR re-enables some C++ warnings since the code base is fairly clean. Meanwhile, `-Wextra-semi` is disabled on CUDA-generated code since there is no way to fix those warnings without the cooperation of the CUDA team.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142332
Approved by: https://github.com/albanD, https://github.com/eqy
2024-12-12 04:02:12 +00:00
f892f9862a [ROCM] Enable *_load_dwordx4 ISA for BFloat16 and Half. (#141397)
Remove input_vec_size constexpr and move it to template parameter. This enables generation of vectorized loads in ROCm AMDGPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141397
Approved by: https://github.com/jeffdaily

Co-authored-by: Jerry Mannil <jerry.mannil@amd.com>
2024-12-12 03:27:49 +00:00
4d8357e912 [CD] Use Anaconda cmake for Mac builds (#143054)
To find Anaconda-env-installed OpenMP
(as OpenMP from PyPI is looking for it in a different place)

For posterity: our build script names are very confusing:
 - [`.ci/wheel/build_wheel.sh`](6cb6e8d790/.ci/wheel/build_wheel.sh) is only used for MacOS wheel/libtorch builds
 - [`.ci/manywheel/build.sh`](6cb6e8d790/.ci/manywheel/build.sh) is used for Linux wheel/libtorch builds
 - [`.ci/pytorch/windows/build_pytorch.bat`](6cb6e8d790/.ci/pytorch/windows/build_pytorch.bat) is used for Windows wheel builds

Fixes https://github.com/pytorch/pytorch/issues/142873
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143054
Approved by: https://github.com/Jack-Khuu, https://github.com/atalman
2024-12-12 03:05:46 +00:00
cb354f8b47 [PGNCCL] Move NCCLComm impl to cpp (#142826)
BE as titled. No behavior change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142826
Approved by: https://github.com/wconstab, https://github.com/c-p-i-o
2024-12-12 02:45:52 +00:00
06075d3d18 [Inductor][CPP] Fix Mask Dtype mismatch (#142103)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/141559. The `vec_mask` store data type isn't aligned when doing `bitwise_and`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142103
Approved by: https://github.com/jgong5
2024-12-12 01:21:32 +00:00
d68403df3b filelock: Make waitcounter variant to use (#139816)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139816
Approved by: https://github.com/ezyang
2024-12-12 01:18:34 +00:00
6cb6e8d790 Python 3.11, 3.12 Remove tests covered by 3.13 (#143078)
We do have linux-focal-py3_13-clang10-build and test. Hence removing linux-focal-py3_11-clang10-build/test and linux-focal-py3_12-clang10-build/test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143078
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-12-12 01:12:00 +00:00
1dd6f21029 Cuda 12.1 - Remove from trunk tests (#143076)
Remove cuda 12.1 from trunk tests. This is covered by 12.4 tests.
Move ``libtorch-linux-focal-cuda12_4-py3_7-gcc9-debug-build`` -> ``libtorch-linux-focal-cuda12_4-py3_10-gcc9-debug-build``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143076
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-12-12 01:10:09 +00:00
bd7d81db9e Use validate-docker-images workflow from test-infra (#143081)
After PR: https://github.com/pytorch/test-infra/pull/6029 use validate-docker-images.yml from test-infra.
Related to: https://github.com/pytorch/builder/issues/2054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143081
Approved by: https://github.com/huydhn
2024-12-12 00:24:27 +00:00
cyy
db81a3f31c [TorchGen] remove remove_non_owning_ref_types from valuetype_type (#142449)
It is not used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142449
Approved by: https://github.com/ezyang
2024-12-12 00:15:44 +00:00
1b3f8b7589 Revert "[RELAND] Add UTs for accelerator device-agnostic runtime APIs (#133572)"
This reverts commit 209119424922b135fef39aba1f25da3b67f5879a.

Reverted https://github.com/pytorch/pytorch/pull/133572 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the new test is still very flaky on MacOS even when it does not segfault anymore ([comment](https://github.com/pytorch/pytorch/pull/133572#issuecomment-2537256522))
2024-12-11 21:47:18 +00:00
dfe5669076 Revert "[RELAND] Add device-agnostic runtime Device/Stream C++ API (#138677)"
This reverts commit 734bb01460d59e661e9114e7aa17e04821e4b57a.

Reverted https://github.com/pytorch/pytorch/pull/138677 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the new test is still very flaky on MacOS even when it does not segfault anymore ([comment](https://github.com/pytorch/pytorch/pull/133572#issuecomment-2537256522))
2024-12-11 21:47:17 +00:00
cd50bd8477 Revert "[BE][accelerator] formalize API name {current,set}_device_{idx => index} (#140542)"
This reverts commit fb02b40d27737213e0547dec0e30977dfc50f2f3.

Reverted https://github.com/pytorch/pytorch/pull/140542 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I need to revert this in order to revert https://github.com/pytorch/pytorch/pull/133572#issuecomment-2537204202 due to a conflict ([comment](https://github.com/pytorch/pytorch/pull/140542#issuecomment-2537253665))
2024-12-11 21:44:23 +00:00
de313f1155 [foreach_map] Initial foreach map HOP impl for inference (#142098)
This is the initial foreach map HOP for pointwise ops which will be extended in the future to support grouped GEMMs and other ops.

This PR utilizes PrimHOPBase class to represent foreach_map as a HOP with a single subgraph. The way this is implemented is that the user API `foreach_map` provides a single pointwise torch op, and internally this function calls a polyfill which has the same semantics as a foreach op (ie iterates over lists of operands applying the op elementwise). The higher order op is passed through the stack down to inductor where a lowering in essence inlines the subgraph into the main graph. This is done by interpreting it with a pointwise subgraph lowering, grouping the outputs by device, and registering the output buffers as foreach groups as applicable. For testing I was able to reuse the existing foreach tests by creating a wrapper function which matches the foreach op interfaces for those tests and then run all of the existing foreach tests on foreach_map.

TODO before landing:
* Add tests for general functions
* Test warning if unsupported op will block fusion

Followups:
* I need to add tests for backwards (this will be a followup PR because backwards will  require other work as well)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142098
Approved by: https://github.com/eellison
2024-12-11 21:32:11 +00:00
bd199bc754 [EZ] Move slow job from CU12.1 to CU12.4 (#142856)
I thought it was migrated a while back

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142856
Approved by: https://github.com/huydhn, https://github.com/atalman, https://github.com/ZainRizvi
2024-12-11 21:12:35 +00:00
688f44824b DistributedDataParallel: add init_sync option to control collectives during initialization (#142824)
This controls whether or not we run collectives during the DDP init function. This makes it easier to use fault tolerant ProcessGroup implementations that may not be starting at the same time.

torchft uses a dummy process group and a comm hook to get around these checks. With this change torchft can use the normal ProcessGroup API via the stock comm hook.

https://github.com/pytorch-labs/torchft/blob/main/torchft/ddp.py#L50-L59

Test plan:

```
pytest test/distributed/test_c10d_pypg.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142824
Approved by: https://github.com/wconstab, https://github.com/fegin, https://github.com/H-Huang
2024-12-11 20:28:38 +00:00
fd65bd755d [BE] replace incorrect .. note:: invocations (#142868)
Something I've noticed is that a lot of the distributed sites don't render on our docs at all, but if they ever do, the notes will render properly now 😛

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142868
Approved by: https://github.com/albanD
2024-12-11 19:58:18 +00:00
0b96413dbf Upgrade expecttest to 0.3.0 (#142869)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142869
Approved by: https://github.com/albanD, https://github.com/malfet
2024-12-11 19:04:16 +00:00
cyy
e5f08c0cbf [TorchGen] Remove cpp_type_registration_declarations (#142452)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142452
Approved by: https://github.com/ezyang
2024-12-11 19:01:36 +00:00
cyy
e228381846 [TorchGen] Simplify argument_type_str (#142491)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142491
Approved by: https://github.com/ezyang
2024-12-11 19:01:20 +00:00
42d4eec5f3 Don't install lintrunner on S390 (#142876)
Not sure if there are many users of this platform, but hopefully this will fix https://github.com/pytorch/pytorch/issues/142872

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142876
Approved by: https://github.com/jeanschmidt
2024-12-11 18:54:12 +00:00
e647b6d590 Fix undesired specialization on slice after split. (#142372)
Fix: #141251

This PR adds a few static guard checks when decomposing and lowering the `slice`
operation, so that we avoid adding unnecessary guards. Specifically, when clamping the end
values.

In summary, the changes are:

- `slice` dynamo decomposition: checks `end >= sizes[dim]` statically. If we don't know
  that, the following guard ensures that we (don't) need clamping.
- `evaluate_min` inductor `sizevar` function: checks whether we can solve it statically or
  not, before actually creating a new guard.

The latter had to be changed because `evaluate_min` (called by `ir.SliceView` constructor)
would always try to create a guard based on the hints operation result. However, if both
`left` and `right` hints were true, it would default to `left <= right` guard. By checking
the guards statically before, we can avoid that.

```python
N = 16

@torch.compile(backend="inductor", dynamic=False, fullgraph=True)
def fn(x):
    splits = torch.ops.aten.split.Tensor(x, N)
    first = splits[0]
    return torch.ops.aten.slice.Tensor(first, 0, 0, N)

x = torch.arange(N)
torch._dynamo.mark_dynamic(x, 0)

fn(x)
```
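
The "check statically first, guard only as a fallback" idea can be modeled with a toy interval representation. Everything in this sketch is an assumption made for illustration: the `Bounded` class, the string guards, and this `evaluate_min` are stand-ins, whereas the real function works on symbolic expressions inside Inductor's `sizevar` machinery.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounded:
    """A symbolic size with a runtime hint and statically known bounds."""
    name: str
    hint: int
    lo: int
    hi: int


def evaluate_min(left, right, guards):
    # Static case: the bounds alone decide the answer, so no guard is needed.
    if left.hi <= right.lo:
        return left
    if right.hi <= left.lo:
        return right
    # Fallback: decide from the hints and record the assumption as a guard.
    if left.hint <= right.hint:
        guards.append(f"{left.name} <= {right.name}")
        return left
    guards.append(f"{right.name} <= {left.name}")
    return right


guards = []
end = Bounded("end", hint=16, lo=16, hi=16)
size = Bounded("s0", hint=16, lo=16, hi=1 << 30)
smaller = evaluate_min(end, size, guards)
print(smaller.name, guards)
```

When the bounds are not decisive (for example, if `s0` could be as small as 2), the same call falls back to the hints and records a guard, which is the behaviour the PR tries to avoid whenever a static answer exists.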

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142372
Approved by: https://github.com/ezyang
2024-12-11 18:52:17 +00:00
0ddb33ba22 [ONNX] Avoid overwriting overlapped decomposed functions (#142831)
Fixes #141770

The decomposed function in `torch.export.default_decompositions().items()` is overwritten by `torch._decomp.decomposition_table`. From the perspective of `torch.onnx.export()`, we should instead respect the table of decompositions in `torch.export.default_decompositions().items()` and avoid overwriting it with `torch._decomp.decomposition_table`.
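
The precedence rule can be sketched with plain dictionaries. The table contents and the helper name `merge_decompositions` below are placeholders, not the real decomposition tables or the exporter's code:

```python
def merge_decompositions(preferred, fallback):
    """Combine two op -> function tables, keeping `preferred` on overlap."""
    merged = dict(preferred)
    for op, fn in fallback.items():
        # setdefault only inserts when the op is not already present.
        merged.setdefault(op, fn)
    return merged


# Placeholder entries standing in for the two real tables.
export_table = {"aten.upsample": "export_impl"}
general_table = {"aten.upsample": "general_impl", "aten.addmm": "general_addmm"}
merged = merge_decompositions(export_table, general_table)
print(merged)
```

Ops that appear in both tables keep the export-side entry, while ops that appear only in the general table are still picked up.
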
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142831
Approved by: https://github.com/justinchuby
2024-12-11 18:47:40 +00:00
c632e29774 [hop][dynamo] support torch.SymInt inputs (#141524)
Fixes https://github.com/pytorch/pytorch/issues/141305.

```python
        class M(torch.nn.Module):
            def forward(self, x, y, z):
                a = y.shape[0]
                b = z.shape[0]

                def true_fn(x):
                    return x + a

                def false_fn(x):
                    return x + b * z

                # When exporting with non-strict: a and b are symints,
                # so torch.compile needs to wrap and trace symint inputs.
                return torch.cond(x.shape[0] > 5, true_fn, false_fn, (x,))
```

In non-strict export, when inputs are annotated with dynamic shapes, `a` and `b` in the example above are of type torch.SymInt, so true_fn and false_fn have closures over torch.SymInt values. The error was triggered because we didn't handle SymInt inputs in dynamo and ended up using a UserDefinedObjectVariable for them, which doesn't have a proxy. We added support by following how we previously handled SymBool inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141524
Approved by: https://github.com/zou3519
ghstack dependencies: #142185
2024-12-11 18:46:58 +00:00
a8fa98ccef skip test dynamo for aot_dispatch tests on ci (#142185)
A lot of tests in test_aotdispatch.py are not meaningful (from the user's perspective) when run with dynamo, so we skip them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142185
Approved by: https://github.com/zou3519
2024-12-11 18:46:58 +00:00
cyy
24a5a2ef25 [14/N] Fix extra warnings brought by clang-tidy-17 (#141644)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141644
Approved by: https://github.com/ezyang
2024-12-11 18:40:42 +00:00
be27dbf2b8 Enable CPP/CUDAExtension with py_limited_api for python agnosticism (#138088)
Getting tested with ao, but now there is a real test I added.

## What does this PR do?

We want to allow custom PyTorch extensions to be able to build one wheel for multiple Python versions, in other words, achieve python agnosticism. It turns out that there is such a way that setuptools/Python provides already! Namely, if the user promises to use only the Python limited API in their extension, they can pass in `py_limited_api` to their Extension class and to the bdist_wheel command (with a min python version) in order to build 1 wheel that will suffice across multiple Python versions.

Sounds lovely! Why don't people do that already with PyTorch? Well 2 things. This workflow is hardly documented (even searching for python agnostic specifically does not reveal many answers) so I'd expect that people simply don't know about it. But even if they did, _PyTorch_ custom Extensions would still not work because we always link torch_python, which does not abide by py_limited_api rules.

So this is where this PR comes in! We respect when the user specifies py_limited_api and skip linking torch_python under that condition, allowing users to enroll in the provided functionality I just described.
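
For orientation, here is roughly what opting in looks like on the setuptools side. This is a sketch under assumptions: it uses plain `setuptools.Extension` as a stand-in for the PyTorch extension classes this PR changes, and the package name, source path, and minimum version (`cp39`) are hypothetical.

```python
from setuptools import Extension

# Hypothetical extension name and source path, for illustration only.
ext = Extension(
    "my_ext._C",
    sources=["my_ext/csrc/ops.cpp"],
    # Restrict the C API to the limited API as of Python 3.9.
    define_macros=[("Py_LIMITED_API", "0x03090000")],
    py_limited_api=True,
)

# Intended to be passed as `options=` to setup(); it asks bdist_wheel to tag
# the wheel as abi3 with cp39 as the minimum supported version.
wheel_options = {"bdist_wheel": {"py_limited_api": "cp39"}}

print(ext.py_limited_api, wheel_options)
```

The flag on the extension and the `bdist_wheel` option have to agree; the extension flag alone does not change the wheel's tag.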

## How do I know this PR works?

I manually tested my silly little ultra_norm locally (with `import python_agnostic`) and wrote a test case for the extension showing that
- torch_python doesn't show up in the ldd tree
- no Py- symbols show up
It may be a little confusing that our test case is actually python-free (more clean than python-agnostic) but it is sufficient (and not necessary) towards showing that this change works.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138088
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-12-11 18:22:55 +00:00
fb02b40d27 [BE][accelerator] formalize API name {current,set}_device_{idx => index} (#140542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140542
Approved by: https://github.com/guangyey, https://github.com/albanD
2024-12-11 17:57:56 +00:00
cyy
82aaf64422 [3/N] Apply py39 ruff fixes (#142115)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142115
Approved by: https://github.com/ezyang
2024-12-11 17:50:10 +00:00
5727 changed files with 334831 additions and 157957 deletions

View File

@ -3,22 +3,15 @@ set -eux -o pipefail
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
if [[ "$GPU_ARCH_VERSION" == *"12.6"* ]]; then
export TORCH_CUDA_ARCH_LIST="9.0"
elif [[ "$GPU_ARCH_VERSION" == *"12.8"* ]]; then
export TORCH_CUDA_ARCH_LIST="9.0;10.0;12.0"
fi
SCRIPTPATH="$( cd -- "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"
source $SCRIPTPATH/aarch64_ci_setup.sh
tagged_version() {
GIT_DESCRIBE="git --git-dir /pytorch/.git describe --tags --match v[0-9]*.[0-9]*.[0-9]*"
if ${GIT_DESCRIBE} --exact >/dev/null; then
${GIT_DESCRIBE}
else
return 1
fi
}
if tagged_version >/dev/null; then
export OVERRIDE_PACKAGE_VERSION="$(tagged_version | sed -e 's/^v//' -e 's/-.*$//')"
fi
###############################################################################
# Run aarch64 builder python
###############################################################################
@ -27,7 +20,7 @@ cd /
# on the mounted pytorch repo
git config --global --add safe.directory /pytorch
pip install -r /pytorch/requirements.txt
pip install auditwheel
pip install auditwheel==6.2.0
if [ "$DESIRED_CUDA" = "cpu" ]; then
echo "BASE_CUDA_VERSION is not set. Building cpu wheel."
#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files

View File

@ -5,16 +5,14 @@ set -eux -o pipefail
# By creating symlinks from desired /opt/python to /usr/local/bin/
NUMPY_VERSION=2.0.2
PYGIT2_VERSION=1.15.1
if [[ "$DESIRED_PYTHON" == "3.13" ]]; then
if [[ "$DESIRED_PYTHON" == "3.13" || "$DESIRED_PYTHON" == "3.13t" ]]; then
NUMPY_VERSION=2.1.2
PYGIT2_VERSION=1.16.0
fi
SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"
source $SCRIPTPATH/../manywheel/set_desired_python.sh
pip install -q numpy==${NUMPY_VERSION} pyyaml==6.0.2 scons==4.7.0 ninja==1.11.1 patchelf==0.17.2 pygit2==${PYGIT2_VERSION}
pip install -q numpy==${NUMPY_VERSION} pyyaml==6.0.2 scons==4.7.0 ninja==1.11.1 patchelf==0.17.2
for tool in python python3 pip pip3 ninja scons patchelf; do
ln -sf ${DESIRED_PYTHON_BIN_DIR}/${tool} /usr/local/bin;

View File

@ -4,12 +4,9 @@
import os
import shutil
from subprocess import check_call, check_output
from typing import List
from pygit2 import Repository
def list_dir(path: str) -> List[str]:
def list_dir(path: str) -> list[str]:
"""'
Helper for getting paths for Python
"""
@ -42,7 +39,7 @@ def build_ArmComputeLibrary() -> None:
"clone",
"https://github.com/ARM-software/ComputeLibrary.git",
"-b",
"v24.09",
"v25.02",
"--depth",
"1",
"--shallow-submodules",
@ -58,7 +55,7 @@ def build_ArmComputeLibrary() -> None:
shutil.copytree(f"{acl_checkout_dir}/{d}", f"{acl_install_dir}/{d}")
def update_wheel(wheel_path) -> None:
def update_wheel(wheel_path, desired_cuda) -> None:
"""
Update the cuda wheel libraries
"""
@ -80,7 +77,6 @@ def update_wheel(wheel_path) -> None:
"/usr/local/cuda/lib64/libnvToolsExt.so.1",
"/usr/local/cuda/lib64/libnvJitLink.so.12",
"/usr/local/cuda/lib64/libnvrtc.so.12",
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.6",
"/usr/local/cuda/lib64/libcudnn_adv.so.9",
"/usr/local/cuda/lib64/libcudnn_cnn.so.9",
"/usr/local/cuda/lib64/libcudnn_graph.so.9",
@ -100,6 +96,18 @@ def update_wheel(wheel_path) -> None:
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
]
if "126" in desired_cuda:
libs_to_copy += [
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.6",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
]
elif "128" in desired_cuda:
libs_to_copy += [
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.8",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
]
else:
libs_to_copy += [
"/opt/OpenBLAS/lib/libopenblas.so.0",
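The CUDA-specific additions in the hunk above amount to a lookup keyed on the version digits found in `desired_cuda`. A minimal sketch of that selection follows, assuming `DESIRED_CUDA` carries values such as `cu126` or `cu128` (an assumption; the hunk only shows the substring tests). The helper name `cuda_specific_libs` is hypothetical and does not exist in the script.

```python
# Hypothetical helper for illustration; it mirrors the substring checks on
# desired_cuda shown in the hunk above and is not part of the build script.
CUDA_LIB_DIR = "/usr/local/cuda/lib64"


def cuda_specific_libs(desired_cuda: str) -> list[str]:
    """Return the version-dependent libraries for a DESIRED_CUDA value."""
    for digits, dotted in (("126", "12.6"), ("128", "12.8")):
        if digits in desired_cuda:
            return [
                f"{CUDA_LIB_DIR}/libnvrtc-builtins.so.{dotted}",
                f"{CUDA_LIB_DIR}/libcufile.so.0",
                f"{CUDA_LIB_DIR}/libcufile_rdma.so.1",
            ]
    # No match: this sketch returns an empty list. How the real script
    # treats other values is not visible in the flattened hunk.
    return []


if __name__ == "__main__":
    for lib in cuda_specific_libs("cu128"):
        print(lib)
```

Only the `libnvrtc-builtins` soname differs between the two branches, which is why a single table of version pairs is enough here.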
@ -128,6 +136,9 @@ def complete_wheel(folder: str) -> str:
"""
wheel_name = list_dir(f"/{folder}/dist")[0]
# Note: for CUDA builds we don't run auditwheel, because a custom script
# (the update_wheel() method) packages the CUDA dependencies into the wheel.
# However, the filename must still reflect the correct manylinux platform.
if "pytorch" in folder and not enable_cuda:
print("Repairing Wheel with AuditWheel")
check_call(["auditwheel", "repair", f"dist/{wheel_name}"], cwd=folder)
@ -139,7 +150,14 @@ def complete_wheel(folder: str) -> str:
f"/{folder}/dist/{repaired_wheel_name}",
)
else:
repaired_wheel_name = wheel_name
repaired_wheel_name = wheel_name.replace(
"linux_aarch64", "manylinux_2_28_aarch64"
)
print(f"Renaming {wheel_name} wheel to {repaired_wheel_name}")
os.rename(
f"/{folder}/dist/{wheel_name}",
f"/{folder}/dist/{repaired_wheel_name}",
)
print(f"Copying {repaired_wheel_name} to artifacts")
shutil.copy2(
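The rename in the hunk above is a plain substring replacement on the wheel filename. A small sketch of its effect is shown here; the wheel filename is a made-up example and `manylinux_name` is a hypothetical helper, not a function in the script.

```python
def manylinux_name(wheel_name: str) -> str:
    """Swap the generic aarch64 platform tag for the manylinux_2_28 tag."""
    return wheel_name.replace("linux_aarch64", "manylinux_2_28_aarch64")


if __name__ == "__main__":
    # Hypothetical filename, used only to show the transformation.
    print(manylinux_name("torch-2.8.0-cp310-cp310-linux_aarch64.whl"))
```

Because `manylinux_2_28_aarch64` does not itself contain the substring `linux_aarch64`, applying the replacement to an already renamed wheel leaves the name unchanged.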
@ -171,22 +189,22 @@ if __name__ == "__main__":
args = parse_arguments()
enable_mkldnn = args.enable_mkldnn
enable_cuda = args.enable_cuda
repo = Repository("/pytorch")
branch = repo.head.name
if branch == "HEAD":
branch = "master"
branch = check_output(
["git", "rev-parse", "--abbrev-ref", "HEAD"], cwd="/pytorch"
).decode().strip()
print("Building PyTorch wheel")
build_vars = "MAX_JOBS=5 CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000 "
os.system("cd /pytorch; python setup.py clean")
override_package_version = os.getenv("OVERRIDE_PACKAGE_VERSION")
desired_cuda = os.getenv("DESIRED_CUDA")
if override_package_version is not None:
version = override_package_version
build_vars += (
f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version} PYTORCH_BUILD_NUMBER=1 "
)
elif branch in ["nightly", "master"]:
elif branch in ["nightly", "main"]:
build_date = (
check_output(["git", "log", "--pretty=format:%cs", "-1"], cwd="/pytorch")
.decode()
@ -196,12 +214,11 @@ if __name__ == "__main__":
check_output(["cat", "version.txt"], cwd="/pytorch").decode().strip()[:-2]
)
if enable_cuda:
desired_cuda = os.getenv("DESIRED_CUDA")
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version}.dev{build_date}+{desired_cuda} PYTORCH_BUILD_NUMBER=1 "
else:
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version}.dev{build_date} PYTORCH_BUILD_NUMBER=1 "
elif branch.startswith(("v1.", "v2.")):
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1:branch.find('-')]} PYTORCH_BUILD_NUMBER=1 "
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1 : branch.find('-')]} PYTORCH_BUILD_NUMBER=1 "
if enable_mkldnn:
build_ArmComputeLibrary()
@ -225,6 +242,6 @@ if __name__ == "__main__":
print("Updating Cuda Dependency")
filename = os.listdir("/pytorch/dist/")
wheel_path = f"/pytorch/dist/{filename[0]}"
update_wheel(wheel_path)
update_wheel(wheel_path, desired_cuda)
pytorch_wheel_name = complete_wheel("/pytorch/")
print(f"Build Complete. Created {pytorch_wheel_name}..")
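The version selection in this script follows a fixed precedence: an explicit override first, then a dated development version on `nightly`/`main`, then a version cut from a release branch name. A minimal sketch of that precedence is given below. The function `build_version` and its parameters are hypothetical, and the presence of `desired_cuda` stands in for the script's `enable_cuda` flag.

```python
from typing import Optional


def build_version(
    branch: str,
    base_version: str,
    build_date: str,
    override: Optional[str] = None,
    desired_cuda: Optional[str] = None,
) -> Optional[str]:
    """Pick a PYTORCH_BUILD_VERSION value using the script's precedence."""
    # Output captured from git ends with a newline, so normalize it first.
    branch = branch.strip()
    if override is not None:
        return override
    if branch in ("nightly", "main"):
        suffix = f"+{desired_cuda}" if desired_cuda else ""
        return f"{base_version}.dev{build_date}{suffix}"
    if branch.startswith(("v1.", "v2.")):
        # "v2.7.0-rc1" -> "2.7.0": drop the leading "v" and the "-..." suffix.
        return branch[1 : branch.find("-")]
    return None  # the script sets no version for other branches


if __name__ == "__main__":
    print(build_version("main\n", "2.8.0", "20250522", desired_cuda="cu128"))
    print(build_version("v2.7.0-rc1", "", ""))
```

One caveat of the slice: when the branch name contains no `-`, `find` returns `-1` and the slice drops the final character instead, so `v2.7.0` would yield `2.7.`.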

View File

@ -12,22 +12,22 @@ import os
import subprocess
import sys
import time
from typing import Dict, List, Optional, Tuple, Union
from typing import Optional, Union
import boto3
# AMI images for us-east-1, change the following based on your ~/.aws/config
os_amis = {
"ubuntu18_04": "ami-078eece1d8119409f", # login_name: ubuntu
"ubuntu20_04": "ami-052eac90edaa9d08f", # login_name: ubuntu
"ubuntu22_04": "ami-0c6c29c5125214c77", # login_name: ubuntu
"redhat8": "ami-0698b90665a2ddcf1", # login_name: ec2-user
}
ubuntu18_04_ami = os_amis["ubuntu18_04"]
ubuntu20_04_ami = os_amis["ubuntu20_04"]
def compute_keyfile_path(key_name: Optional[str] = None) -> Tuple[str, str]:
def compute_keyfile_path(key_name: Optional[str] = None) -> tuple[str, str]:
if key_name is None:
key_name = os.getenv("AWS_KEY_NAME")
if key_name is None:
@ -57,7 +57,7 @@ def ec2_instances_by_id(instance_id):
def start_instance(
key_name, ami=ubuntu18_04_ami, instance_type="t4g.2xlarge", ebs_size: int = 50
key_name, ami=ubuntu20_04_ami, instance_type="t4g.2xlarge", ebs_size: int = 50
):
inst = ec2.create_instances(
ImageId=ami,
@ -96,7 +96,7 @@ class RemoteHost:
self.keyfile_path = keyfile_path
self.login_name = login_name
def _gen_ssh_prefix(self) -> List[str]:
def _gen_ssh_prefix(self) -> list[str]:
return [
"ssh",
"-o",
@ -108,13 +108,13 @@ class RemoteHost:
]
@staticmethod
def _split_cmd(args: Union[str, List[str]]) -> List[str]:
def _split_cmd(args: Union[str, list[str]]) -> list[str]:
return args.split() if isinstance(args, str) else args
def run_ssh_cmd(self, args: Union[str, List[str]]) -> None:
def run_ssh_cmd(self, args: Union[str, list[str]]) -> None:
subprocess.check_call(self._gen_ssh_prefix() + self._split_cmd(args))
def check_ssh_output(self, args: Union[str, List[str]]) -> str:
def check_ssh_output(self, args: Union[str, list[str]]) -> str:
return subprocess.check_output(
self._gen_ssh_prefix() + self._split_cmd(args)
).decode("utf-8")
@ -157,7 +157,7 @@ class RemoteHost:
def using_docker(self) -> bool:
return self.container_id is not None
def run_cmd(self, args: Union[str, List[str]]) -> None:
def run_cmd(self, args: Union[str, list[str]]) -> None:
if not self.using_docker():
return self.run_ssh_cmd(args)
assert self.container_id is not None
@ -178,7 +178,7 @@ class RemoteHost:
if rc != 0:
raise subprocess.CalledProcessError(rc, docker_cmd)
def check_output(self, args: Union[str, List[str]]) -> str:
def check_output(self, args: Union[str, list[str]]) -> str:
if not self.using_docker():
return self.check_ssh_output(args)
assert self.container_id is not None
@ -230,7 +230,7 @@ class RemoteHost:
)
self.download_file(remote_file, local_file)
def list_dir(self, path: str) -> List[str]:
def list_dir(self, path: str) -> list[str]:
return self.check_output(["ls", "-1", path]).split("\n")
@ -327,7 +327,7 @@ def build_ArmComputeLibrary(host: RemoteHost, git_clone_flags: str = "") -> None
]
)
host.run_cmd(
f"git clone https://github.com/ARM-software/ComputeLibrary.git -b v24.09 {git_clone_flags}"
f"git clone https://github.com/ARM-software/ComputeLibrary.git -b v25.02 {git_clone_flags}"
)
host.run_cmd(f"cd ComputeLibrary && scons Werror=1 -j8 {acl_build_flags}")
@ -358,7 +358,7 @@ def checkout_repo(
branch: str = "main",
url: str,
git_clone_flags: str,
mapping: Dict[str, Tuple[str, str]],
mapping: dict[str, tuple[str, str]],
) -> Optional[str]:
for prefix in mapping:
if not branch.startswith(prefix):
@ -619,9 +619,11 @@ def build_torchaudio(
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
host.run_cmd(f"cd audio && export FFMPEG_ROOT=$(pwd)/third_party/ffmpeg && export USE_FFMPEG=1 \
host.run_cmd(
f"cd audio && export FFMPEG_ROOT=$(pwd)/third_party/ffmpeg && export USE_FFMPEG=1 \
&& ./packaging/ffmpeg/build.sh \
&& {build_vars} python3 setup.py bdist_wheel")
&& {build_vars} python3 setup.py bdist_wheel"
)
wheel_name = host.list_dir("audio/dist")[0]
embed_libgomp(host, use_conda, os.path.join("audio", "dist", wheel_name))
@ -655,18 +657,6 @@ def configure_system(
"sudo apt-get install -y python3-dev python3-yaml python3-setuptools python3-wheel python3-pip"
)
host.run_cmd("pip3 install dataclasses typing-extensions")
# Install and switch to gcc-8 on Ubuntu-18.04
if not host.using_docker() and host.ami == ubuntu18_04_ami and compiler == "gcc-8":
host.run_cmd("sudo apt-get install -y g++-8 gfortran-8")
host.run_cmd(
"sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 100"
)
host.run_cmd(
"sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-8 100"
)
host.run_cmd(
"sudo update-alternatives --install /usr/bin/gfortran gfortran /usr/bin/gfortran-8 100"
)
if not use_conda:
print("Installing Cython + numpy from PyPI")
host.run_cmd("sudo pip3 install Cython")
@ -679,7 +669,7 @@ def build_domains(
branch: str = "main",
use_conda: bool = True,
git_clone_flags: str = "",
) -> Tuple[str, str, str, str]:
) -> tuple[str, str, str, str]:
vision_wheel_name = build_torchvision(
host, branch=branch, use_conda=use_conda, git_clone_flags=git_clone_flags
)
@ -706,7 +696,7 @@ def start_build(
pytorch_build_number: Optional[str] = None,
shallow_clone: bool = True,
enable_mkldnn: bool = False,
) -> Tuple[str, str, str, str, str]:
) -> tuple[str, str, str, str, str]:
git_clone_flags = " --depth 1 --shallow-submodules" if shallow_clone else ""
if host.using_docker() and not use_conda:
print("Auto-selecting conda option for docker images")
@ -757,7 +747,7 @@ def start_build(
version = host.check_output("cat pytorch/version.txt").strip()[:-2]
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version}.dev{build_date} PYTORCH_BUILD_NUMBER=1"
if branch.startswith(("v1.", "v2.")):
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1:branch.find('-')]} PYTORCH_BUILD_NUMBER=1"
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1 : branch.find('-')]} PYTORCH_BUILD_NUMBER=1"
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
if enable_mkldnn:
@ -930,9 +920,9 @@ def parse_arguments():
parser.add_argument("--debug", action="store_true")
parser.add_argument("--build-only", action="store_true")
parser.add_argument("--test-only", type=str)
parser.add_argument(
"--os", type=str, choices=list(os_amis.keys()), default="ubuntu20_04"
)
group = parser.add_mutually_exclusive_group()
group.add_argument("--os", type=str, choices=list(os_amis.keys()))
group.add_argument("--ami", type=str)
parser.add_argument(
"--python-version",
type=str,
@ -962,7 +952,13 @@ def parse_arguments():
if __name__ == "__main__":
args = parse_arguments()
ami = os_amis[args.os]
ami = (
args.ami
if args.ami is not None
else os_amis[args.os]
if args.os is not None
else ubuntu20_04_ami
)
keyfile_path, key_name = compute_keyfile_path(args.key_name)
if args.list_instances:
@ -1016,7 +1012,7 @@ if __name__ == "__main__":
install_condaforge_python(host, args.python_version)
sys.exit(0)
python_version = args.python_version if args.python_version is not None else "3.8"
python_version = args.python_version if args.python_version is not None else "3.9"
if args.use_torch_from_pypi:
configure_system(host, compiler=args.compiler, python_version=python_version)
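The `--os`/`--ami` change above makes the two flags mutually exclusive and resolves the AMI with a three-step precedence: explicit `--ami`, then the `--os` lookup, then the Ubuntu 20.04 default. A self-contained sketch of that resolution follows. The AMI IDs are placeholders rather than real images, and `resolve_ami` is a hypothetical wrapper.

```python
import argparse

# Placeholder IDs for illustration only; the real table is os_amis in the script.
OS_AMIS = {
    "ubuntu20_04": "ami-EXAMPLE-2004",
    "ubuntu22_04": "ami-EXAMPLE-2204",
}
DEFAULT_AMI = OS_AMIS["ubuntu20_04"]


def resolve_ami(argv: list[str]) -> str:
    """Resolve the AMI: --ami wins, then --os, then the default image."""
    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--os", type=str, choices=list(OS_AMIS))
    group.add_argument("--ami", type=str)
    args = parser.parse_args(argv)
    if args.ami is not None:
        return args.ami
    if args.os is not None:
        return OS_AMIS[args.os]
    return DEFAULT_AMI


if __name__ == "__main__":
    print(resolve_ami(["--os", "ubuntu22_04"]))
```

Passing both flags makes argparse exit with a usage error, so the explicit `if` chain never has to decide between them.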

View File

@ -10,5 +10,3 @@ example: `py2-cuda9.0-cudnn7-ubuntu16.04`. The Docker images that are
built on Jenkins and are used in triggered builds already have this
environment variable set in their manifest. Also see
`./docker/jenkins/*/Dockerfile` and search for `BUILD_ENVIRONMENT`.
Our Jenkins installation is located at https://ci.pytorch.org/jenkins/.

View File

@ -13,10 +13,6 @@ if [[ "${BUILD_ENVIRONMENT}" == *-android* ]]; then
echo 'Skipping tests'
exit 0
fi
if [[ "${BUILD_ENVIRONMENT}" == *-rocm* ]]; then
# temporary to locate some kernel issues on the CI nodes
export HSAKMT_DEBUG_LEVEL=4
fi
# These additional packages are needed for circleci ROCm builds.
if [[ $BUILD_ENVIRONMENT == *rocm* ]]; then
# Need networkx 2.0 because bellman_ford was moved in 2.1. Scikit-image by

View File

@ -34,5 +34,5 @@ See `build.sh` for valid build environments (it's the giant switch).
./build.sh pytorch-linux-bionic-py3.8-gcc9 -t myimage:latest
# Set flags (see build.sh) and build image
sudo bash -c 'PROTOBUF=1 ./build.sh pytorch-linux-bionic-py3.8-gcc9 -t myimage:latest'
sudo bash -c 'TRITON=1 ./build.sh pytorch-linux-bionic-py3.8-gcc9 -t myimage:latest'
```

View File

@ -1,5 +1,6 @@
ARG CUDA_VERSION=12.4
ARG BASE_TARGET=cuda${CUDA_VERSION}
ARG ROCM_IMAGE=rocm/dev-almalinux-8:6.3-complete
FROM amd64/almalinux:8 as base
ENV LC_ALL en_US.UTF-8
@ -8,10 +9,6 @@ ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=11
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN yum -y update
RUN yum -y install epel-release
RUN yum install -y sudo wget curl perl util-linux xz bzip2 git patch which perl zlib-devel openssl-devel yum-utils autoconf automake make gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
@ -41,9 +38,12 @@ RUN bash ./install_conda.sh && rm install_conda.sh
# Install CUDA
FROM base as cuda
ARG CUDA_VERSION=12.4
ARG CUDA_VERSION=12.6
RUN rm -rf /usr/local/cuda-*
ADD ./common/install_cuda.sh install_cuda.sh
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
COPY ./common/install_cusparselt.sh install_cusparselt.sh
ENV CUDA_HOME=/usr/local/cuda-${CUDA_VERSION}
# Preserve CUDA_VERSION for the builds
ENV CUDA_VERSION=${CUDA_VERSION}
@ -54,18 +54,20 @@ FROM cuda as cuda11.8
RUN bash ./install_cuda.sh 11.8
ENV DESIRED_CUDA=11.8
FROM cuda as cuda12.1
RUN bash ./install_cuda.sh 12.1
ENV DESIRED_CUDA=12.1
FROM cuda as cuda12.4
RUN bash ./install_cuda.sh 12.4
ENV DESIRED_CUDA=12.4
FROM cuda as cuda12.6
RUN bash ./install_cuda.sh 12.6
ENV DESIRED_CUDA=12.6
FROM cuda as cuda12.8
RUN bash ./install_cuda.sh 12.8
ENV DESIRED_CUDA=12.8
FROM ${ROCM_IMAGE} as rocm
ENV PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
ENV MKLROOT /opt/intel
# Install MNIST test data
FROM base as mnist
ADD ./common/install_mnist.sh install_mnist.sh
@ -73,9 +75,8 @@ RUN bash ./install_mnist.sh
FROM base as all_cuda
COPY --from=cuda11.8 /usr/local/cuda-11.8 /usr/local/cuda-11.8
COPY --from=cuda12.1 /usr/local/cuda-12.1 /usr/local/cuda-12.1
COPY --from=cuda12.4 /usr/local/cuda-12.4 /usr/local/cuda-12.4
COPY --from=cuda12.6 /usr/local/cuda-12.6 /usr/local/cuda-12.6
COPY --from=cuda12.8 /usr/local/cuda-12.8 /usr/local/cuda-12.8
# Final step
FROM ${BASE_TARGET} as final

View File

@ -1,82 +1,70 @@
#!/usr/bin/env bash
# Script used only in CD pipeline
set -eou pipefail
set -exou pipefail
image="$1"
shift
if [ -z "${image}" ]; then
echo "Usage: $0 IMAGE"
echo "Usage: $0 IMAGENAME:ARCHTAG"
exit 1
fi
DOCKER_IMAGE_NAME="pytorch/${image}"
# Go from imagename:tag to tag
DOCKER_TAG_PREFIX=$(echo "${image}" | awk -F':' '{print $2}')
CUDA_VERSION=""
ROCM_VERSION=""
EXTRA_BUILD_ARGS=""
if [[ "${DOCKER_TAG_PREFIX}" == cuda* ]]; then
# extract cuda version from image name and tag. e.g. manylinux2_28-builder:cuda12.8 returns 12.8
CUDA_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'cuda' '{print $2}')
EXTRA_BUILD_ARGS="--build-arg CUDA_VERSION=${CUDA_VERSION}"
elif [[ "${DOCKER_TAG_PREFIX}" == rocm* ]]; then
# extract rocm version from image name and tag. e.g. manylinux2_28-builder:rocm6.2.4 returns 6.2.4
ROCM_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'rocm' '{print $2}')
EXTRA_BUILD_ARGS="--build-arg ROCM_IMAGE=rocm/dev-almalinux-8:${ROCM_VERSION}-complete"
fi
export DOCKER_BUILDKIT=1
TOPDIR=$(git rev-parse --show-toplevel)
CUDA_VERSION=${CUDA_VERSION:-12.1}
case ${CUDA_VERSION} in
case ${DOCKER_TAG_PREFIX} in
cpu)
BASE_TARGET=base
DOCKER_TAG=cpu
;;
all)
BASE_TARGET=all_cuda
DOCKER_TAG=latest
cuda*)
BASE_TARGET=cuda${CUDA_VERSION}
;;
rocm*)
BASE_TARGET=rocm
;;
*)
BASE_TARGET=cuda${CUDA_VERSION}
DOCKER_TAG=cuda${CUDA_VERSION}
echo "ERROR: Unknown docker tag ${DOCKER_TAG_PREFIX}"
exit 1
;;
esac
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
(
set -x
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
export DOCKER_BUILDKIT=1
TOPDIR=$(git rev-parse --show-toplevel)
tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
docker build \
--target final \
--progress plain \
--build-arg "BASE_TARGET=${BASE_TARGET}" \
--build-arg "CUDA_VERSION=${CUDA_VERSION}" \
--build-arg "DEVTOOLSET_VERSION=11" \
-t ${DOCKER_IMAGE_NAME} \
$@ \
-f "${TOPDIR}/.ci/docker/almalinux/Dockerfile" \
${TOPDIR}/.ci/docker/
)
docker build \
--target final \
--progress plain \
--build-arg "BASE_TARGET=${BASE_TARGET}" \
--build-arg "DEVTOOLSET_VERSION=11" \
${EXTRA_BUILD_ARGS} \
-t ${tmp_tag} \
$@ \
-f "${TOPDIR}/.ci/docker/almalinux/Dockerfile" \
${TOPDIR}/.ci/docker/
if [[ "${DOCKER_TAG}" =~ ^cuda* ]]; then
if [ -n "${CUDA_VERSION}" ]; then
# Test that we're using the right CUDA compiler
(
set -x
docker run --rm "${DOCKER_IMAGE_NAME}" nvcc --version | grep "cuda_${CUDA_VERSION}"
)
fi
GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}
GIT_BRANCH_NAME=${GITHUB_REF##*/}
GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}
DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE_NAME}-${GIT_BRANCH_NAME}
DOCKER_IMAGE_SHA_TAG=${DOCKER_IMAGE_NAME}-${GIT_COMMIT_SHA}
if [[ "${WITH_PUSH:-}" == true ]]; then
(
set -x
docker push "${DOCKER_IMAGE_NAME}"
if [[ -n ${GITHUB_REF} ]]; then
docker tag ${DOCKER_IMAGE_NAME} ${DOCKER_IMAGE_BRANCH_TAG}
docker tag ${DOCKER_IMAGE_NAME} ${DOCKER_IMAGE_SHA_TAG}
docker push "${DOCKER_IMAGE_BRANCH_TAG}"
docker push "${DOCKER_IMAGE_SHA_TAG}"
fi
)
docker run --rm "${tmp_tag}" nvcc --version | grep "cuda_${CUDA_VERSION}"
fi
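The rewritten script derives everything from the `IMAGENAME:ARCHTAG` argument: the tag after the colon selects the Dockerfile target, and a `cuda` or `rocm` prefix also yields a version that is passed as a build argument. The sketch below restates that mapping in Python purely for illustration. The script itself is shell, and `parse_image` is a hypothetical name.

```python
def parse_image(image: str) -> dict[str, str]:
    """Map IMAGENAME:ARCHTAG to the Dockerfile target and extra build args."""
    _, _, tag = image.partition(":")
    if tag == "cpu":
        return {"base_target": "base", "extra_build_args": ""}
    if tag.startswith("cuda"):
        version = tag[len("cuda"):]  # "cuda12.8" -> "12.8"
        return {
            "base_target": f"cuda{version}",
            "extra_build_args": f"--build-arg CUDA_VERSION={version}",
        }
    if tag.startswith("rocm"):
        version = tag[len("rocm"):]  # "rocm6.2.4" -> "6.2.4"
        return {
            "base_target": "rocm",
            "extra_build_args": (
                f"--build-arg ROCM_IMAGE=rocm/dev-almalinux-8:{version}-complete"
            ),
        }
    # Mirrors the script's catch-all branch, which prints an error and exits.
    raise ValueError(f"Unknown docker tag {tag}")


if __name__ == "__main__":
    print(parse_image("manylinux2_28-builder:cuda12.8"))
```

An image argument without a colon produces an empty tag and therefore lands in the error branch, matching the script's `*)` case.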

View File

@ -1,5 +0,0 @@
0.8b
manylinux_2_28
rocm6.2
6f8cbcac8a92775291bb1ba8f514d4beb350baf4
e938def5d32869fe2e00aec0300f354c9f157867bebdf2e104d732b94cb238d8

View File

@ -1,4 +1,8 @@
#!/bin/bash
# The purpose of this script is to:
# 1. Extract the set of parameters to be used for a docker build based on the provided image name.
# 2. Run docker build with the parameters found in step 1.
# 3. Run the built image and print out the expected and actual versions of packages installed.
set -ex
@ -81,42 +85,28 @@ elif [[ "$image" == *linter* ]]; then
DOCKERFILE="linter/Dockerfile"
fi
# CMake 3.18 is needed to support CUDA17 language variant
CMAKE_VERSION=3.18.5
_UCX_COMMIT=7bb2722ff2187a0cad557ae4a6afa090569f83fb
_UCC_COMMIT=20eae37090a4ce1b32bcce6144ccad0b49943e0b
if [[ "$image" == *rocm* ]]; then
_UCX_COMMIT=cc312eaa4655c0cc5c2bcd796db938f90563bcf6
_UCC_COMMIT=0c0fc21559835044ab107199e334f7157d6a0d3d
fi
tag=$(echo $image | awk -F':' '{print $2}')
# It's annoying to rename jobs every time you want to rewrite a
# configuration, so we hardcode everything here rather than do it
# from scratch
case "$image" in
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.1
case "$tag" in
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
GCC_VERSION=11
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9)
CUDA_VERSION=12.1.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks)
@ -124,43 +114,10 @@ case "$image" in
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.1.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.1-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.1.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
@ -169,13 +126,10 @@ case "$image" in
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
@ -184,13 +138,57 @@ case "$image" in
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
@ -199,149 +197,81 @@ case "$image" in
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9)
CUDA_VERSION=12.1.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-py3-clang10-onnx)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=10
PROTOBUF=yes
DB=yes
VISION=yes
CONDA_CMAKE=yes
ONNX=yes
;;
pytorch-linux-focal-py3.9-clang10)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=10
PROTOBUF=yes
DB=yes
VISION=yes
VULKAN_SDK_VERSION=1.2.162.1
SWIFTSHADER=yes
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-py3.11-clang10)
ANACONDA_PYTHON_VERSION=3.11
CLANG_VERSION=10
PROTOBUF=yes
DB=yes
VISION=yes
VULKAN_SDK_VERSION=1.2.162.1
SWIFTSHADER=yes
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-py3.9-gcc9)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-rocm-n-1-py3)
pytorch-linux-jammy-rocm-n-1-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=6.1
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-rocm-n-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=6.2.4
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-xpu-2024.0-py3)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
XPU_VERSION=0.5
ROCM_VERSION=6.3
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-rocm-n-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
ROCM_VERSION=6.4
NINJA_VERSION=1.9.0
TRITON=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-xpu-2025.0-py3)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
XPU_VERSION=2025.0
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-xpu-2025.1-py3)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
VISION=yes
XPU_VERSION=2025.1
NINJA_VERSION=1.9.0
TRITON=yes
;;
pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
CONDA_CMAKE=yes
TRITON=yes
DOCS=yes
INDUCTOR_BENCHMARKS=yes
@ -351,40 +281,30 @@ case "$image" in
CUDA_VERSION=11.8
CUDNN_VERSION=9
CLANG_VERSION=12
PROTOBUF=yes
DB=yes
VISION=yes
TRITON=yes
;;
pytorch-linux-jammy-py3-clang12-asan)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=12
PROTOBUF=yes
DB=yes
VISION=yes
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-py3-clang15-asan)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=15
CONDA_CMAKE=yes
VISION=yes
;;
pytorch-linux-jammy-py3-clang18-asan)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=18
CONDA_CMAKE=yes
VISION=yes
;;
pytorch-linux-jammy-py3.9-gcc11)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
CONDA_CMAKE=yes
TRITON=yes
DOCS=yes
UNINSTALL_DILL=yes
@ -392,44 +312,36 @@ case "$image" in
pytorch-linux-jammy-py3-clang12-executorch)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=12
CONDA_CMAKE=yes
EXECUTORCH=yes
;;
pytorch-linux-jammy-py3.12-halide)
CUDA_VERSION=12.4
CUDA_VERSION=12.6
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
CONDA_CMAKE=yes
HALIDE=yes
TRITON=yes
;;
pytorch-linux-jammy-py3.12-triton-cpu)
CUDA_VERSION=12.4
CUDA_VERSION=12.6
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
CONDA_CMAKE=yes
TRITON_CPU=yes
;;
pytorch-linux-focal-linter)
# TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.
# We will need to update mypy version eventually, but that's for another day. The task
# would be to upgrade mypy to 1.0.0 with Python 3.11
ANACONDA_PYTHON_VERSION=3.9
CONDA_CMAKE=yes
PYTHON_VERSION=3.9
;;
pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter)
ANACONDA_PYTHON_VERSION=3.9
PYTHON_VERSION=3.9
CUDA_VERSION=11.8
CONDA_CMAKE=yes
;;
pytorch-linux-jammy-aarch64-py3.10-gcc11)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
ACL=yes
PROTOBUF=yes
DB=yes
VISION=yes
CONDA_CMAKE=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
@ -438,10 +350,7 @@ case "$image" in
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
ACL=yes
PROTOBUF=yes
DB=yes
VISION=yes
CONDA_CMAKE=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
@ -449,8 +358,6 @@ case "$image" in
;;
*)
# Catch-all for builds that are not hardcoded.
PROTOBUF=yes
DB=yes
VISION=yes
echo "image '$image' did not match an existing build configuration"
if [[ "$image" == *py* ]]; then
@ -466,8 +373,7 @@ case "$image" in
TRITON=yes
# To ensure that any ROCm config will build using conda cmake
# and thus have LAPACK/MKL enabled
CONDA_CMAKE=yes
fi
fi
if [[ "$image" == *centos7* ]]; then
NINJA_VERSION=1.10.2
fi
@ -483,9 +389,6 @@ case "$image" in
if [[ "$image" == *glibc* ]]; then
extract_version_from_image_name glibc GLIBC_VERSION
fi
if [[ "$image" == *cmake* ]]; then
extract_version_from_image_name cmake CMAKE_VERSION
fi
;;
esac
@ -499,14 +402,20 @@ if [[ "$image" == *cuda* && ${OS} == "ubuntu" ]]; then
fi
fi
no_cache_flag=""
progress_flag=""
# In CI, disable the build cache and use plain progress output
if [[ -n "${CI:-}" ]]; then
no_cache_flag="--no-cache"
progress_flag="--progress=plain"
fi
# Build image
docker build \
--no-cache \
--progress=plain \
${no_cache_flag} \
${progress_flag} \
--build-arg "BUILD_ENVIRONMENT=${image}" \
--build-arg "PROTOBUF=${PROTOBUF:-}" \
--build-arg "LLVMDEV=${LLVMDEV:-}" \
--build-arg "DB=${DB:-}" \
--build-arg "VISION=${VISION:-}" \
--build-arg "UBUNTU_VERSION=${UBUNTU_VERSION}" \
--build-arg "CENTOS_VERSION=${CENTOS_VERSION}" \
@ -514,22 +423,19 @@ docker build \
--build-arg "GLIBC_VERSION=${GLIBC_VERSION}" \
--build-arg "CLANG_VERSION=${CLANG_VERSION}" \
--build-arg "ANACONDA_PYTHON_VERSION=${ANACONDA_PYTHON_VERSION}" \
--build-arg "PYTHON_VERSION=${PYTHON_VERSION}" \
--build-arg "GCC_VERSION=${GCC_VERSION}" \
--build-arg "CUDA_VERSION=${CUDA_VERSION}" \
--build-arg "CUDNN_VERSION=${CUDNN_VERSION}" \
--build-arg "TENSORRT_VERSION=${TENSORRT_VERSION}" \
--build-arg "GRADLE_VERSION=${GRADLE_VERSION}" \
--build-arg "VULKAN_SDK_VERSION=${VULKAN_SDK_VERSION}" \
--build-arg "SWIFTSHADER=${SWIFTSHADER}" \
--build-arg "CMAKE_VERSION=${CMAKE_VERSION:-}" \
--build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \
--build-arg "KATEX=${KATEX:-}" \
--build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \
--build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx90a}" \
--build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx90a;gfx942}" \
--build-arg "IMAGE_NAME=${IMAGE_NAME}" \
--build-arg "UCX_COMMIT=${UCX_COMMIT}" \
--build-arg "UCC_COMMIT=${UCC_COMMIT}" \
--build-arg "CONDA_CMAKE=${CONDA_CMAKE}" \
--build-arg "TRITON=${TRITON}" \
--build-arg "TRITON_CPU=${TRITON_CPU}" \
--build-arg "ONNX=${ONNX}" \
@ -538,6 +444,7 @@ docker build \
--build-arg "EXECUTORCH=${EXECUTORCH}" \
--build-arg "HALIDE=${HALIDE}" \
--build-arg "XPU_VERSION=${XPU_VERSION}" \
--build-arg "UNINSTALL_DILL=${UNINSTALL_DILL}" \
--build-arg "ACL=${ACL:-}" \
--build-arg "SKIP_SCCACHE_INSTALL=${SKIP_SCCACHE_INSTALL:-}" \
--build-arg "SKIP_LLVM_SRC_BUILD_INSTALL=${SKIP_LLVM_SRC_BUILD_INSTALL:-}" \
@ -555,7 +462,7 @@ docker build \
UBUNTU_VERSION=$(echo ${UBUNTU_VERSION} | sed 's/-rc$//')
function drun() {
docker run --rm "$tmp_tag" $*
docker run --rm "$tmp_tag" "$@"
}
if [[ "$OS" == "ubuntu" ]]; then
@ -603,3 +510,23 @@ if [ -n "$KATEX" ]; then
exit 1
fi
fi
HAS_TRITON=$(drun python -c "import triton" > /dev/null 2>&1 && echo "yes" || echo "no")
if [[ -n "$TRITON" || -n "$TRITON_CPU" ]]; then
if [ "$HAS_TRITON" = "no" ]; then
echo "expecting triton to be installed, but it is not"
exit 1
fi
elif [ "$HAS_TRITON" = "yes" ]; then
echo "expecting triton to not be installed, but it is"
exit 1
fi
# Sanity check cmake version. Executorch reinstalls cmake and I'm not sure if
# they support 4.0.0 yet, so exclude them from this check.
CMAKE_VERSION=$(drun cmake --version)
if [[ "$EXECUTORCH" != *yes* && "$CMAKE_VERSION" != *4.* ]]; then
echo "CMake version is not 4.0.0:"
drun cmake --version
exit 1
fi
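The CMake sanity check above uses the glob `*4.*`, which only tests whether the `cmake --version` output contains the characters `4.` anywhere. The sketch below compares that glob with a stricter parse of the major version. It is a Python illustration rather than a proposed change to the shell script. The function names are hypothetical, and the regular expression assumes the first line has the usual `cmake version X.Y.Z` form.

```python
import fnmatch
import re


def glob_check_passes(version_output: str) -> bool:
    """Python equivalent of the shell glob test: output contains '4.'."""
    return fnmatch.fnmatchcase(version_output, "*4.*")


def cmake_major(version_output: str) -> int:
    """Extract the major version from `cmake --version` output."""
    match = re.search(r"cmake version (\d+)\.", version_output)
    if match is None:
        raise ValueError("unrecognized cmake --version output")
    return int(match.group(1))


if __name__ == "__main__":
    sample = "cmake version 3.24.1"
    # The glob accepts this string because "24.1" contains "4.".
    print(glob_check_passes(sample), cmake_major(sample))
```

A version such as 3.24.1 therefore passes the glob even though its major version is 3; comparing the parsed major number avoids that false positive.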

View File

@ -17,9 +17,8 @@ RUN bash ./install_base.sh && rm install_base.sh
# Update CentOS git version
RUN yum -y remove git
RUN yum -y remove git-*
RUN yum -y install https://packages.endpoint.com/rhel/7/os/x86_64/endpoint-repo-1.9-1.x86_64.rpm || \
(yum -y install https://packages.endpointdev.com/rhel/7/os/x86_64/endpoint-repo-1.9-1.x86_64.rpm && \
sed -i "s/packages.endpoint/packages.endpointdev/" /etc/yum.repos.d/endpoint.repo)
RUN yum -y install https://packages.endpointdev.com/rhel/7/os/x86_64/endpoint-repo-1.9-1.x86_64.rpm && \
sed -i 's/packages.endpoint/packages.endpointdev/' /etc/yum.repos.d/endpoint.repo
RUN yum install -y git
# Install devtoolset
@ -40,7 +39,6 @@ RUN bash ./install_user.sh && rm install_user.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
@ -48,20 +46,6 @@ COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
# (optional) Install protobuf for ONNX
ARG PROTOBUF
COPY ./common/install_protobuf.sh install_protobuf.sh
RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi
RUN rm install_protobuf.sh
ENV INSTALLED_PROTOBUF ${PROTOBUF}
# (optional) Install database packages like LMDB and LevelDB
ARG DB
COPY ./common/install_db.sh install_db.sh
RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
RUN rm install_db.sh
ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
@ -75,7 +59,7 @@ COPY ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh
RUN rm install_rocm.sh
COPY ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh ${ROCM_VERSION}
RUN rm install_rocm_magma.sh
COPY ./common/install_amdsmi.sh install_amdsmi.sh
RUN bash ./install_amdsmi.sh
@ -89,12 +73,6 @@ ENV MAGMA_HOME /opt/rocm/magma
ENV LANG en_US.utf8
ENV LC_ALL en_US.utf8
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
COPY ./common/install_ninja.sh install_ninja.sh
@ -113,13 +91,6 @@ COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt
# Install AOTriton (Early fail)
COPY ./aotriton_version.txt aotriton_version.txt
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN ["/bin/bash", "-c", "./install_aotriton.sh /opt/rocm && rm -rf install_aotriton.sh aotriton_version.txt common_utils.sh"]
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH

View File

@ -1 +1 @@
6f638937d64e3396793956d75ee3e14802022745
b173722085b3f555d6ba4533d6bbaddfd7c71144

View File

@ -0,0 +1 @@
v2.21.5-1

View File

@ -0,0 +1 @@
v2.26.5-1

View File

@ -1 +1 @@
ac3470188b914c5d7a5058a7e28b9eb685a62427
5d535d7a2d4b435b1b5c1177fd8f04a12b942b9a

View File

@ -1 +1 @@
e98b6fcb8df5b44eb0d0addb6767c573d37ba024
0bcc8265e677e5321606a3311bf71470f14456a8

View File

@ -1 +1 @@
35c6c7c6284582b3f41c71c150e11b517acf074a
96316ce50fade7e209553aba4898cd9b82aab83b

View File

@ -1,7 +1,7 @@
set -euo pipefail
readonly version=v24.04
readonly src_host=https://review.mlplatform.org/ml
readonly version=v25.02
readonly src_host=https://github.com/ARM-software
readonly src_repo=ComputeLibrary
# Clone ACL

View File

@ -1,23 +0,0 @@
#!/bin/bash
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
TARBALL='aotriton.tar.gz'
# This read command always returns with exit code 1
read -d "\n" VER MANYLINUX ROCMBASE PINNED_COMMIT SHA256 < aotriton_version.txt || true
ARCH=$(uname -m)
AOTRITON_INSTALL_PREFIX="$1"
AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}-shared.tar.gz"
cd "${AOTRITON_INSTALL_PREFIX}"
# Must use -L to follow redirects
curl -L --retry 3 -o "${TARBALL}" "${AOTRITON_URL}"
ACTUAL_SHA256=$(sha256sum "${TARBALL}" | cut -d " " -f 1)
if [ "${SHA256}" != "${ACTUAL_SHA256}" ]; then
echo -n "Error: The SHA256 of downloaded tarball is ${ACTUAL_SHA256},"
echo " which does not match the expected value ${SHA256}."
exit
fi
tar xf "${TARBALL}" && rm -rf "${TARBALL}"
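
In the removed script above, a checksum mismatch ends with a bare `exit`, which returns the status of the preceding `echo` (zero), so a caller would not see the failure. A minimal sketch of a check that returns a non-zero status instead; `verify_sha256` is a hypothetical helper name, and `sha256sum` is assumed to be available, as it is in the original script.

```shell
# Hypothetical helper: compare a file's SHA256 digest with an expected value.
# $1: path to the file, $2: expected hex digest.
# Returns 0 on match, 1 on mismatch or if the digest cannot be computed.
verify_sha256() {
  local actual
  actual=$(sha256sum "$1" | cut -d ' ' -f 1) || return 1
  if [ "$actual" != "$2" ]; then
    echo "Error: SHA256 of $1 is $actual, expected $2" >&2
    return 1
  fi
}
```

A caller could then write `verify_sha256 "${TARBALL}" "${SHA256}" || exit 1` so that the build stops on a bad download.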

View File

@ -32,8 +32,12 @@ install_ubuntu() {
# HACK: UCC testing relies on libnccl library from NVIDIA repo, and version 2.16 crashes
# See https://github.com/pytorch/pytorch/pull/105260#issuecomment-1673399729
# TODO: Eliminate this hack, we should not rely on apt-get installation
# See https://github.com/pytorch/pytorch/issues/144768
if [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "11.8"* ]]; then
maybe_libnccl_dev="libnccl2=2.15.5-1+cuda11.8 libnccl-dev=2.15.5-1+cuda11.8 --allow-downgrades --allow-change-held-packages"
elif [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "12.4"* ]]; then
maybe_libnccl_dev="libnccl2=2.26.2-1+cuda12.4 libnccl-dev=2.26.2-1+cuda12.4 --allow-downgrades --allow-change-held-packages"
else
maybe_libnccl_dev=""
fi
@ -95,9 +99,6 @@ install_centos() {
ccache_deps="asciidoc docbook-dtds docbook-style-xsl libxslt"
numpy_deps="gcc-gfortran"
# Note: protobuf-c-{compiler,devel} on CentOS are too old to be used
# for Caffe2. That said, we still install them to make sure the build
# system opts to build/use protoc and libprotobuf from third-party.
yum install -y \
$ccache_deps \
$numpy_deps \

View File

@ -9,7 +9,7 @@ install_ubuntu() {
# Instead use lib and headers from OpenSSL1.1 installed in `install_openssl.sh`
apt-get install -y cargo
echo "Checking out sccache repo"
git clone https://github.com/mozilla/sccache -b v0.8.2
git clone https://github.com/mozilla/sccache -b v0.10.0
cd sccache
echo "Building sccache"
cargo build --release
@ -36,11 +36,7 @@ sed -e 's|PATH="\(.*\)"|PATH="/opt/cache/bin:\1"|g' -i /etc/environment
export PATH="/opt/cache/bin:$PATH"
# Setup compiler cache
if [ -n "$ROCM_VERSION" ]; then
curl --retry 3 http://repo.radeon.com/misc/.sccache_amd/sccache -o /opt/cache/bin/sccache
else
install_ubuntu
fi
install_ubuntu
chmod a+x /opt/cache/bin/sccache
function write_sccache_stub() {

View File

@ -4,16 +4,10 @@ set -ex
if [ -n "$CLANG_VERSION" ]; then
if [[ $CLANG_VERSION == 9 && $UBUNTU_VERSION == 18.04 ]]; then
sudo apt-get update
# gpg-agent is not available by default on 18.04
sudo apt-get install -y --no-install-recommends gpg-agent
wget --no-check-certificate -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
apt-add-repository "deb http://apt.llvm.org/bionic/ llvm-toolchain-bionic-${CLANG_VERSION} main"
elif [[ $UBUNTU_VERSION == 22.04 ]]; then
if [[ $UBUNTU_VERSION == 22.04 ]]; then
# work around ubuntu apt-get conflicts
sudo apt-get -y -f install
wget --no-check-certificate -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
wget --no-check-certificate -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
if [[ $CLANG_VERSION == 18 ]]; then
apt-add-repository "deb http://apt.llvm.org/jammy/ llvm-toolchain-jammy-18 main"
fi
@ -41,7 +35,7 @@ if [ -n "$CLANG_VERSION" ]; then
# clang's packaging is a little messed up (the runtime libs aren't
# added into the linker path), so give it a little help
clang_lib=("/usr/lib/llvm-$CLANG_VERSION/lib/clang/"*"/lib/linux")
echo "$clang_lib" > /etc/ld.so.conf.d/clang.conf
echo "$clang_lib" >/etc/ld.so.conf.d/clang.conf
ldconfig
# Cleanup package manager

View File

@ -1,31 +0,0 @@
#!/bin/bash
set -ex
[ -n "$CMAKE_VERSION" ]
# Remove system cmake install so it won't get used instead
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
apt-get remove cmake -y
;;
centos)
yum remove cmake -y
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac
# Turn 3.6.3 into v3.6
path=$(echo "${CMAKE_VERSION}" | sed -e 's/\([0-9].[0-9]\+\).*/v\1/')
file="cmake-${CMAKE_VERSION}-Linux-x86_64.tar.gz"
# Download and install specific CMake version in /usr/local
pushd /tmp
curl -Os --retry 3 "https://cmake.org/files/${path}/${file}"
tar -C /usr/local --strip-components 1 --no-same-owner -zxf cmake-*.tar.gz
rm -f cmake-*.tar.gz
popd
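
The `sed` expression in the removed script above maps a full CMake version to its release directory (3.6.3 becomes v3.6). A minimal sketch of the same mapping with shell parameter expansion; `cmake_release_dir` is a hypothetical name, and the input is assumed to contain at least a major and a minor component.

```shell
# Hypothetical helper: turn "3.6.3" into "v3.6".
cmake_release_dir() {
  local version="$1" major rest minor
  major=${version%%.*}   # "3"   (text before the first dot)
  rest=${version#*.}     # "6.3" (text after the first dot)
  minor=${rest%%.*}      # "6"
  echo "v${major}.${minor}"
}
```

Unlike the regular expression, this form treats dots literally, so it does not depend on `.` matching an arbitrary character.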

View File

@ -7,7 +7,7 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
BASE_URL="https://repo.anaconda.com/miniconda"
CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"
if [[ $(uname -m) == "aarch64" ]] || [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download"
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download" # @lint-ignore
CONDA_FILE="Miniforge3-Linux-$(uname -m).sh"
fi
@ -62,11 +62,11 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# libstdcxx from conda default channels are too old, we need GLIBCXX_3.4.30
# which is provided in libstdcxx 12 and up.
conda_install libstdcxx-ng=12.3.0 -c conda-forge
conda_install libstdcxx-ng=12.3.0 --update-deps -c conda-forge
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
if [[ $(uname -m) == "aarch64" ]]; then
conda_install "openblas==0.3.28=*openmp*"
conda_install "openblas==0.3.29=*openmp*"
else
conda_install "mkl=2021.4.0 mkl-include=2021.4.0"
fi
@ -75,19 +75,11 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# and libpython-static for torch deploy
conda_install llvmdev=8.0.0 "libpython-static=${ANACONDA_PYTHON_VERSION}"
# Use conda cmake in some cases. Conda cmake will be newer than our supported
# min version (3.5 for xenial and 3.10 for bionic), so we only do it in those
# following builds that we know should use conda. Specifically, Ubuntu bionic
# and focal cannot find conda mkl with stock cmake, so we need a cmake from conda
if [ -n "${CONDA_CMAKE}" ]; then
conda_install cmake
fi
# Magma package names are concatenation of CUDA major and minor ignoring revision
# I.e. magma-cuda102 package corresponds to CUDA_VERSION=10.2 and CUDA_VERSION=10.2.89
# Magma is installed from a tarball in the ossci-linux bucket into the conda env
if [ -n "$CUDA_VERSION" ]; then
${SCRIPT_FOLDER}/install_magma_conda.sh $(cut -f1-2 -d'.' <<< ${CUDA_VERSION}) ${ANACONDA_PYTHON_VERSION}
conda_run ${SCRIPT_FOLDER}/install_magma_conda.sh $(cut -f1-2 -d'.' <<< ${CUDA_VERSION})
fi
# Install some other packages, including those needed for Python test reporting
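
The Magma comment in this hunk describes a naming rule: the package name concatenates the CUDA major and minor numbers and ignores the revision. A minimal sketch of that rule, reusing the `cut -f1-2 -d'.'` step from the script; `magma_package_name` is a hypothetical helper and is not something the script defines.

```shell
# Hypothetical helper: "10.2.89" or "10.2" -> "magma-cuda102".
magma_package_name() {
  local major_minor
  major_minor=$(echo "$1" | cut -f1-2 -d'.')            # drop the revision
  echo "magma-cuda$(echo "$major_minor" | tr -d '.')"   # remove the dot
}
```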

View File

@ -3,11 +3,11 @@
set -uex -o pipefail
PYTHON_DOWNLOAD_URL=https://www.python.org/ftp/python
PYTHON_DOWNLOAD_GITHUB_BRANCH=https://github.com/python/cpython/archive/refs/heads
PYTHON_DOWNLOAD_GITHUB_BRANCH=https://github.com/python/cpython/archive/refs/heads # @lint-ignore
GET_PIP_URL=https://bootstrap.pypa.io/get-pip.py
# Python versions to be installed in /opt/$VERSION_NO
CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.8.1 3.9.0 3.10.1 3.11.0 3.12.0 3.13.0 3.13.0t"}
CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.9.0 3.10.1 3.11.0 3.12.0 3.13.0 3.13.0t"}
function check_var {
if [ -z "$1" ]; then
@ -70,7 +70,7 @@ function do_cpython_build {
# install setuptools since python 3.12 is required to use distutils
${prefix}/bin/pip install wheel==0.34.2 setuptools==68.2.2
local abi_tag=$(${prefix}/bin/python -c "from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag; print('{0}{1}-{2}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag()))")
ln -s ${prefix} /opt/python/${abi_tag}
ln -sf ${prefix} /opt/python/${abi_tag}
}
function build_cpython {

View File

@ -2,183 +2,82 @@
set -ex
NCCL_VERSION=v2.21.5-1
CUDNN_VERSION=9.5.1.17
arch_path=''
targetarch=${TARGETARCH:-$(uname -m)}
if [ ${targetarch} = 'amd64' ] || [ "${targetarch}" = 'x86_64' ]; then
arch_path='x86_64'
else
arch_path='sbsa'
fi
function install_cusparselt_040 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.4.0.7-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.4.0.7-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.4.0.7-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.4.0.7-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
function install_cuda {
version=$1
runfile=$2
major_minor=${version%.*}
rm -rf /usr/local/cuda-${major_minor} /usr/local/cuda
if [[ ${arch_path} == 'sbsa' ]]; then
runfile="${runfile}_sbsa"
fi
runfile="${runfile}.run"
wget -q https://developer.download.nvidia.com/compute/cuda/${version}/local_installers/${runfile} -O ${runfile}
chmod +x ${runfile}
./${runfile} --toolkit --silent
rm -f ${runfile}
rm -f /usr/local/cuda && ln -s /usr/local/cuda-${major_minor} /usr/local/cuda
}
function install_cusparselt_052 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.5.2.1-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.5.2.1-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.5.2.1-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.5.2.1-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_cusparselt_062 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.6.2.3-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.6.2.3-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.6.2.3-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.6.2.3-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_cusparselt_063 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.6.3.2-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.6.3.2-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.6.3.2-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.6.3.2-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
function install_cudnn {
cuda_major_version=$1
cudnn_version=$2
mkdir tmp_cudnn && cd tmp_cudnn
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
filepath="cudnn-linux-${arch_path}-${cudnn_version}_cuda${cuda_major_version}-archive"
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-${arch_path}/${filepath}.tar.xz
tar xf ${filepath}.tar.xz
cp -a ${filepath}/include/* /usr/local/cuda/include/
cp -a ${filepath}/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
}
function install_118 {
CUDNN_VERSION=9.1.0.70
echo "Installing CUDA 11.8 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.4.0"
rm -rf /usr/local/cuda-11.8 /usr/local/cuda
# install CUDA 11.8.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
chmod +x cuda_11.8.0_520.61.05_linux.run
./cuda_11.8.0_520.61.05_linux.run --toolkit --silent
rm -f cuda_11.8.0_520.61.05_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-11.8 /usr/local/cuda
echo "Installing CUDA 11.8 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.4.0"
install_cuda 11.8.0 cuda_11.8.0_520.61.05_linux
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
install_cudnn 11 $CUDNN_VERSION
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
CUDA_VERSION=11.8 bash install_nccl.sh
install_cusparselt_040
ldconfig
}
function install_121 {
echo "Installing CUDA 12.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
rm -rf /usr/local/cuda-12.1 /usr/local/cuda
# install CUDA 12.1.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run
chmod +x cuda_12.1.1_530.30.02_linux.run
./cuda_12.1.1_530.30.02_linux.run --toolkit --silent
rm -f cuda_12.1.1_530.30.02_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.1 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_052
CUDA_VERSION=11.8 bash install_cusparselt.sh
ldconfig
}
function install_124 {
CUDNN_VERSION=9.1.0.70
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run
chmod +x cuda_12.4.1_550.54.15_linux.run
./cuda_12.4.1_550.54.15_linux.run --toolkit --silent
rm -f cuda_12.4.1_550.54.15_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.2"
install_cuda 12.4.1 cuda_12.4.1_550.54.15_linux
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
install_cudnn 12 $CUDNN_VERSION
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
CUDA_VERSION=12.4 bash install_nccl.sh
install_cusparselt_062
CUDA_VERSION=12.4 bash install_cusparselt.sh
ldconfig
}
function install_126 {
echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"
rm -rf /usr/local/cuda-12.6 /usr/local/cuda
# install CUDA 12.6.3 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux.run
chmod +x cuda_12.6.3_560.35.05_linux.run
./cuda_12.6.3_560.35.05_linux.run --toolkit --silent
rm -f cuda_12.6.3_560.35.05_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.6 /usr/local/cuda
CUDNN_VERSION=9.5.1.17
echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.3"
install_cuda 12.6.3 cuda_12.6.3_560.35.05_linux
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
install_cudnn 12 $CUDNN_VERSION
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
CUDA_VERSION=12.6 bash install_nccl.sh
install_cusparselt_063
CUDA_VERSION=12.6 bash install_cusparselt.sh
ldconfig
}
@ -214,37 +113,6 @@ function prune_118 {
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2022.3.0 $CUDA_BASE/nsight-systems-2022.4.2/
}
function prune_121 {
echo "Pruning CUDA 12.1"
#####################################################################################
# CUDA 12.1 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.1/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.1/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.1 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.1/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2023.1.0 $CUDA_BASE/nsight-systems-2023.1.2/
}
function prune_124 {
echo "Pruning CUDA 12.4"
#####################################################################################
@ -313,18 +181,34 @@ function prune_126 {
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.3.2 $CUDA_BASE/nsight-systems-2024.5.1/
}
function install_128 {
CUDNN_VERSION=9.8.0.87
echo "Installing CUDA 12.8.0 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.3"
# install CUDA 12.8.0 in the same container
install_cuda 12.8.0 cuda_12.8.0_570.86.10_linux
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
install_cudnn 12 $CUDNN_VERSION
CUDA_VERSION=12.8 bash install_nccl.sh
CUDA_VERSION=12.8 bash install_cusparselt.sh
ldconfig
}
# idiomatic parameter and option handling in sh
while test $# -gt 0
do
case "$1" in
11.8) install_118; prune_118
;;
12.1) install_121; prune_121
;;
12.4) install_124; prune_124
;;
12.6) install_126; prune_126
;;
12.8) install_128;
;;
*) echo "bad argument $1"; exit 1
;;
esac
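
The new `install_cuda` helper in this file derives three values from its inputs: the architecture path, the installer file name, and the install directory. A minimal sketch of those derivations as pure functions follows, under the assumption that they mirror the logic shown in the hunk. The function names are hypothetical, and nothing is downloaded or installed.

```shell
# Hypothetical helpers that mirror the string handling in install_cuda.

# amd64 and x86_64 map to "x86_64"; anything else is treated as "sbsa".
cuda_arch_path() {
  case "$1" in
    amd64|x86_64) echo "x86_64" ;;
    *)            echo "sbsa" ;;
  esac
}

# $1: installer base name without extension, $2: arch path.
cuda_runfile_name() {
  local runfile="$1"
  if [ "$2" = "sbsa" ]; then
    runfile="${runfile}_sbsa"
  fi
  echo "${runfile}.run"
}

# "12.8.0" -> "/usr/local/cuda-12.8" (strips the last version component).
cuda_install_dir() {
  echo "/usr/local/cuda-${1%.*}"
}
```

Keeping the string handling separate from the download step makes it possible to check the generated names without network access.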

View File

@ -1,175 +0,0 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
NCCL_VERSION=v2.21.5-1
CUDNN_VERSION=9.5.1.17
function install_cusparselt_062 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.6.2.3-archive.tar.xz
tar xf libcusparse_lt-linux-sbsa-0.6.2.3-archive.tar.xz
cp -a libcusparse_lt-linux-sbsa-0.6.2.3-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-sbsa-0.6.2.3-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_cusparselt_063 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.6.3.2-archive.tar.xz
tar xf libcusparse_lt-linux-sbsa-0.6.3.2-archive.tar.xz
cp -a libcusparse_lt-linux-sbsa-0.6.3.2-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-sbsa-0.6.3.2-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_124 {
CUDNN_VERSION=9.1.0.70
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux_sbsa.run
chmod +x cuda_12.4.1_550.54.15_linux_sbsa.run
./cuda_12.4.1_550.54.15_linux_sbsa.run --toolkit --silent
rm -f cuda_12.4.1_550.54.15_linux_sbsa.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b ${NCCL_VERSION} --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_062
ldconfig
}
function prune_124 {
echo "Pruning CUDA 12.4"
#####################################################################################
# CUDA 12.4 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.4/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.4/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.4 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.4/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/
}
function install_126 {
echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"
rm -rf /usr/local/cuda-12.6 /usr/local/cuda
# install CUDA 12.6.3 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux_sbsa.run
chmod +x cuda_12.6.3_560.35.05_linux_sbsa.run
./cuda_12.6.3_560.35.05_linux_sbsa.run --toolkit --silent
rm -f cuda_12.6.3_560.35.05_linux_sbsa.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.6 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b ${NCCL_VERSION} --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_063
ldconfig
}
function prune_126 {
echo "Pruning CUDA 12.6"
#####################################################################################
# CUDA 12.6 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.6/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.6/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then
export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.6 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.6/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.3.2 $CUDA_BASE/nsight-systems-2024.5.1/
}
# idiomatic parameter and option handling in sh
while test $# -gt 0
do
case "$1" in
12.4) install_124; prune_124
;;
12.6) install_126; prune_126
;;
*) echo "bad argument $1"; exit 1
;;
esac
shift
done
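
The argument loop above dispatches one install and prune pair per requested CUDA version and stops at the first unknown argument. A minimal dry-run sketch of that pattern; `plan_cuda_steps` is a hypothetical name, and it only prints the steps rather than running them.

```shell
# Hypothetical dry-run version of the dispatch loop: prints the functions
# that would be called for each argument and fails on an unknown version.
plan_cuda_steps() {
  while [ $# -gt 0 ]; do
    case "$1" in
      12.4) echo "install_124 prune_124" ;;
      12.6) echo "install_126 prune_126" ;;
      *)    echo "bad argument $1" >&2; return 1 ;;
    esac
    shift
  done
}
```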

View File

@ -4,7 +4,9 @@ if [[ -n "${CUDNN_VERSION}" ]]; then
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn
pushd tmp_cudnn
if [[ ${CUDA_VERSION:0:4} == "12.6" ]]; then
if [[ ${CUDA_VERSION:0:4} == "12.8" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.8.0.87_cuda12-archive"
elif [[ ${CUDA_VERSION:0:4} == "12.6" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.5.1.17_cuda12-archive"
elif [[ ${CUDA_VERSION:0:2} == "12" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda12-archive"

@ -5,7 +5,15 @@ set -ex
# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && cd tmp_cusparselt
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[2-6]$ ]]; then
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[5-8]$ ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.3.2-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "12.4" ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
@ -13,17 +21,11 @@ if [[ ${CUDA_VERSION:0:4} =~ ^12\.[2-6]$ ]]; then
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.2.3-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.5.2.1-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then
CUSPARSELT_NAME="libcusparse_lt-linux-x86_64-0.4.0.7-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/${CUSPARSELT_NAME}.tar.xz
else
echo "Not sure which libcusparselt version to install for this ${CUDA_VERSION}"
fi
tar xf ${CUSPARSELT_NAME}.tar.xz
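
The selection logic in this hunk keys off the first four characters of `CUDA_VERSION` and the target architecture. A minimal sketch of that pattern, assuming only the two CUDA 12 branches visible on the new side of the diff (12.5 to 12.8 and 12.4); `cusparselt_version` and `cusparselt_arch_path` are hypothetical helper names, not part of the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a CUDA version to the cuSPARSELt release used above.
cusparselt_version() {
  local cuda_version="$1"
  # ${var:0:4} keeps the first four characters, so "12.8.1" becomes "12.8".
  if [[ ${cuda_version:0:4} =~ ^12\.[5-8]$ ]]; then
    echo "0.6.3.2"
  elif [[ ${cuda_version:0:4} == "12.4" ]]; then
    echo "0.6.2.3"
  else
    echo "Not sure which libcusparselt version to install for ${cuda_version}" >&2
    return 1
  fi
}

# Hypothetical helper: amd64 and x86_64 map to x86_64, anything else to sbsa.
cusparselt_arch_path() {
  local target="${1:-$(uname -m)}"
  if [ "${target}" = 'amd64' ] || [ "${target}" = 'x86_64' ]; then
    echo 'x86_64'
  else
    echo 'sbsa'
  fi
}

echo "libcusparse_lt-linux-$(cusparselt_arch_path amd64)-$(cusparselt_version 12.8.1)-archive"
```

Unlike the script, this sketch returns a non-zero status for an unknown version instead of only printing a message.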

@ -1,38 +0,0 @@
#!/bin/bash
set -ex
install_ubuntu() {
apt-get update
# Cleanup
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
}
install_centos() {
# Need EPEL for many packages we depend on.
# See http://fedoraproject.org/wiki/EPEL
yum --enablerepo=extras install -y epel-release
# Cleanup
yum clean all
rm -rf /var/cache/yum
rm -rf /var/lib/yum/yumdb
rm -rf /var/lib/yum/history
}
# Install base packages depending on the base OS
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
install_ubuntu
;;
centos)
install_centos
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac

@ -13,7 +13,7 @@ clone_executorch() {
# and fetch the target commit
pushd executorch
git checkout "${EXECUTORCH_PINNED_COMMIT}"
git submodule update --init
git submodule update --init --recursive
popd
chown -R jenkins executorch
@ -37,7 +37,12 @@ install_conda_dependencies() {
install_pip_dependencies() {
pushd executorch
as_jenkins bash install_requirements.sh --pybind xnnpack
as_jenkins bash install_executorch.sh
# A workaround, ExecuTorch has moved to numpy 2.0 which is not compatible with the current
# numba and scipy version used in PyTorch CI
conda_run pip uninstall -y numba scipy
popd
}
@ -45,10 +50,9 @@ setup_executorch() {
pushd executorch
export PYTHON_EXECUTABLE=python
export EXECUTORCH_BUILD_PYBIND=ON
export CMAKE_ARGS="-DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"
export CMAKE_ARGS="-DEXECUTORCH_BUILD_PYBIND=ON -DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"
as_jenkins .ci/scripts/setup-linux.sh cmake || true
as_jenkins .ci/scripts/setup-linux.sh --build-tool cmake || true
popd
}

@ -17,7 +17,7 @@ if [ -n "${UBUNTU_VERSION}" ];then
libopenblas-dev libeigen3-dev libatlas-base-dev libzstd-dev
fi
conda_install numpy scipy imageio cmake ninja
pip_install numpy scipy imageio cmake ninja
git clone --depth 1 --branch release/16.x --recursive https://github.com/llvm/llvm-project.git
cmake -DCMAKE_BUILD_TYPE=Release \
@ -35,7 +35,9 @@ git clone https://github.com/halide/Halide.git
pushd Halide
git checkout ${COMMIT} && git submodule update --init --recursive
pip_install -r requirements.txt
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -S . -B build
# NOTE: pybind has a requirement for cmake > 3.5 so set the minimum cmake version here with a flag
# Context: https://github.com/pytorch/pytorch/issues/150420
cmake -G Ninja -DCMAKE_POLICY_VERSION_MINIMUM=3.5 -DCMAKE_BUILD_TYPE=Release -S . -B build
cmake --build build
test -e ${CONDA_PREFIX}/lib/python3 || ln -s python${ANACONDA_PYTHON_VERSION} ${CONDA_PREFIX}/lib/python3
cmake --install build --prefix ${CONDA_PREFIX}

@ -14,16 +14,9 @@ function install_timm() {
local commit
commit=$(get_pinned_commit timm)
# TODO (huydhn): There is no torchvision release on 3.13 when I write this, so
# I'm using nightly here instead. We just need the package to be able to install
# TIMM. Remove this once vision has a release on 3.13
if [[ "${ANACONDA_PYTHON_VERSION}" == "3.13" ]]; then
pip_install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu124
fi
pip_install "git+https://github.com/huggingface/pytorch-image-models@${commit}"
# Clean up
conda_run pip uninstall -y cmake torch torchvision triton
conda_run pip uninstall -y torch torchvision triton
}
# Pango is needed for weasyprint which is needed for doctr

@ -2,8 +2,6 @@
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
if [ -n "${UBUNTU_VERSION}" ]; then
apt update
apt-get install -y clang doxygen git graphviz nodejs npm libtinfo5
@ -15,8 +13,8 @@ chown -R jenkins pytorch
pushd pytorch
# Install all linter dependencies
pip_install -r requirements.txt
conda_run lintrunner init
pip install -r requirements.txt
lintrunner init
# Cache .lintbin directory as part of the Docker image
cp -r .lintbin /tmp

@ -1,26 +1,23 @@
#!/usr/bin/env bash
# Script that replaces the magma install from a conda package
# Script that installs magma from tarball inside conda environment.
# It replaces anaconda magma-cuda package which is no longer published.
# Execute it inside active conda environment.
# See issue: https://github.com/pytorch/pytorch/issues/138506
set -eou pipefail
function do_install() {
cuda_version_nodot=${1/./}
anaconda_python_version=$2
cuda_version_nodot=${1/./}
anaconda_dir=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
MAGMA_VERSION="2.6.1"
magma_archive="magma-cuda${cuda_version_nodot}-${MAGMA_VERSION}-1.tar.bz2"
anaconda_dir="/opt/conda/envs/py_${anaconda_python_version}"
(
set -x
tmp_dir=$(mktemp -d)
pushd ${tmp_dir}
curl -OLs https://ossci-linux.s3.us-east-1.amazonaws.com/${magma_archive}
tar -xvf "${magma_archive}"
mv include/* "${anaconda_dir}/include/"
mv lib/* "${anaconda_dir}/lib"
popd
)
}
do_install $1 $2
MAGMA_VERSION="2.6.1"
magma_archive="magma-cuda${cuda_version_nodot}-${MAGMA_VERSION}-1.tar.bz2"
(
set -x
tmp_dir=$(mktemp -d)
pushd ${tmp_dir}
curl -OLs https://ossci-linux.s3.us-east-1.amazonaws.com/${magma_archive}
tar -xvf "${magma_archive}"
mv include/* "${anaconda_dir}/include/"
mv lib/* "${anaconda_dir}/lib"
popd
)

@ -0,0 +1,26 @@
#!/bin/bash
set -ex
NCCL_VERSION=""
if [[ ${CUDA_VERSION:0:2} == "11" ]]; then
NCCL_VERSION=$(cat ci_commit_pins/nccl-cu11.txt)
elif [[ ${CUDA_VERSION:0:2} == "12" ]]; then
NCCL_VERSION=$(cat ci_commit_pins/nccl-cu12.txt)
else
echo "Unexpected CUDA_VERSION ${CUDA_VERSION}"
exit 1
fi
if [[ -n "${NCCL_VERSION}" ]]; then
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
pushd nccl
make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
popd
rm -rf nccl
ldconfig
fi
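
The new NCCL script picks its pinned tag from the CUDA major version. A minimal sketch of that lookup, assuming the pin files live under `ci_commit_pins/` as in the script; `nccl_pin_file` is a hypothetical helper name introduced only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose the NCCL pin file from the CUDA major version.
nccl_pin_file() {
  local cuda_version="$1"
  # ${var:0:2} keeps the first two characters, e.g. "12" from "12.8".
  case "${cuda_version:0:2}" in
    11) echo "ci_commit_pins/nccl-cu11.txt" ;;
    12) echo "ci_commit_pins/nccl-cu12.txt" ;;
    *)  echo "Unexpected CUDA_VERSION ${cuda_version}" >&2; return 1 ;;
  esac
}

nccl_pin_file 12.8   # prints ci_commit_pins/nccl-cu12.txt
```

The script reads the chosen file with `cat` and passes its contents to `git clone -b`; the sketch stops at the file name so it runs without the pin files present.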

@ -4,10 +4,15 @@ set -ex
[ -n "$NINJA_VERSION" ]
url="https://github.com/ninja-build/ninja/releases/download/v${NINJA_VERSION}/ninja-linux.zip"
arch=$(uname -m)
if [ "$arch" == "aarch64" ]; then
url="https://github.com/ninja-build/ninja/releases/download/v${NINJA_VERSION}/ninja-linux-aarch64.zip"
else
url="https://github.com/ninja-build/ninja/releases/download/v${NINJA_VERSION}/ninja-linux.zip"
fi
pushd /tmp
wget --no-verbose --output-document=ninja-linux.zip "$url"
unzip ninja-linux.zip -d /usr/local/bin
rm -f ninja-linux.zip
popd
popd

@ -31,15 +31,15 @@ pip_install \
pip_install coloredlogs packaging
pip_install onnxruntime==1.18.1
pip_install onnx==1.16.2
pip_install onnxscript==0.1.0.dev20241124 --no-deps
pip_install onnx==1.17.0
pip_install onnxscript==0.2.2 --no-deps
# required by onnxscript
pip_install ml_dtypes
# Cache the transformers model to be used later by ONNX tests. We need to run the transformers
# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/
IMPORT_SCRIPT_FILENAME="/tmp/onnx_import_script.py"
as_jenkins echo 'import transformers; transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3");' > "${IMPORT_SCRIPT_FILENAME}"
as_jenkins echo 'import transformers; transformers.GPTJForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gptj");' > "${IMPORT_SCRIPT_FILENAME}"
# Need a PyTorch version for transformers to work
pip_install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu

@ -4,7 +4,7 @@
set -ex
cd /
git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.28 --depth 1 --shallow-submodules
git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.29 --depth 1 --shallow-submodules
OPENBLAS_BUILD_FLAGS="

@ -1,19 +0,0 @@
#!/bin/bash
set -ex
pb_dir="/usr/temp_pb_install_dir"
mkdir -p $pb_dir
# On the nvidia/cuda:9-cudnn7-devel-centos7 image we need this symlink or
# else it will fail with
# g++: error: ./../lib64/crti.o: No such file or directory
ln -s /usr/lib64 "$pb_dir/lib64"
curl -LO "https://github.com/protocolbuffers/protobuf/releases/download/v3.17.3/protobuf-all-3.17.3.tar.gz" --retry 3
tar -xvz --no-same-owner -C "$pb_dir" --strip-components 1 -f protobuf-all-3.17.3.tar.gz
NPROC=$[$(nproc) - 2]
pushd "$pb_dir" && ./configure && make -j${NPROC} && make -j${NPROC} check && sudo make -j${NPROC} install && sudo ldconfig
popd
rm -rf $pb_dir

@ -0,0 +1,15 @@
#!/bin/bash
set -ex
apt-get update
# Use deadsnakes in case we need an older python version
sudo add-apt-repository ppa:deadsnakes/ppa
apt-get install -y python${PYTHON_VERSION} python${PYTHON_VERSION}-dev python3-pip python${PYTHON_VERSION}-venv
# Use a venv because uv and some other package managers don't support --user install
ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python
python -m venv /var/lib/jenkins/ci_env
source /var/lib/jenkins/ci_env/bin/activate
python -mpip install --upgrade pip
python -mpip install -r /opt/requirements-ci.txt

@ -8,10 +8,6 @@ ver() {
install_ubuntu() {
apt-get update
if [[ $UBUNTU_VERSION == 18.04 ]]; then
# gpg-agent is not available by default on 18.04
apt-get install -y --no-install-recommends gpg-agent
fi
if [[ $UBUNTU_VERSION == 20.04 ]]; then
# gpg-agent is not available by default on 20.04
apt-get install -y --no-install-recommends gpg-agent
@ -23,6 +19,13 @@ install_ubuntu() {
apt-get install -y libc++1
apt-get install -y libc++abi1
# Make sure rocm packages from repo.radeon.com have highest priority
cat << EOF > /etc/apt/preferences.d/rocm-pin-600
Package: *
Pin: release o=repo.radeon.com
Pin-Priority: 600
EOF
# Add amdgpu repository
UBUNTU_VERSION_NAME=`cat /etc/os-release | grep UBUNTU_CODENAME | awk -F= '{print $2}'`
echo "deb [arch=amd64] https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list
@ -62,6 +65,30 @@ install_ubuntu() {
sqlite3 $kdb "PRAGMA journal_mode=off; PRAGMA VACUUM;"
done
# ROCm 6.3 had a regression where initializing static code objects had significant overhead
# ROCm 6.4 did not yet fix the regression, also HIP branch names are different
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]] || [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]]; then
HIP_BRANCH=rocm-6.3.x
VER_STR=6.3
elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
HIP_BRANCH=release/rocm-rel-6.4
VER_STR=6.4
fi
# clr build needs CppHeaderParser but can only find it using conda's python
/opt/conda/bin/python -m pip install CppHeaderParser
git clone https://github.com/ROCm/HIP -b $HIP_BRANCH
HIP_COMMON_DIR=$(readlink -f HIP)
git clone https://github.com/jeffdaily/clr -b release/rocm-rel-${VER_STR}-statco-hotfix
mkdir -p clr/build
pushd clr/build
cmake .. -DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=$HIP_COMMON_DIR
make -j
cp hipamd/lib/libamdhip64.so.${VER_STR}.* /opt/rocm/lib/libamdhip64.so.${VER_STR}.*
popd
rm -rf HIP clr
fi
# Cleanup
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

@ -115,7 +115,7 @@ index a5007ffc..13fa07fc 100644
if (!fp) {
- fprintf(stderr, "%s: %s\n", AMDGPU_ASIC_ID_TABLE,
- strerror(errno));
+ fprintf(stderr, "amdgpu.ids: No such file or directory\n");
+ //fprintf(stderr, "amdgpu.ids: No such file or directory\n");
return;
}

@ -1,50 +1,32 @@
#!/bin/bash
# Script used in CI and CD pipeline
#!/usr/bin/env bash
# Script used only in CD pipeline
set -ex
set -eou pipefail
# Magma build scripts need `python`
ln -sf /usr/bin/python3 /usr/bin/python
function do_install() {
rocm_version=$1
rocm_version_nodot=${1//./}
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
almalinux)
yum install -y gcc-gfortran
;;
*)
echo "No preinstalls to build magma..."
;;
esac
# Version 2.7.2 + ROCm related updates
MAGMA_VERSION=a1625ff4d9bc362906bd01f805dbbe12612953f6
magma_archive="magma-rocm${rocm_version_nodot}-${MAGMA_VERSION}-1.tar.bz2"
MKLROOT=${MKLROOT:-/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION}
rocm_dir="/opt/rocm"
(
set -x
tmp_dir=$(mktemp -d)
pushd ${tmp_dir}
curl -OLs https://ossci-linux.s3.us-east-1.amazonaws.com/${magma_archive}
if tar -xvf "${magma_archive}"
then
mkdir -p "${rocm_dir}/magma"
mv include "${rocm_dir}/magma/include"
mv lib "${rocm_dir}/magma/lib"
else
echo "${magma_archive} not found, skipping magma install"
fi
popd
)
}
# "install" hipMAGMA into /opt/rocm/magma by copying after build
git clone https://bitbucket.org/icl/magma.git
pushd magma
# Version 2.7.2 + ROCm related updates
git checkout a1625ff4d9bc362906bd01f805dbbe12612953f6
cp make.inc-examples/make.inc.hip-gcc-mkl make.inc
echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc
if [[ -f "${MKLROOT}/lib/libmkl_core.a" ]]; then
echo 'LIB = -Wl,--start-group -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -Wl,--end-group -lpthread -lstdc++ -lm -lgomp -lhipblas -lhipsparse' >> make.inc
fi
echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib -ldl' >> make.inc
echo 'DEVCCFLAGS += --gpu-max-threads-per-block=256' >> make.inc
export PATH="${PATH}:/opt/rocm/bin"
if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then
amdgpu_targets=`echo $PYTORCH_ROCM_ARCH | sed 's/;/ /g'`
else
amdgpu_targets=`rocm_agent_enumerator | grep -v gfx000 | sort -u | xargs`
fi
for arch in $amdgpu_targets; do
echo "DEVCCFLAGS += --offload-arch=$arch" >> make.inc
done
# hipcc with openmp flag may cause isnan() on __device__ not to be found; depending on context, compiler may attempt to match with host definition
sed -i 's/^FOPENMP/#FOPENMP/g' make.inc
make -f make.gen.hipMAGMA -j $(nproc)
LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT="${MKLROOT}"
make testing/testing_dgemm -j $(nproc) MKLROOT="${MKLROOT}"
popd
mv magma /opt/rocm
do_install $1

@ -1,24 +0,0 @@
#!/bin/bash
set -ex
[ -n "${SWIFTSHADER}" ]
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
_https_amazon_aws=https://ossci-android.s3.amazonaws.com
# SwiftShader
_swiftshader_dir=/var/lib/jenkins/swiftshader
_swiftshader_file_targz=swiftshader-abe07b943-prebuilt.tar.gz
mkdir -p $_swiftshader_dir
_tmp_swiftshader_targz="/tmp/${_swiftshader_file_targz}"
curl --silent --show-error --location --fail --retry 3 \
--output "${_tmp_swiftshader_targz}" "$_https_amazon_aws/${_swiftshader_file_targz}"
tar -C "${_swiftshader_dir}" -xzf "${_tmp_swiftshader_targz}"
export VK_ICD_FILENAMES="${_swiftshader_dir}/build/Linux/vk_swiftshader_icd.json"

@ -2,14 +2,16 @@
set -ex
mkdir -p /opt/triton
if [ -z "${TRITON}" ] && [ -z "${TRITON_CPU}" ]; then
echo "TRITON and TRITON_CPU are not set. Exiting..."
exit 0
fi
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
get_conda_version() {
as_jenkins conda list -n py_$ANACONDA_PYTHON_VERSION | grep -w $* | head -n 1 | awk '{print $2}'
}
conda_reinstall() {
as_jenkins conda install -q -n py_$ANACONDA_PYTHON_VERSION -y --force-reinstall $*
get_pip_version() {
conda_run pip list | grep -w $* | head -n 1 | awk '{print $2}'
}
if [ -n "${XPU_VERSION}" ]; then
@ -31,11 +33,9 @@ if [ -n "${UBUNTU_VERSION}" ];then
apt-get install -y gpg-agent
fi
if [ -n "${CONDA_CMAKE}" ]; then
# Keep the current cmake and numpy version here, so we can reinstall them later
CMAKE_VERSION=$(get_conda_version cmake)
NUMPY_VERSION=$(get_conda_version numpy)
fi
# Keep the current cmake and numpy version here, so we can reinstall them later
CMAKE_VERSION=$(get_pip_version cmake)
NUMPY_VERSION=$(get_pip_version numpy)
if [ -z "${MAX_JOBS}" ]; then
export MAX_JOBS=$(nproc)
@ -52,6 +52,7 @@ cd triton
as_jenkins git checkout ${TRITON_PINNED_COMMIT}
as_jenkins git submodule update --init --recursive
cd python
pip_install pybind11==2.13.6
# TODO: remove patch setup.py once we have a proper fix for https://github.com/triton-lang/triton/issues/4527
as_jenkins sed -i -e 's/https:\/\/tritonlang.blob.core.windows.net\/llvm-builds/https:\/\/oaitriton.blob.core.windows.net\/public\/llvm-builds/g' setup.py
@ -60,28 +61,35 @@ if [ -n "${UBUNTU_VERSION}" ] && [ -n "${GCC_VERSION}" ] && [[ "${GCC_VERSION}"
# Triton needs at least gcc-9 to build
apt-get install -y g++-9
CXX=g++-9 pip_install -e .
CXX=g++-9 conda_run python setup.py bdist_wheel
elif [ -n "${UBUNTU_VERSION}" ] && [ -n "${CLANG_VERSION}" ]; then
# Triton needs <filesystem> which surprisingly is not available with clang-9 toolchain
add-apt-repository -y ppa:ubuntu-toolchain-r/test
apt-get install -y g++-9
CXX=g++-9 pip_install -e .
CXX=g++-9 conda_run python setup.py bdist_wheel
else
pip_install -e .
conda_run python setup.py bdist_wheel
fi
if [ -n "${CONDA_CMAKE}" ]; then
# TODO: This is to make sure that the same cmake and numpy version from install conda
# script is used. Without this step, the newer cmake version (3.25.2) downloaded by
# triton build step via pip will fail to detect conda MKL. Once that issue is fixed,
# this can be removed.
#
# The correct numpy version also needs to be set here because conda claims that it
# causes inconsistent environment. Without this, conda will attempt to install the
# latest numpy version, which fails ASAN tests with the following import error: Numba
# needs NumPy 1.20 or less.
conda_reinstall cmake="${CMAKE_VERSION}"
# Note that we install numpy with pip as conda might not have the version we want
pip_install --force-reinstall numpy=="${NUMPY_VERSION}"
# Copy the wheel to /opt for multi stage docker builds
cp dist/*.whl /opt/triton
# Install the wheel for docker builds that don't use multi stage
pip_install dist/*.whl
# TODO: This is to make sure that the same cmake and numpy version from install conda
# script is used. Without this step, the newer cmake version (3.25.2) downloaded by
# triton build step via pip will fail to detect conda MKL. Once that issue is fixed,
# this can be removed.
#
# The correct numpy version also needs to be set here because conda claims that it
# causes inconsistent environment. Without this, conda will attempt to install the
# latest numpy version, which fails ASAN tests with the following import error: Numba
# needs NumPy 1.20 or less.
# Note that we install numpy with pip as conda might not have the version we want
if [ -n "${CMAKE_VERSION}" ]; then
pip_install "cmake==${CMAKE_VERSION}"
fi
if [ -n "${NUMPY_VERSION}" ]; then
pip_install "numpy==${NUMPY_VERSION}"
fi

@ -8,6 +8,12 @@ else
with_cuda=no
fi
if [[ -d "/opt/rocm" ]]; then
with_rocm=/opt/rocm
else
with_rocm=no
fi
function install_ucx() {
set -ex
git clone --recursive https://github.com/openucx/ucx.git
@ -19,6 +25,7 @@ function install_ucx() {
./configure --prefix=$UCX_HOME \
--enable-mt \
--with-cuda=$with_cuda \
--with-rocm=$with_rocm \
--enable-profiling \
--enable-stats
time make -j
@ -36,12 +43,29 @@ function install_ucc() {
git submodule update --init --recursive
./autogen.sh
# We only run distributed tests on Tesla M60 and A10G
NVCC_GENCODE="-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=compute_86"
if [[ -n "$ROCM_VERSION" ]]; then
if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then
amdgpu_targets=`echo $PYTORCH_ROCM_ARCH | sed 's/;/ /g'`
else
amdgpu_targets=`rocm_agent_enumerator | grep -v gfx000 | sort -u | xargs`
fi
for arch in $amdgpu_targets; do
HIP_OFFLOAD="$HIP_OFFLOAD --offload-arch=$arch"
done
else
HIP_OFFLOAD="all-arch-no-native"
fi
./configure --prefix=$UCC_HOME \
--with-ucx=$UCX_HOME \
--with-cuda=$with_cuda \
--with-nvcc-gencode="${NVCC_GENCODE}"
--with-nvcc-gencode="${NVCC_GENCODE}" \
--with-rocm=$with_rocm \
--with-rocm-arch="${HIP_OFFLOAD}"
time make -j
sudo make install
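
The ROCm branch added to `install_ucc` turns the semicolon-separated `PYTORCH_ROCM_ARCH` list into repeated `--offload-arch` flags. A minimal sketch of that conversion; `hip_offload_flags` is a hypothetical helper name, and unlike the script it avoids a leading space in the result.

```shell
#!/usr/bin/env bash
# Hypothetical helper: "gfx900;gfx906" -> "--offload-arch=gfx900 --offload-arch=gfx906"
hip_offload_flags() {
  local arch_list="$1"
  local flags=""
  local arch
  # Replace ';' with spaces so the unquoted expansion splits into one word per arch.
  for arch in $(echo "${arch_list}" | sed 's/;/ /g'); do
    # ${flags:+${flags} } adds a separating space only when flags is non-empty.
    flags="${flags:+${flags} }--offload-arch=${arch}"
  done
  echo "${flags}"
}

hip_offload_flags "gfx900;gfx906;gfx942"
```

When `PYTORCH_ROCM_ARCH` is unset, the script falls back to `rocm_agent_enumerator` to discover targets; that path needs a ROCm install and is left out of the sketch.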

@ -1,24 +0,0 @@
#!/bin/bash
set -ex
[ -n "${VULKAN_SDK_VERSION}" ]
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
_vulkansdk_dir=/var/lib/jenkins/vulkansdk
_tmp_vulkansdk_targz=/tmp/vulkansdk.tar.gz
curl \
--silent \
--show-error \
--location \
--fail \
--retry 3 \
--output "${_tmp_vulkansdk_targz}" "https://ossci-android.s3.amazonaws.com/vulkansdk-linux-x86_64-${VULKAN_SDK_VERSION}.tar.gz"
mkdir -p "${_vulkansdk_dir}"
tar -C "${_vulkansdk_dir}" -xzf "${_tmp_vulkansdk_targz}" --strip-components 1
rm -rf "${_tmp_vulkansdk_targz}"

@ -26,7 +26,7 @@ function install_ubuntu() {
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor > /usr/share/keyrings/oneapi-archive-keyring.gpg.gpg
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg.gpg] \
https://apt.repos.intel.com/${XPU_REPO_NAME} all main" \
https://apt.repos.intel.com/oneapi all main" \
| tee /etc/apt/sources.list.d/oneAPI.list
# Update the packages list and repository index
@ -74,7 +74,7 @@ function install_rhel() {
tee > /etc/yum.repos.d/oneAPI.repo << EOF
[oneAPI]
name=Intel for Pytorch GPU dev repository
baseurl=https://yum.repos.intel.com/${XPU_REPO_NAME}
baseurl=https://yum.repos.intel.com/oneapi
enabled=1
gpgcheck=1
repo_gpgcheck=1
@ -118,7 +118,7 @@ function install_sles() {
https://repositories.intel.com/gpu/sles/${VERSION_SP}${XPU_DRIVER_VERSION}/unified/intel-gpu-${VERSION_SP}.repo
rpm --import https://repositories.intel.com/gpu/intel-graphics.key
# To add the online network package repository for the Intel Support Packages
zypper addrepo https://yum.repos.intel.com/${XPU_REPO_NAME} oneAPI
zypper addrepo https://yum.repos.intel.com/oneapi oneAPI
rpm --import https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
# The xpu-smi packages
@ -141,10 +141,10 @@ if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then
XPU_DRIVER_VERSION=""
fi
XPU_REPO_NAME="intel-for-pytorch-gpu-dev"
XPU_PACKAGES="intel-for-pytorch-gpu-dev-0.5 intel-pti-dev-0.9"
if [[ "$XPU_VERSION" == "2025.0" ]]; then
XPU_REPO_NAME="oneapi"
# Default use Intel® oneAPI Deep Learning Essentials 2025.0
if [[ "$XPU_VERSION" == "2025.1" ]]; then
XPU_PACKAGES="intel-deep-learning-essentials-2025.1"
else
XPU_PACKAGES="intel-deep-learning-essentials-2025.0"
fi

@ -49,6 +49,9 @@ RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM cpu as cuda
ADD ./common/install_cuda.sh install_cuda.sh
ADD ./common/install_magma.sh install_magma.sh
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
COPY ./common/install_cusparselt.sh install_cusparselt.sh
ENV CUDA_HOME /usr/local/cuda
FROM cuda as cuda11.8
@ -56,11 +59,6 @@ RUN bash ./install_cuda.sh 11.8
RUN bash ./install_magma.sh 11.8
RUN ln -sf /usr/local/cuda-11.8 /usr/local/cuda
FROM cuda as cuda12.1
RUN bash ./install_cuda.sh 12.1
RUN bash ./install_magma.sh 12.1
RUN ln -sf /usr/local/cuda-12.1 /usr/local/cuda
FROM cuda as cuda12.4
RUN bash ./install_cuda.sh 12.4
RUN bash ./install_magma.sh 12.4
@ -71,7 +69,13 @@ RUN bash ./install_cuda.sh 12.6
RUN bash ./install_magma.sh 12.6
RUN ln -sf /usr/local/cuda-12.6 /usr/local/cuda
FROM cuda as cuda12.8
RUN bash ./install_cuda.sh 12.8
RUN bash ./install_magma.sh 12.8
RUN ln -sf /usr/local/cuda-12.8 /usr/local/cuda
FROM cpu as rocm
ARG ROCM_VERSION
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
ENV MKLROOT /opt/intel
@ -86,18 +90,11 @@ ADD ./common/install_rocm_magma.sh install_rocm_magma.sh
# gfortran and python needed for building magma from source for ROCm
RUN apt-get update -y && \
apt-get install gfortran -y && \
apt-get install python -y && \
apt-get install python3 python-is-python3 -y && \
apt-get clean
RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh
RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh
# Install AOTriton
COPY ./common/common_utils.sh common_utils.sh
COPY ./aotriton_version.txt aotriton_version.txt
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN bash ./install_aotriton.sh /opt/rocm && rm install_aotriton.sh aotriton_version.txt
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton
RUN bash ./install_rocm_magma.sh ${ROCM_VERSION} && rm install_rocm_magma.sh
FROM ${BASE_TARGET} as final
COPY --from=openssl /opt/openssl /opt/openssl

@ -1,83 +1,63 @@
#!/usr/bin/env bash
# Script used only in CD pipeline
set -eou pipefail
set -eoux pipefail
image="$1"
shift
if [ -z "${image}" ]; then
echo "Usage: $0 IMAGE"
echo "Usage: $0 IMAGENAME:ARCHTAG"
exit 1
fi
DOCKER_IMAGE="pytorch/${image}"
TOPDIR=$(git rev-parse --show-toplevel)
GPU_ARCH_TYPE=${GPU_ARCH_TYPE:-cpu}
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
WITH_PUSH=${WITH_PUSH:-}
DOCKER=${DOCKER:-docker}
case ${GPU_ARCH_TYPE} in
# Go from imagename:tag to tag
DOCKER_TAG_PREFIX=$(echo "${image}" | awk -F':' '{print $2}')
GPU_ARCH_VERSION=""
if [[ "${DOCKER_TAG_PREFIX}" == cuda* ]]; then
# extract cuda version from image name. e.g. manylinux2_28-builder:cuda12.8 returns 12.8
GPU_ARCH_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'cuda' '{print $2}')
elif [[ "${DOCKER_TAG_PREFIX}" == rocm* ]]; then
# extract rocm version from image name. e.g. manylinux2_28-builder:rocm6.2.4 returns 6.2.4
GPU_ARCH_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'rocm' '{print $2}')
fi
case ${DOCKER_TAG_PREFIX} in
cpu)
BASE_TARGET=cpu
DOCKER_TAG=cpu
GPU_IMAGE=ubuntu:20.04
DOCKER_GPU_BUILD_ARG=""
;;
cuda)
cuda*)
BASE_TARGET=cuda${GPU_ARCH_VERSION}
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
GPU_IMAGE=ubuntu:20.04
DOCKER_GPU_BUILD_ARG=""
;;
rocm)
rocm*)
BASE_TARGET=rocm
DOCKER_TAG=rocm${GPU_ARCH_VERSION}
GPU_IMAGE=rocm/dev-ubuntu-20.04:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx942"
DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}"
GPU_IMAGE=rocm/dev-ubuntu-22.04:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg ROCM_VERSION=${GPU_ARCH_VERSION}"
;;
*)
echo "ERROR: Unrecognized GPU_ARCH_TYPE: ${GPU_ARCH_TYPE}"
echo "ERROR: Unrecognized DOCKER_TAG_PREFIX: ${DOCKER_TAG_PREFIX}"
exit 1
;;
esac
tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
(
set -x
DOCKER_BUILDKIT=1 ${DOCKER} build \
--target final \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--build-arg "BASE_TARGET=${BASE_TARGET}" \
-t "${DOCKER_IMAGE}" \
$@ \
-f "${TOPDIR}/.ci/docker/libtorch/Dockerfile" \
"${TOPDIR}/.ci/docker/"
)
GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}
GIT_BRANCH_NAME=${GITHUB_REF##*/}
GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}
DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE}-${GIT_BRANCH_NAME}
DOCKER_IMAGE_SHA_TAG=${DOCKER_IMAGE}-${GIT_COMMIT_SHA}
if [[ "${WITH_PUSH}" == true ]]; then
(
set -x
${DOCKER} push "${DOCKER_IMAGE}"
if [[ -n ${GITHUB_REF} ]]; then
${DOCKER} tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_BRANCH_TAG}
${DOCKER} tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_SHA_TAG}
${DOCKER} push "${DOCKER_IMAGE_BRANCH_TAG}"
${DOCKER} push "${DOCKER_IMAGE_SHA_TAG}"
fi
)
fi
DOCKER_BUILDKIT=1 ${DOCKER} build \
--target final \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--build-arg "BASE_TARGET=${BASE_TARGET}" \
-t "${tmp_tag}" \
$@ \
-f "${TOPDIR}/.ci/docker/libtorch/Dockerfile" \
"${TOPDIR}/.ci/docker/"

@ -18,28 +18,31 @@ COPY ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/install_magma_conda.sh install_magma_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
ARG PYTHON_VERSION
ARG PIP_CMAKE
# Put venv into the env vars so users don't need to activate it
ENV PATH /var/lib/jenkins/ci_env/bin:$PATH
ENV VIRTUAL_ENV /var/lib/jenkins/ci_env
COPY requirements-ci.txt /opt/requirements-ci.txt
COPY ./common/install_python.sh install_python.sh
RUN bash ./install_python.sh && rm install_python.sh /opt/requirements-ci.txt
# Install cuda and cudnn
ARG CUDA_VERSION
COPY ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
COPY ./common/install_cusparselt.sh install_cusparselt.sh
RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh install_nccl.sh /ci_commit_pins/nccl-cu* install_cusparselt.sh
ENV DESIRED_CUDA ${CUDA_VERSION}
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH
# Note that Docker build forbids copying file outside the build context
COPY ./common/install_linter.sh install_linter.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_linter.sh
RUN rm install_linter.sh common_utils.sh
RUN rm install_linter.sh
RUN chown -R jenkins:jenkins /var/lib/jenkins/ci_env
USER jenkins
CMD ["bash"]

@ -15,20 +15,17 @@ COPY ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
ARG PYTHON_VERSION
ENV PATH /var/lib/jenkins/ci_env/bin:$PATH
ENV VIRTUAL_ENV /var/lib/jenkins/ci_env
COPY requirements-ci.txt /opt/requirements-ci.txt
COPY ./common/install_python.sh install_python.sh
RUN bash ./install_python.sh && rm install_python.sh /opt/requirements-ci.txt
# Note that Docker build forbids copying files outside the build context
COPY ./common/install_linter.sh install_linter.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_linter.sh
RUN rm install_linter.sh common_utils.sh
RUN rm install_linter.sh
USER jenkins
CMD ["bash"]


@ -1,207 +0,0 @@
# syntax = docker/dockerfile:experimental
ARG ROCM_VERSION=3.7
ARG BASE_CUDA_VERSION=11.8
ARG GPU_IMAGE=centos:7
FROM centos:7 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=9
# Note: This patch is required since CentOS has reached EOL;
# otherwise any yum install step will fail
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which perl zlib-devel
# Just add everything as a safe.directory for git since these will be used in multiple places with git
RUN git config --global --add safe.directory '*'
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
# Note: After running yum-config-manager --enable rhel-server-rhscl-7-rpms
# the patch is required once again. Somehow this step adds mirror.centos.org back
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
RUN yum --enablerepo=extras install -y epel-release
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake
RUN yum install -y autoconf aclocal automake make sudo
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# EPEL for cmake
FROM base as patchelf
# Install patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh
RUN cp $(which patchelf) /patchelf
FROM patchelf as python
# build python
COPY manywheel/build_scripts /build_scripts
ADD ./common/install_cpython.sh /build_scripts/install_cpython.sh
RUN bash build_scripts/build.sh && rm -r build_scripts
FROM base as cuda
ARG BASE_CUDA_VERSION=10.2
# Install CUDA
ADD ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh
FROM base as intel
# MKL
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as magma
ARG BASE_CUDA_VERSION=10.2
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
FROM base as jni
# Install java jni header
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
FROM base as libpng
# Install libpng
ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM ${GPU_IMAGE} as common
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN yum install -y \
aclocal \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# Install LLVM version
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
COPY --from=python /opt/python/cp39-cp39/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=intel /opt/intel /opt/intel
COPY --from=patchelf /usr/local/bin/patchelf /usr/local/bin/patchelf
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
COPY --from=libpng /usr/local/bin/png* /usr/local/bin/
COPY --from=libpng /usr/local/bin/libpng* /usr/local/bin/
COPY --from=libpng /usr/local/include/png* /usr/local/include/
COPY --from=libpng /usr/local/include/libpng* /usr/local/include/
COPY --from=libpng /usr/local/lib/libpng* /usr/local/lib/
COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/lib/pkgconfig
FROM common as cpu_final
ARG BASE_CUDA_VERSION=10.1
ARG DEVTOOLSET_VERSION=9
# Install Anaconda
ADD ./common/install_conda_docker.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
ENV PATH /opt/conda/bin:$PATH
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# cmake is already installed inside the rocm base image, so remove if present
RUN rpm -e cmake || true
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake
# ninja
RUN yum install -y ninja-build
FROM cpu_final as cuda_final
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
RUN ln -sf /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda
ENV PATH=/usr/local/cuda/bin:$PATH
FROM cpu_final as rocm_final
ARG ROCM_VERSION=3.7
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
# Adding ROCM_PATH env var so that LoadHip.cmake (even with logic updated for ROCm6.0)
# find HIP works for ROCm5.7. Not needed for ROCm6.0 and above.
# Remove below when ROCm5.7 is not in support matrix anymore.
ENV ROCM_PATH /opt/rocm
ENV MKLROOT /opt/intel
# No need to install ROCm as base docker image should have full ROCm install
#ADD ./common/install_rocm.sh install_rocm.sh
#RUN ROCM_VERSION=${ROCM_VERSION} bash ./install_rocm.sh && rm install_rocm.sh
ADD ./common/install_rocm_drm.sh install_rocm_drm.sh
RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh
# cmake3 is needed for the MIOpen build
RUN ln -sf /usr/local/bin/cmake /usr/bin/cmake3
ADD ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
# Install AOTriton
COPY ./common/common_utils.sh common_utils.sh
COPY ./aotriton_version.txt aotriton_version.txt
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN bash ./install_aotriton.sh /opt/rocm && rm install_aotriton.sh aotriton_version.txt
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton
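
The removed Dockerfile above repeats the same three `sed` substitutions to point yum at vault.centos.org after the CentOS 7 EOL. A minimal sketch of their combined effect on a sample repo file follows; `patch_repo_file` and the sample file contents are hypothetical and exist only for illustration, while the three expressions are copied from the Dockerfile.

```shell
# Hypothetical helper: applies the Dockerfile's three substitutions, in the
# same order, to the file given as $1 and prints the result to stdout.
patch_repo_file() {
  sed -e 's/mirror.centos.org/vault.centos.org/g' \
      -e 's/^#.*baseurl=http/baseurl=http/g' \
      -e 's/^mirrorlist=http/#mirrorlist=http/g' \
      "$1"
}

# Assumed sample of a stock CentOS 7 repo definition.
sample=$(mktemp)
cat > "${sample}" <<'EOF'
[base]
name=CentOS-7 - Base
mirrorlist=http://mirrorlist.centos.org/?release=7&arch=x86_64&repo=os
#baseurl=http://mirror.centos.org/centos/7/os/x86_64/
EOF

patched=$(patch_repo_file "${sample}")
rm -f "${sample}"
printf '%s\n' "${patched}"
```

In this sample the `baseurl` line ends up uncommented and pointing at vault.centos.org, and the `mirrorlist` line ends up commented out, so yum stops consulting the retired mirror list.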


@ -1,153 +0,0 @@
# syntax = docker/dockerfile:experimental
ARG ROCM_VERSION=3.7
ARG BASE_CUDA_VERSION=10.2
ARG GPU_IMAGE=nvidia/cuda:${BASE_CUDA_VERSION}-devel-centos7
FROM quay.io/pypa/manylinux2014_x86_64 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which perl zlib-devel
RUN yum install -y yum-utils centos-release-scl sudo
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN yum install -y devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-gcc-gfortran devtoolset-7-binutils
ENV PATH=/opt/rh/devtoolset-7/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:$LD_LIBRARY_PATH
# cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
FROM base as cuda
ARG BASE_CUDA_VERSION=10.2
# Install CUDA
ADD ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh
FROM base as intel
# MKL
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as magma
ARG BASE_CUDA_VERSION=10.2
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
FROM base as jni
# Install java jni header
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
FROM base as libpng
# Install libpng
ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM ${GPU_IMAGE} as common
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN yum install -y \
aclocal \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# Install LLVM version
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=base /opt/python /opt/python
COPY --from=base /opt/_internal /opt/_internal
COPY --from=base /usr/local/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=intel /opt/intel /opt/intel
COPY --from=base /usr/local/bin/patchelf /usr/local/bin/patchelf
COPY --from=libpng /usr/local/bin/png* /usr/local/bin/
COPY --from=libpng /usr/local/bin/libpng* /usr/local/bin/
COPY --from=libpng /usr/local/include/png* /usr/local/include/
COPY --from=libpng /usr/local/include/libpng* /usr/local/include/
COPY --from=libpng /usr/local/lib/libpng* /usr/local/lib/
COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/lib/pkgconfig
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
FROM common as cpu_final
ARG BASE_CUDA_VERSION=10.2
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN yum install -y devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-gcc-gfortran devtoolset-7-binutils
ENV PATH=/opt/rh/devtoolset-7/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:$LD_LIBRARY_PATH
# cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
# ninja
RUN yum install -y http://repo.okay.com.mx/centos/7/x86_64/release/okay-release-1-1.noarch.rpm
RUN yum install -y ninja-build
FROM cpu_final as cuda_final
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
FROM common as rocm_final
ARG ROCM_VERSION=3.7
# Install ROCm
ADD ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh ${ROCM_VERSION} && rm install_rocm.sh
# cmake is already installed inside the rocm base image, but both 2 and 3 exist
# cmake3 is needed for the later MIOpen custom build, so that step is last.
RUN yum install -y cmake3 && \
rm -f /usr/bin/cmake && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh


@ -7,8 +7,8 @@ ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=11
RUN yum install -y sudo wget curl perl util-linux xz bzip2 git patch which perl zlib-devel yum-utils gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
ARG DEVTOOLSET_VERSION=13
RUN yum install -y sudo wget curl perl util-linux xz bzip2 git patch which perl zlib-devel yum-utils gcc-toolset-${DEVTOOLSET_VERSION}-gcc gcc-toolset-${DEVTOOLSET_VERSION}-gcc-c++ gcc-toolset-${DEVTOOLSET_VERSION}-gcc-gfortran gcc-toolset-${DEVTOOLSET_VERSION}-gdb
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
@ -33,10 +33,13 @@ RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
FROM base as cuda
ARG BASE_CUDA_VERSION=11.8
ARG BASE_CUDA_VERSION=12.6
# Install CUDA
ADD ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
COPY ./common/install_cusparselt.sh install_cusparselt.sh
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh install_nccl.sh ci_commit_pins/nccl-cu* install_cusparselt.sh
FROM base as intel
# MKL
@ -44,7 +47,7 @@ ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as magma
ARG BASE_CUDA_VERSION=10.2
ARG BASE_CUDA_VERSION=12.6
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
@ -61,7 +64,7 @@ ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM ${GPU_IMAGE} as common
ARG DEVTOOLSET_VERSION=11
ARG DEVTOOLSET_VERSION=13
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
@ -84,13 +87,12 @@ RUN yum install -y \
wget \
which \
xz \
gcc-toolset-${DEVTOOLSET_VERSION}-toolchain \
glibc-langpack-en
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
glibc-langpack-en \
gcc-toolset-${DEVTOOLSET_VERSION}-gcc \
gcc-toolset-${DEVTOOLSET_VERSION}-gcc-c++ \
gcc-toolset-${DEVTOOLSET_VERSION}-gcc-gfortran \
gcc-toolset-${DEVTOOLSET_VERSION}-gdb
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
@ -114,8 +116,8 @@ COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
FROM common as cpu_final
ARG BASE_CUDA_VERSION=11.8
ARG DEVTOOLSET_VERSION=11
ARG BASE_CUDA_VERSION=12.6
ARG DEVTOOLSET_VERSION=13
# Install Anaconda
ADD ./common/install_conda_docker.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
@ -154,11 +156,14 @@ ENV ROCM_PATH /opt/rocm
# and avoid 3.21.0 cmake+ninja issues with ninja inserting "-Wl,--no-as-needed" in LINK_FLAGS for static linker
RUN python3 -m pip install --upgrade pip && \
python3 -mpip install cmake==3.28.4
# replace the libdrm in /opt/amdgpu with custom amdgpu.ids lookup path
ADD ./common/install_rocm_drm.sh install_rocm_drm.sh
RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh
# ROCm 6.4 rocm-smi depends on system drm.h header
RUN yum install -y libdrm-devel
ENV MKLROOT /opt/intel
ADD ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh ${ROCM_VERSION} && rm install_rocm_magma.sh
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
@ -169,6 +174,6 @@ ENV XPU_DRIVER_TYPE ROLLING
RUN python3 -m pip install --upgrade pip && \
python3 -mpip install cmake==3.28.4
ADD ./common/install_xpu.sh install_xpu.sh
ENV XPU_VERSION 2025.0
ENV XPU_VERSION 2025.1
RUN bash ./install_xpu.sh && rm install_xpu.sh
RUN pushd /opt/_internal && tar -xJf static-libs-for-embedding-only.tar.xz && popd


@ -1,7 +1,6 @@
FROM quay.io/pypa/manylinux_2_28_aarch64 as base
# Graviton needs GCC 10 or above for the build. GCC12 is the default version in almalinux-8.
ARG GCCTOOLSET_VERSION=11
ARG GCCTOOLSET_VERSION=13
# Language variables
ENV LC_ALL=en_US.UTF-8
@ -36,7 +35,16 @@ RUN yum install -y \
yasm \
zstd \
sudo \
gcc-toolset-${GCCTOOLSET_VERSION}-toolchain
gcc-toolset-${GCCTOOLSET_VERSION}-gcc \
gcc-toolset-${GCCTOOLSET_VERSION}-gcc-c++ \
gcc-toolset-${GCCTOOLSET_VERSION}-gcc-gfortran \
gcc-toolset-${GCCTOOLSET_VERSION}-gdb
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
COPY ./common/install_ninja.sh install_ninja.sh
RUN if [ -n "${NINJA_VERSION}" ]; then bash ./install_ninja.sh; fi
RUN rm install_ninja.sh
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/bin:$PATH
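
The optional Ninja step above runs `install_ninja.sh` only when the `NINJA_VERSION` build argument is non-empty. A minimal sketch of that gate, assuming a hypothetical `maybe_install_ninja` function that stands in for the real installer and only reports what it would do:

```shell
# Hypothetical stand-in for `bash ./install_ninja.sh`; it prints the decision
# instead of installing anything.
maybe_install_ninja() {
  if [ -n "${NINJA_VERSION:-}" ]; then
    echo "installing ninja ${NINJA_VERSION}"
  else
    echo "keeping default ninja"
  fi
}

# Build argument not supplied: the installer is skipped.
unset NINJA_VERSION
default_result=$(maybe_install_ninja)

# Build argument supplied (1.12.1 is the value the build script passes for aarch64).
NINJA_VERSION=1.12.1
pinned_result=$(maybe_install_ninja)
unset NINJA_VERSION

printf '%s\n%s\n' "${default_result}" "${pinned_result}"
```

Because the `COPY` and the `rm` run unconditionally, images built without `NINJA_VERSION` should end up with the same files as before; only the pinned build pays for the extra install.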


@ -1,94 +0,0 @@
FROM quay.io/pypa/manylinux2014_aarch64 as base
# Graviton needs GCC 10 for the build
ARG DEVTOOLSET_VERSION=10
# Language variables
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
# Install needed OS packages. This is to support all
# the binary builds (torch, vision, audio, text, data)
RUN yum -y install epel-release
RUN yum -y update
RUN yum install -y \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm \
less \
zstd \
libgomp \
sudo \
devtoolset-${DEVTOOLSET_VERSION}-gcc \
devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ \
devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran \
devtoolset-${DEVTOOLSET_VERSION}-binutils
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
###############################################################################
# libgfortran.a hack
#
# libgfortran.a from quay.io/pypa/manylinux2014_aarch64 is not compiled with -fPIC.
# This causes __stack_chk_guard@@GLIBC_2.17 on pytorch build. To solve, get
# ubuntu's libgfortran.a which is compiled with -fPIC
# NOTE: Need a better way to get this library, as Ubuntu's package can be removed or changed by the vendor
###############################################################################
RUN cd ~/ \
&& curl -L -o ~/libgfortran-10-dev.deb http://ports.ubuntu.com/ubuntu-ports/pool/universe/g/gcc-10/libgfortran-10-dev_10.5.0-4ubuntu2_arm64.deb \
&& ar x ~/libgfortran-10-dev.deb \
&& tar --use-compress-program=unzstd -xvf data.tar.zst -C ~/ \
&& cp -f ~/usr/lib/gcc/aarch64-linux-gnu/10/libgfortran.a /opt/rh/devtoolset-10/root/usr/lib/gcc/aarch64-redhat-linux/10/
# install cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
FROM base as openblas
# Install openblas
ADD ./common/install_openblas.sh install_openblas.sh
RUN bash ./install_openblas.sh && rm install_openblas.sh
FROM openssl as final
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
COPY --from=openblas /opt/OpenBLAS/ /opt/OpenBLAS/
ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:$LD_LIBRARY_PATH


@ -1,7 +1,7 @@
FROM quay.io/pypa/manylinux_2_28_aarch64 as base
# Cuda ARM build needs gcc 11
ARG DEVTOOLSET_VERSION=11
ARG DEVTOOLSET_VERSION=13
# Language variables
ENV LC_ALL=en_US.UTF-8
@ -34,7 +34,10 @@ RUN yum install -y \
zstd \
libgomp \
sudo \
gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
gcc-toolset-${DEVTOOLSET_VERSION}-gcc \
gcc-toolset-${DEVTOOLSET_VERSION}-gcc-c++ \
gcc-toolset-${DEVTOOLSET_VERSION}-gcc-gfortran \
gcc-toolset-${DEVTOOLSET_VERSION}-gdb
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
@ -66,8 +69,11 @@ RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
FROM base as cuda
ARG BASE_CUDA_VERSION
# Install CUDA
ADD ./common/install_cuda_aarch64.sh install_cuda_aarch64.sh
RUN bash ./install_cuda_aarch64.sh ${BASE_CUDA_VERSION} && rm install_cuda_aarch64.sh
ADD ./common/install_cuda.sh install_cuda.sh
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./common/install_cusparselt.sh install_cusparselt.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh install_nccl.sh ci_commit_pins/nccl-cu* install_cusparselt.sh
FROM base as magma
ARG BASE_CUDA_VERSION


@ -42,6 +42,7 @@ RUN yum install -y \
llvm-devel \
libzstd-devel \
python3.12-devel \
python3.12-test \
python3.12-setuptools \
python3.12-pip \
python3-virtualenv \
@ -101,24 +102,33 @@ CMD ["/bin/bash"]
# install test dependencies:
# - grpcio requires system openssl, bundled crypto fails to build
# - ml_dtypes 0.4.0 requires some fixes provided in later commits to build
RUN dnf install -y \
protobuf-devel \
protobuf-c-devel \
protobuf-lite-devel \
wget \
patch
hdf5-devel \
python3-h5py \
git
RUN env GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=True pip3 install grpcio==1.65.4
RUN cd ~ && \
git clone https://github.com/jax-ml/ml_dtypes && \
cd ml_dtypes && \
git checkout v0.4.0 && \
RUN env GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=True pip3 install grpcio
# cmake-3.28.0 from pip for onnxruntime
RUN python3 -mpip install cmake==3.28.0
# build onnxruntime 1.21.0 from sources.
# it is not possible to build it from sources using pip,
# so just build it from upstream repository.
# h5py is dependency of onnxruntime_training.
# h5py==3.11.0 builds with hdf5-devel 1.10.5 from repository.
# install newest flatbuffers version first:
# for some reason old version is getting pulled in otherwise.
# packaging package is required for onnxruntime wheel build.
RUN pip3 install flatbuffers && \
pip3 install h5py==3.11.0 && \
pip3 install packaging && \
git clone https://github.com/microsoft/onnxruntime && \
cd onnxruntime && git checkout v1.21.0 && \
git submodule update --init --recursive && \
wget https://github.com/jax-ml/ml_dtypes/commit/b969f76914d6b30676721bc92bf0f6021a0d1321.patch && \
wget https://github.com/jax-ml/ml_dtypes/commit/d4e6d035ecda073eab8bcf60f4eef572ee7087e6.patch && \
patch -p1 < b969f76914d6b30676721bc92bf0f6021a0d1321.patch && \
patch -p1 < d4e6d035ecda073eab8bcf60f4eef572ee7087e6.patch && \
python3 setup.py bdist_wheel && \
pip3 install dist/*.whl && \
rm -rf ml_dtypes
./build.sh --config Release --parallel 0 --enable_pybind --build_wheel --enable_training --enable_training_apis --enable_training_ops --skip_tests --allow_running_as_root && \
pip3 install ./build/Linux/Release/dist/onnxruntime_training-*.whl && \
cd .. && /bin/rm -rf ./onnxruntime


@ -1,7 +1,7 @@
#!/usr/bin/env bash
# Script used only in CD pipeline
set -eou pipefail
set -exou pipefail
TOPDIR=$(git rev-parse --show-toplevel)
@ -9,151 +9,108 @@ image="$1"
shift
if [ -z "${image}" ]; then
echo "Usage: $0 IMAGE"
echo "Usage: $0 IMAGE:ARCHTAG"
exit 1
fi
DOCKER_IMAGE="pytorch/${image}"
# Go from imagename:tag to tag
DOCKER_TAG_PREFIX=$(echo "${image}" | awk -F':' '{print $2}')
DOCKER_REGISTRY="${DOCKER_REGISTRY:-docker.io}"
GPU_ARCH_VERSION=""
if [[ "${DOCKER_TAG_PREFIX}" == cuda* ]]; then
# extract cuda version from image name. e.g. manylinux2_28-builder:cuda12.8 returns 12.8
GPU_ARCH_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'cuda' '{print $2}')
elif [[ "${DOCKER_TAG_PREFIX}" == rocm* ]]; then
# extract rocm version from image name. e.g. manylinux2_28-builder:rocm6.2.4 returns 6.2.4
GPU_ARCH_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'rocm' '{print $2}')
fi
GPU_ARCH_TYPE=${GPU_ARCH_TYPE:-cpu}
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
MANY_LINUX_VERSION=${MANY_LINUX_VERSION:-}
DOCKERFILE_SUFFIX=${DOCKERFILE_SUFFIX:-}
WITH_PUSH=${WITH_PUSH:-}
case ${GPU_ARCH_TYPE} in
cpu)
case ${image} in
manylinux2_28-builder:cpu)
TARGET=cpu_final
DOCKER_TAG=cpu
GPU_IMAGE=centos:7
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=9"
;;
cpu-manylinux_2_28)
TARGET=cpu_final
DOCKER_TAG=cpu
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=13"
MANY_LINUX_VERSION="2_28"
;;
cpu-aarch64)
manylinux2_28_aarch64-builder:cpu-aarch64)
TARGET=final
DOCKER_TAG=cpu-aarch64
GPU_IMAGE=arm64v8/centos:7
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=10"
MANY_LINUX_VERSION="aarch64"
;;
cpu-aarch64-2_28)
TARGET=final
DOCKER_TAG=cpu-aarch64
GPU_IMAGE=arm64v8/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=13 --build-arg NINJA_VERSION=1.12.1"
MANY_LINUX_VERSION="2_28_aarch64"
;;
cpu-cxx11-abi)
manylinuxcxx11-abi-builder:cpu-cxx11-abi)
TARGET=final
DOCKER_TAG=cpu-cxx11-abi
GPU_IMAGE=""
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=9"
MANY_LINUX_VERSION="cxx11-abi"
;;
cpu-s390x)
manylinuxs390x-builder:cpu-s390x)
TARGET=final
DOCKER_TAG=cpu-s390x
GPU_IMAGE=s390x/almalinux:8
DOCKER_GPU_BUILD_ARG=""
MANY_LINUX_VERSION="s390x"
;;
cuda)
manylinux2_28-builder:cuda11*)
TARGET=cuda_final
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
# Keep this up to date with the minimum version of CUDA we currently support
GPU_IMAGE=centos:7
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=9"
;;
cuda-manylinux_2_28)
TARGET=cuda_final
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=11"
MANY_LINUX_VERSION="2_28"
;;
cuda-aarch64)
manylinux2_28-builder:cuda12*)
TARGET=cuda_final
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
GPU_IMAGE=arm64v8/centos:7
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=11"
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=13"
MANY_LINUX_VERSION="2_28"
;;
manylinuxaarch64-builder:cuda*)
TARGET=cuda_final
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=13"
MANY_LINUX_VERSION="aarch64"
DOCKERFILE_SUFFIX="_cuda_aarch64"
;;
rocm|rocm-manylinux_2_28)
manylinux2_28-builder:rocm*)
TARGET=rocm_final
DOCKER_TAG=rocm${GPU_ARCH_VERSION}
GPU_IMAGE=rocm/dev-centos-7:${GPU_ARCH_VERSION}-complete
DEVTOOLSET_VERSION="9"
if [ ${GPU_ARCH_TYPE} == "rocm-manylinux_2_28" ]; then
MANY_LINUX_VERSION="2_28"
DEVTOOLSET_VERSION="11"
GPU_IMAGE=rocm/dev-almalinux-8:${GPU_ARCH_VERSION}-complete
fi
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101"
MANY_LINUX_VERSION="2_28"
DEVTOOLSET_VERSION="11"
GPU_IMAGE=rocm/dev-almalinux-8:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
DOCKER_GPU_BUILD_ARG="--build-arg ROCM_VERSION=${GPU_ARCH_VERSION} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}"
;;
xpu)
manylinux2_28-builder:xpu)
TARGET=xpu_final
DOCKER_TAG=xpu
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"
MANY_LINUX_VERSION="2_28"
;;
*)
echo "ERROR: Unrecognized GPU_ARCH_TYPE: ${GPU_ARCH_TYPE}"
echo "ERROR: Unrecognized image name: ${image}"
exit 1
;;
esac
IMAGES=''
if [[ -n ${MANY_LINUX_VERSION} && -z ${DOCKERFILE_SUFFIX} ]]; then
DOCKERFILE_SUFFIX=_${MANY_LINUX_VERSION}
fi
(
set -x
if [ "$(uname -m)" != "s390x" ]; then
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
fi
DOCKER_BUILDKIT=1 docker build \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--target "${TARGET}" \
-t "${DOCKER_IMAGE}" \
$@ \
-f "${TOPDIR}/.ci/docker/manywheel/Dockerfile${DOCKERFILE_SUFFIX}" \
"${TOPDIR}/.ci/docker/"
)
GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}
GIT_BRANCH_NAME=${GITHUB_REF##*/}
GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}
DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE}-${GIT_BRANCH_NAME}
DOCKER_IMAGE_SHA_TAG=${DOCKER_IMAGE}-${GIT_COMMIT_SHA}
if [[ "${WITH_PUSH}" == true ]]; then
(
set -x
docker push "${DOCKER_IMAGE}"
if [[ -n ${GITHUB_REF} ]]; then
docker tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_BRANCH_TAG}
docker tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_SHA_TAG}
docker push "${DOCKER_IMAGE_BRANCH_TAG}"
docker push "${DOCKER_IMAGE_SHA_TAG}"
fi
)
# Only activate this if in CI
if [ "$(uname -m)" != "s390x" ] && [ -v CI ]; then
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
fi
tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
DOCKER_BUILDKIT=1 docker build \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--target "${TARGET}" \
-t "${tmp_tag}" \
$@ \
-f "${TOPDIR}/.ci/docker/manywheel/Dockerfile${DOCKERFILE_SUFFIX}" \
"${TOPDIR}/.ci/docker/"


@ -97,7 +97,7 @@ find /opt/_internal -type f -print0 \
| xargs -0 -n1 strip --strip-unneeded 2>/dev/null || true
# We do not need the Python test suites, or indeed the precompiled .pyc and
# .pyo files. Partially cribbed from:
# https://github.com/docker-library/python/blob/master/3.4/slim/Dockerfile
# https://github.com/docker-library/python/blob/master/3.4/slim/Dockerfile # @lint-ignore
find /opt/_internal \
\( -type d -a -name test -o -name tests \) \
-o \( -type f -a -name '*.pyc' -o -name '*.pyo' \) \


@ -2,8 +2,8 @@
# Helper utilities for build
# Script used only in CD pipeline
OPENSSL_DOWNLOAD_URL=https://www.openssl.org/source/old/1.1.1/
CURL_DOWNLOAD_URL=https://curl.askapache.com/download
OPENSSL_DOWNLOAD_URL=https://www.openssl.org/source/old/1.1.1/ # @lint-ignore
CURL_DOWNLOAD_URL=https://curl.se/download
AUTOCONF_DOWNLOAD_URL=https://ftp.gnu.org/gnu/autoconf


@ -30,10 +30,10 @@ dill==0.3.7
#Pinned versions: 0.3.7
#test that import: dynamo/test_replay_record.py test_dataloader.py test_datapipe.py test_serialization.py
expecttest==0.2.1
expecttest==0.3.0
#Description: method for writing tests where test framework auto populates
# the expected output based on previous runs
#Pinned versions: 0.2.1
#Pinned versions: 0.3.0
#test that import:
fbscribelogger==0.1.7
@ -41,11 +41,14 @@ fbscribelogger==0.1.7
#Pinned versions: 0.1.6
#test that import:
flatbuffers==2.0
flatbuffers==2.0 ; platform_machine != "s390x"
#Description: cross platform serialization library
#Pinned versions: 2.0
#test that import:
flatbuffers ; platform_machine == "s390x"
#Description: cross platform serialization library; Newer version is required on s390x for new python version
hypothesis==5.35.1
# Pin hypothesis to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136
#Description: advanced library for generating parametrized tests
@ -90,10 +93,10 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.13.0
mypy==1.14.0
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.10.0
#Pinned versions: 1.14.0
#test that import: test_typing.py, test_type_hints.py
networkx==2.8.8
@ -102,10 +105,10 @@ networkx==2.8.8
#Pinned versions: 2.8.8
#test that import: functorch
#ninja
#Description: build system. Note that it install from
#here breaks things so it is commented out
#Pinned versions: 1.10.0.post1
ninja==1.11.1.3
#Description: build system. Used in some tests. Used in build to generate build
#time tracing information
#Pinned versions: 1.11.1.3
#test that import: run_test.py, test_cpp_extensions_aot.py,test_determination.py
numba==0.49.0 ; python_version < "3.9"
@ -280,9 +283,9 @@ unittest-xml-reporting<=3.2.0,>=2.0.0
#test that import:
#lintrunner is supported on aarch64-linux only from 0.12.4 version
lintrunner==0.12.5
lintrunner==0.12.7
#Description: all about linters!
#Pinned versions: 0.12.5
#Pinned versions: 0.12.7
#test that import:
redis>=4.0.0
@ -294,7 +297,7 @@ ghstack==0.8.0
#Pinned versions: 0.8.0
#test that import:
jinja2==3.1.4
jinja2==3.1.6
#Description: jinja2 template engine
#Pinned versions: 3.1.4
#test that import:
@ -304,7 +307,7 @@ pytest-cpp==2.3.0
#Pinned versions: 2.3.0
#test that import:
z3-solver==4.12.2.0
z3-solver==4.12.6.0
#Description: The Z3 Theorem Prover Project
#Pinned versions:
#test that import:
@ -329,7 +332,7 @@ lxml==5.3.0
PyGithub==2.3.0
sympy==1.13.1 ; python_version >= "3.9"
sympy==1.13.3
#Description: Required by coremltools, also pinned in .github/requirements/pip-requirements-macOS.txt
#Pinned versions:
#test that import:
@ -339,7 +342,7 @@ onnx==1.17.0
#Pinned versions:
#test that import:
onnxscript==0.1.0.dev20240817
onnxscript==0.2.2
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
@ -353,7 +356,7 @@ parameterized==0.8.1
#Pinned versions: 1.24.0
#test that import: test_sac_estimator.py
pwlf==2.2.1 ; python_version >= "3.8"
pwlf==2.2.1
#Description: required for testing torch/distributed/_tools/sac_estimator.py
#Pinned versions: 2.2.1
#test that import: test_sac_estimator.py
@ -362,12 +365,20 @@ pwlf==2.2.1 ; python_version >= "3.8"
# To build PyTorch itself
astunparse
PyYAML
pyzstd
setuptools
ninja==1.11.1 ; platform_machine == "aarch64"
scons==4.5.2 ; platform_machine == "aarch64"
pulp==2.9.0 ; python_version >= "3.8"
pulp==2.9.0
#Description: required for testing ilp formulaiton under torch/distributed/_tools
#Pinned versions: 2.9.0
#test that import: test_sac_ilp.py
dataclasses_json==0.6.7
#Description: required for data pipeline and scripts under tools/stats
#Pinned versions: 0.6.7
#test that import:
cmake==4.0.0
#Description: required for building


@ -1,15 +1,20 @@
sphinx==5.3.0
#Description: This is used to generate PyTorch docs
#Pinned versions: 5.3.0
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@pytorch_sphinx_theme2#egg=pytorch_sphinx_theme2
# TODO: sphinxcontrib.katex 0.9.0 adds a local KaTeX server to speed up pre-rendering
# but it doesn't seem to work and hangs around idly. The initial thought is probably
# something related to Docker setup. We can investigate this later
sphinxcontrib.katex==0.8.6
#Description: This is used to generate PyTorch docs
#Pinned versions: 0.8.6
sphinxext-opengraph==0.9.1
#Description: This is used to generate PyTorch docs
#Pinned versions: 0.9.1
matplotlib==3.5.3
#Description: This is used to generate PyTorch docs
#Pinned versions: 3.5.3
@ -46,5 +51,6 @@ myst-nb==0.17.2
# The following are required to build torch.distributed.elastic.rendezvous.etcd* docs
python-etcd==0.4.5
sphinx-copybutton==0.5.0
sphinx-panels==0.4.1
sphinx-design==0.4.0
sphinxcontrib-mermaid==1.0.0
myst-parser==0.18.1


@ -1 +1 @@
3.2.0
3.3.0


@ -2,7 +2,7 @@ ARG UBUNTU_VERSION
ARG CUDA_VERSION
ARG IMAGE_NAME
FROM ${IMAGE_NAME}
FROM ${IMAGE_NAME} as base
ARG UBUNTU_VERSION
ARG CUDA_VERSION
@ -26,7 +26,6 @@ RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
ARG ANACONDA_PYTHON_VERSION
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
ARG CONDA_CMAKE
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
@ -43,20 +42,6 @@ ARG CLANG_VERSION
COPY ./common/install_clang.sh install_clang.sh
RUN bash ./install_clang.sh && rm install_clang.sh
# (optional) Install protobuf for ONNX
ARG PROTOBUF
COPY ./common/install_protobuf.sh install_protobuf.sh
RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi
RUN rm install_protobuf.sh
ENV INSTALLED_PROTOBUF ${PROTOBUF}
# (optional) Install database packages like LMDB and LevelDB
ARG DB
COPY ./common/install_db.sh install_db.sh
RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
RUN rm install_db.sh
ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
@ -90,21 +75,21 @@ COPY ci_commit_pins/timm.txt timm.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
ARG TRITON
FROM base as triton-builder
# Install triton, this needs to be done before sccache because the latter will
# try to reach out to S3, which docker build runners don't have access
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton.txt triton.txt
COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt
RUN bash ./install_triton.sh
FROM base as final
COPY --from=triton-builder /opt/triton /opt/triton
RUN if [ -n "${TRITON}" ]; then pip install /opt/triton/*.whl; chown -R jenkins:jenkins /opt/conda; fi
RUN rm -rf /opt/triton
ARG HALIDE
# Build and install halide
@ -159,6 +144,16 @@ COPY ./common/install_cusparselt.sh install_cusparselt.sh
RUN bash install_cusparselt.sh
RUN rm install_cusparselt.sh
# Install NCCL
ARG CUDA_VERSION
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
RUN bash install_nccl.sh
RUN rm install_nccl.sh /ci_commit_pins/nccl-cu*
ENV USE_SYSTEM_NCCL=1
ENV NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
ENV NCCL_LIB_DIR="/usr/local/cuda/lib64/"
# Install CUDSS
ARG CUDA_VERSION
COPY ./common/install_cudss.sh install_cudss.sh


@ -14,19 +14,17 @@ ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
COPY ./common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh
# Install clang
ARG LLVMDEV
ARG CLANG_VERSION
COPY ./common/install_clang.sh install_clang.sh
RUN bash ./install_clang.sh && rm install_clang.sh
# Install user
COPY ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
# Install katex
ARG KATEX
COPY ./common/install_docs_reqs.sh install_docs_reqs.sh
RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
@ -39,19 +37,10 @@ ARG GCC_VERSION
COPY ./common/install_gcc.sh install_gcc.sh
RUN bash ./install_gcc.sh && rm install_gcc.sh
# (optional) Install protobuf for ONNX
ARG PROTOBUF
COPY ./common/install_protobuf.sh install_protobuf.sh
RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi
RUN rm install_protobuf.sh
ENV INSTALLED_PROTOBUF ${PROTOBUF}
# (optional) Install database packages like LMDB and LevelDB
ARG DB
COPY ./common/install_db.sh install_db.sh
RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
RUN rm install_db.sh
ENV INSTALLED_DB ${DB}
# Install clang
ARG CLANG_VERSION
COPY ./common/install_clang.sh install_clang.sh
RUN bash ./install_clang.sh && rm install_clang.sh
# (optional) Install vision packages like OpenCV
ARG VISION
@ -66,7 +55,7 @@ COPY ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh
RUN rm install_rocm.sh
COPY ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh ${ROCM_VERSION}
RUN rm install_rocm_magma.sh
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
@ -85,11 +74,31 @@ COPY ./common/install_amdsmi.sh install_amdsmi.sh
RUN bash ./install_amdsmi.sh
RUN rm install_amdsmi.sh
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
# (optional) Install UCC
ARG UCX_COMMIT
ARG UCC_COMMIT
ENV UCX_COMMIT $UCX_COMMIT
ENV UCC_COMMIT $UCC_COMMIT
ENV UCX_HOME /usr
ENV UCC_HOME /usr
ADD ./common/install_ucc.sh install_ucc.sh
RUN if [ -n "${UCX_COMMIT}" ] && [ -n "${UCC_COMMIT}" ]; then bash ./install_ucc.sh; fi
RUN rm install_ucc.sh
COPY ./common/install_openssl.sh install_openssl.sh
ENV OPENSSL_ROOT_DIR /opt/openssl
RUN bash ./install_openssl.sh
ENV OPENSSL_DIR /opt/openssl
ARG INDUCTOR_BENCHMARKS
ARG ANACONDA_PYTHON_VERSION
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt
COPY ci_commit_pins/timm.txt timm.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
@ -107,18 +116,17 @@ COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt
# Install AOTriton
COPY ./aotriton_version.txt aotriton_version.txt
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN ["/bin/bash", "-c", "./install_aotriton.sh /opt/rocm && rm -rf install_aotriton.sh aotriton_version.txt common_utils.sh"]
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
RUN bash ./install_cache.sh && rm install_cache.sh
# Install Open MPI for ROCm
COPY ./common/install_openmpi.sh install_openmpi.sh
RUN if [ -n "${CUDA_VERSION}" ]; then bash install_openmpi.sh; fi
RUN rm install_openmpi.sh
# Include BUILD_ENVIRONMENT environment variable in image
ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}


@ -28,7 +28,6 @@ RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ARG DOCS
ARG BUILD_ENVIRONMENT
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
@ -77,13 +76,6 @@ COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-xpu.txt triton_version.txt
# (optional) Install database packages like LMDB and LevelDB
ARG DB
COPY ./common/install_db.sh install_db.sh
RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
RUN rm install_db.sh
ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
@ -91,12 +83,6 @@ RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh cache_vision_models.sh common_utils.sh
ENV INSTALLED_VISION ${VISION}
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
COPY ./common/install_ninja.sh install_ninja.sh


@ -1,6 +1,6 @@
ARG UBUNTU_VERSION
FROM ubuntu:${UBUNTU_VERSION}
FROM ubuntu:${UBUNTU_VERSION} as base
ARG UBUNTU_VERSION
@ -28,7 +28,6 @@ RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ARG DOCS
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
@ -52,9 +51,17 @@ RUN bash ./install_lcov.sh && rm install_lcov.sh
# Install cuda and cudnn
ARG CUDA_VERSION
COPY ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
COPY ./common/install_cusparselt.sh install_cusparselt.sh
RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh install_nccl.sh /ci_commit_pins/nccl-cu* install_cusparselt.sh
ENV DESIRED_CUDA ${CUDA_VERSION}
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH
# No effect if cuda not installed
ENV USE_SYSTEM_NCCL=1
ENV NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
ENV NCCL_LIB_DIR="/usr/local/cuda/lib64/"
# (optional) Install UCC
ARG UCX_COMMIT
@ -67,20 +74,6 @@ ADD ./common/install_ucc.sh install_ucc.sh
RUN if [ -n "${UCX_COMMIT}" ] && [ -n "${UCC_COMMIT}" ]; then bash ./install_ucc.sh; fi
RUN rm install_ucc.sh
# (optional) Install protobuf for ONNX
ARG PROTOBUF
COPY ./common/install_protobuf.sh install_protobuf.sh
RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi
RUN rm install_protobuf.sh
ENV INSTALLED_PROTOBUF ${PROTOBUF}
# (optional) Install database packages like LMDB and LevelDB
ARG DB
COPY ./common/install_db.sh install_db.sh
RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
RUN rm install_db.sh
ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
@ -88,24 +81,6 @@ RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh cache_vision_models.sh common_utils.sh
ENV INSTALLED_VISION ${VISION}
# (optional) Install Vulkan SDK
ARG VULKAN_SDK_VERSION
COPY ./common/install_vulkan_sdk.sh install_vulkan_sdk.sh
RUN if [ -n "${VULKAN_SDK_VERSION}" ]; then bash ./install_vulkan_sdk.sh; fi
RUN rm install_vulkan_sdk.sh
# (optional) Install swiftshader
ARG SWIFTSHADER
COPY ./common/install_swiftshader.sh install_swiftshader.sh
RUN if [ -n "${SWIFTSHADER}" ]; then bash ./install_swiftshader.sh; fi
RUN rm install_swiftshader.sh
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
COPY ./common/install_ninja.sh install_ninja.sh
@ -127,20 +102,21 @@ RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_d
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
ARG TRITON
# Install triton, this needs to be done before sccache because the latter will
# try to reach out to S3, which docker build runners don't have access
ARG TRITON_CPU
# Create a separate stage for building Triton and Triton-CPU. install_triton
# will check for the presence of env vars
FROM base as triton-builder
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton.txt triton.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt
ARG TRITON_CPU
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton-cpu.txt triton-cpu.txt
RUN if [ -n "${TRITON_CPU}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-cpu.txt
RUN bash ./install_triton.sh
FROM base as final
COPY --from=triton-builder /opt/triton /opt/triton
RUN if [ -n "${TRITON}" ] || [ -n "${TRITON_CPU}" ]; then pip install /opt/triton/*.whl; chown -R jenkins:jenkins /opt/conda; fi
RUN rm -rf /opt/triton
ARG EXECUTORCH
# Build and install executorch

2
.ci/magma-rocm/.gitignore vendored Normal file

@ -0,0 +1,2 @@
output/
magma-rocm*/

35
.ci/magma-rocm/Makefile Normal file

@ -0,0 +1,35 @@
SHELL=/usr/bin/env bash
DOCKER_CMD ?= docker
DESIRED_ROCM ?= 6.4
DESIRED_ROCM_SHORT = $(subst .,,$(DESIRED_ROCM))
PACKAGE_NAME = magma-rocm
# inherit this from underlying docker image, do not pass this env var to docker
#PYTORCH_ROCM_ARCH ?= gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201
DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
-v $(shell git rev-parse --show-toplevel)/.ci:/builder \
-w /builder \
-e PACKAGE_NAME=${PACKAGE_NAME}${DESIRED_ROCM_SHORT} \
-e DESIRED_ROCM=${DESIRED_ROCM} \
"pytorch/almalinux-builder:rocm${DESIRED_ROCM}" \
magma-rocm/build_magma.sh
.PHONY: all
all: magma-rocm64
all: magma-rocm63
.PHONY:
clean:
$(RM) -r magma-*
$(RM) -r output
.PHONY: magma-rocm64
magma-rocm64: DESIRED_ROCM := 6.4
magma-rocm64:
$(DOCKER_RUN)
.PHONY: magma-rocm63
magma-rocm63: DESIRED_ROCM := 6.3
magma-rocm63:
$(DOCKER_RUN)

48
.ci/magma-rocm/README.md Normal file

@ -0,0 +1,48 @@
# Magma ROCm
This folder contains the scripts and configurations to build libmagma.so, linked for various versions of ROCm.
## Building
Look in the `Makefile` for available targets to build. To build any target, for example `magma-rocm63`, run
```
# Using `docker`
make magma-rocm63
# Using `podman`
DOCKER_CMD=podman make magma-rocm63
```
This spawns a `pytorch/manylinux-rocm<version>` docker image, which has the required `devtoolset` and ROCm versions installed.
Within the docker image, it runs `build_magma.sh` with the correct environment variables set, which package the necessary files
into a tarball, with the following structure:
```
.
├── include # header files
├── lib # libmagma.so
├── info
│ ├── licenses # license file
│ └── recipe # build script
```
More specifically, `build_magma.sh` copies over the relevant files from the `package_files` directory depending on the ROCm version.
Outputted binaries should be in the `output` folder.
## Pushing
Packages can be uploaded to an S3 bucket using:
```
aws s3 cp output/*/magma-cuda*.bz2 <bucket-with-path>
```
If you do not have upload permissions, please ping @seemethere or @soumith to gain access
## New versions
New ROCm versions can be added by creating a new make target with the next desired version. For ROCm version N.n, the target should be named `magma-rocmNn`.
Make sure to edit the appropriate environment variables (e.g., DESIRED_ROCM) in the `Makefile` accordingly. Remember also to check `build_magma.sh` to ensure the logic for copying over the files remains correct.

42
.ci/magma-rocm/build_magma.sh Executable file

@ -0,0 +1,42 @@
#!/usr/bin/env bash
set -eou pipefail
# Environment variables
# The script expects DESIRED_CUDA and PACKAGE_NAME to be set
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
# Version 2.7.2 + ROCm related updates
MAGMA_VERSION=a1625ff4d9bc362906bd01f805dbbe12612953f6
# Folders for the build
PACKAGE_FILES=${ROOT_DIR}/magma-rocm/package_files # metadata
PACKAGE_DIR=${ROOT_DIR}/magma-rocm/${PACKAGE_NAME} # build workspace
PACKAGE_OUTPUT=${ROOT_DIR}/magma-rocm/output # where tarballs are stored
PACKAGE_BUILD=${PACKAGE_DIR} # where the content of the tarball is prepared
PACKAGE_RECIPE=${PACKAGE_BUILD}/info/recipe
PACKAGE_LICENSE=${PACKAGE_BUILD}/info/licenses
mkdir -p ${PACKAGE_DIR} ${PACKAGE_OUTPUT}/linux-64 ${PACKAGE_BUILD} ${PACKAGE_RECIPE} ${PACKAGE_LICENSE}
# Fetch magma sources and verify checksum
pushd ${PACKAGE_DIR}
git clone https://bitbucket.org/icl/magma.git
pushd magma
git checkout ${MAGMA_VERSION}
popd
popd
# build
pushd ${PACKAGE_DIR}/magma
# The build.sh script expects to be executed from the sources root folder
INSTALL_DIR=${PACKAGE_BUILD} ${PACKAGE_FILES}/build.sh
popd
# Package recipe, license and tarball
# Folder and package name are backward compatible for the build workflow
cp ${PACKAGE_FILES}/build.sh ${PACKAGE_RECIPE}/build.sh
cp ${PACKAGE_DIR}/magma/COPYRIGHT ${PACKAGE_LICENSE}/COPYRIGHT
pushd ${PACKAGE_BUILD}
tar cjf ${PACKAGE_OUTPUT}/linux-64/${PACKAGE_NAME}-${MAGMA_VERSION}-1.tar.bz2 include lib info
echo Built in ${PACKAGE_OUTPUT}/linux-64/${PACKAGE_NAME}-${MAGMA_VERSION}-1.tar.bz2
popd


@ -0,0 +1,38 @@
# Magma build scripts need `python`
ln -sf /usr/bin/python3 /usr/bin/python
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
almalinux)
yum install -y gcc-gfortran
;;
*)
echo "No preinstalls to build magma..."
;;
esac
MKLROOT=${MKLROOT:-/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION}
cp make.inc-examples/make.inc.hip-gcc-mkl make.inc
echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc
if [[ -f "${MKLROOT}/lib/libmkl_core.a" ]]; then
echo 'LIB = -Wl,--start-group -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -Wl,--end-group -lpthread -lstdc++ -lm -lgomp -lhipblas -lhipsparse' >> make.inc
fi
echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib -ldl' >> make.inc
echo 'DEVCCFLAGS += --gpu-max-threads-per-block=256' >> make.inc
export PATH="${PATH}:/opt/rocm/bin"
if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then
amdgpu_targets=`echo $PYTORCH_ROCM_ARCH | sed 's/;/ /g'`
else
amdgpu_targets=`rocm_agent_enumerator | grep -v gfx000 | sort -u | xargs`
fi
for arch in $amdgpu_targets; do
echo "DEVCCFLAGS += --offload-arch=$arch" >> make.inc
done
# hipcc with openmp flag may cause isnan() on __device__ not to be found; depending on context, compiler may attempt to match with host definition
sed -i 's/^FOPENMP/#FOPENMP/g' make.inc
make -f make.gen.hipMAGMA -j $(nproc)
LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT="${MKLROOT}"
make testing/testing_dgemm -j $(nproc) MKLROOT="${MKLROOT}"
cp -R lib ${INSTALL_DIR}
cp -R include ${INSTALL_DIR}


@ -12,13 +12,12 @@ DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
-e PACKAGE_NAME=${PACKAGE_NAME}${DESIRED_CUDA_SHORT} \
-e DESIRED_CUDA=${DESIRED_CUDA} \
-e CUDA_ARCH_LIST="${CUDA_ARCH_LIST}" \
"pytorch/manylinux-builder:cuda${DESIRED_CUDA}-main" \
"pytorch/almalinux-builder:cuda${DESIRED_CUDA}-main" \
magma/build_magma.sh
.PHONY: all
all: magma-cuda128
all: magma-cuda126
all: magma-cuda124
all: magma-cuda121
all: magma-cuda118
.PHONY:
@ -26,21 +25,17 @@ clean:
$(RM) -r magma-*
$(RM) -r output
.PHONY: magma-cuda128
magma-cuda128: DESIRED_CUDA := 12.8
magma-cuda128: CUDA_ARCH_LIST += -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120
magma-cuda128:
$(DOCKER_RUN)
.PHONY: magma-cuda126
magma-cuda126: DESIRED_CUDA := 12.6
magma-cuda126:
$(DOCKER_RUN)
.PHONY: magma-cuda124
magma-cuda124: DESIRED_CUDA := 12.4
magma-cuda124:
$(DOCKER_RUN)
.PHONY: magma-cuda121
magma-cuda121: DESIRED_CUDA := 12.1
magma-cuda121:
$(DOCKER_RUN)
.PHONY: magma-cuda118
magma-cuda118: DESIRED_CUDA := 11.8
magma-cuda118: CUDA_ARCH_LIST += -gencode arch=compute_37,code=sm_37


@ -111,12 +111,6 @@ case ${DESIRED_PYTHON} in
;;
esac
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
export _GLIBCXX_USE_CXX11_ABI=1
else
export _GLIBCXX_USE_CXX11_ABI=0
fi
if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
echo "Calling build_amd.py at $(date)"
python tools/amd_build/build_amd.py
@ -209,12 +203,6 @@ if [[ -n "$BUILD_PYTHONLESS" ]]; then
mkdir -p /tmp/$LIBTORCH_HOUSE_DIR
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
LIBTORCH_ABI="cxx11-abi-"
else
LIBTORCH_ABI=
fi
zip -rq /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip libtorch
cp /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip \
/tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-latest.zip
@ -333,8 +321,8 @@ for pkg in /$WHEELHOUSE_DIR/torch_no_python*.whl /$WHEELHOUSE_DIR/torch*linux*.w
# ROCm workaround for roctracer dlopens
if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
patchedpath=$(fname_without_so_number $destpath)
# Keep the so number for XPU dependencies
elif [[ "$DESIRED_CUDA" == *"xpu"* ]]; then
# Keep the so number for XPU dependencies and libgomp.so.1 to avoid twice load
elif [[ "$DESIRED_CUDA" == *"xpu"* || "$filename" == "libgomp.so.1" ]]; then
patchedpath=$destpath
else
patchedpath=$(fname_with_sha256 $destpath)


@ -14,6 +14,7 @@ export USE_CUDA_STATIC_LINK=1
export INSTALL_TEST=0 # dont install test binaries into site-packages
export USE_CUPTI_SO=0
export USE_CUSPARSELT=${USE_CUSPARSELT:-1} # Enable if not disabled by libtorch build
export USE_CUFILE=${USE_CUFILE:-1}
# Keep an array of cmake variables to add to
if [[ -z "$CMAKE_ARGS" ]]; then
@ -43,13 +44,6 @@ if [[ -n "$DESIRED_CUDA" ]]; then
fi
fi
echo "Using CUDA $CUDA_VERSION as determined by DESIRED_CUDA"
# There really has to be a better way to do this - eli
# Possibly limiting builds to specific cuda versions be delimiting images would be a choice
if [[ "$OS_NAME" == *"Ubuntu"* ]]; then
echo "Switching to CUDA version ${DESIRED_CUDA}"
/builder/conda/switch_cuda_version.sh "${DESIRED_CUDA}"
fi
else
CUDA_VERSION=$(nvcc --version|grep release|cut -f5 -d" "|cut -f1 -d",")
echo "CUDA $CUDA_VERSION Detected"
@ -59,23 +53,15 @@ cuda_version_nodot=$(echo $CUDA_VERSION | tr -d '.')
TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6"
case ${CUDA_VERSION} in
12.8)
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX" #removing sm_50-sm_70 as these architectures are deprecated in CUDA 12.8 and will be removed in future releases
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
12.6)
if [[ "$GPU_ARCH_TYPE" = "cuda-aarch64" ]]; then
TORCH_CUDA_ARCH_LIST="9.0"
else
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0+PTX"
fi
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
12.4)
if [[ "$GPU_ARCH_TYPE" = "cuda-aarch64" ]]; then
TORCH_CUDA_ARCH_LIST="9.0"
else
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
fi
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
12.1)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
@ -133,7 +119,16 @@ if [[ $USE_CUSPARSELT == "1" && $CUDA_VERSION == "11.8" ]]; then
)
fi
if [[ $CUDA_VERSION == "12.4" || $CUDA_VERSION == "12.6" ]]; then
# Turn USE_CUFILE off for CUDA 11.8, 12.4 since nvidia-cufile-cu11 and 1.9.0.20 are
# not available in PYPI
if [[ $CUDA_VERSION == "11.8" || $CUDA_VERSION == "12.4" ]]; then
export USE_CUFILE=0
fi
# CUDA_VERSION 12.4, 12.6, 12.8
if [[ $CUDA_VERSION == 12* ]]; then
export USE_STATIC_CUDNN=0
# Try parallelizing nvcc as well
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
@ -174,6 +169,16 @@ if [[ $CUDA_VERSION == "12.4" || $CUDA_VERSION == "12.6" ]]; then
"libnvrtc.so.12"
"libnvrtc-builtins.so"
)
if [[ $USE_CUFILE == 1 ]]; then
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcufile.so.0"
"/usr/local/cuda/lib64/libcufile_rdma.so.1"
)
DEPS_SONAME+=(
"libcufile.so.0"
"libcufile_rdma.so.1"
)
fi
else
echo "Using nvidia libs from pypi."
CUDA_RPATHS=(
@ -190,6 +195,11 @@ if [[ $CUDA_VERSION == "12.4" || $CUDA_VERSION == "12.6" ]]; then
'$ORIGIN/../../nvidia/nccl/lib'
'$ORIGIN/../../nvidia/nvtx/lib'
)
if [[ $USE_CUFILE == 1 ]]; then
CUDA_RPATHS+=(
'$ORIGIN/../../nvidia/cufile/lib'
)
fi
CUDA_RPATHS=$(IFS=: ; echo "${CUDA_RPATHS[*]}")
export C_SO_RPATH=$CUDA_RPATHS':$ORIGIN:$ORIGIN/lib'
export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'
@ -275,7 +285,7 @@ else
exit 1
fi
# builder/test.sh requires DESIRED_CUDA to know what tests to exclude
# run_tests.sh requires DESIRED_CUDA to know what tests to exclude
export DESIRED_CUDA="$cuda_version_nodot"
# Switch `/usr/local/cuda` to the desired CUDA version


@ -95,12 +95,6 @@ python setup.py clean
retry pip install -qr requirements.txt
retry pip install -q numpy==2.0.1
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
export _GLIBCXX_USE_CXX11_ABI=1
else
export _GLIBCXX_USE_CXX11_ABI=0
fi
if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
echo "Calling build_amd.py at $(date)"
python tools/amd_build/build_amd.py
@ -169,12 +163,6 @@ fi
)
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
LIBTORCH_ABI="cxx11-abi-"
else
LIBTORCH_ABI=
fi
(
set -x


@ -118,7 +118,7 @@ if [[ "$OS_NAME" == *"CentOS Linux"* || "$OS_NAME" == *"AlmaLinux"* ]]; then
fi
LIBDRM_PATH="/opt/amdgpu/lib64/libdrm.so.2"
LIBDRM_AMDGPU_PATH="/opt/amdgpu/lib64/libdrm_amdgpu.so.1"
if [[ $ROCM_INT -ge 60100 ]]; then
if [[ $ROCM_INT -ge 60100 && $ROCM_INT -lt 60300 ]]; then
# Below libs are direct dependencies of libhipsolver
LIBSUITESPARSE_CONFIG_PATH="/lib64/libsuitesparseconfig.so.4"
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
@ -151,7 +151,7 @@ elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
fi
LIBDRM_PATH="/usr/lib/x86_64-linux-gnu/libdrm.so.2"
LIBDRM_AMDGPU_PATH="/usr/lib/x86_64-linux-gnu/libdrm_amdgpu.so.1"
if [[ $ROCM_INT -ge 60100 ]]; then
if [[ $ROCM_INT -ge 60100 && $ROCM_INT -lt 60300 ]]; then
# Below libs are direct dependencies of libhipsolver
LIBCHOLMOD_PATH="/lib/x86_64-linux-gnu/libcholmod.so.3"
# Below libs are direct dependencies of libcholmod
@ -186,15 +186,6 @@ do
OS_SO_FILES[${#OS_SO_FILES[@]}]=$file_name # Append lib to array
done
# FIXME: Temporary until https://github.com/pytorch/pytorch/pull/137443 lands
# Install AOTriton
if [ -e ${PYTORCH_ROOT}/.ci/docker/aotriton_version.txt ]; then
cp -a ${PYTORCH_ROOT}/.ci/docker/aotriton_version.txt aotriton_version.txt
bash ${PYTORCH_ROOT}/.ci/docker/common/install_aotriton.sh ${ROCM_HOME} && rm aotriton_version.txt
export AOTRITON_INSTALLED_PREFIX=${ROCM_HOME}/aotriton
ROCM_SO_FILES+=("libaotriton_v2.so")
fi
# rocBLAS library files
ROCBLAS_LIB_SRC=$ROCM_HOME/lib/rocblas/library
ROCBLAS_LIB_DST=lib/rocblas/library
@ -266,20 +257,6 @@ RCCL_SHARE_FILES=($(ls $RCCL_SHARE_SRC))
DEPS_AUX_SRCLIST+=(${RCCL_SHARE_FILES[@]/#/$RCCL_SHARE_SRC/})
DEPS_AUX_DSTLIST+=(${RCCL_SHARE_FILES[@]/#/$RCCL_SHARE_DST/})
# PyTorch 2.6+ (AOTriton 0.8b+)
# AKS = "AOTriton Kernel Storage", a file format to store GPU kernels compactly
if (( $(echo "${PYTORCH_VERSION} 2.6" | awk '{print ($1 >= $2)}') )); then
LIBAOTRITON_DIR=$(find "$ROCM_HOME/lib/" -name "libaotriton_v2.so" -printf '%h\n')
if [[ -z ${LIBAOTRITON_DIR} ]]; then
LIBAOTRITON_DIR=$(find "$ROCM_HOME/" -name "libaotriton_v2.so" -printf '%h\n')
fi
AKS_FILES=($(find "${LIBAOTRITON_DIR}/aotriton.images" -type f -name '*.aks?' -printf '%P\n'))
AKS_SRC="${LIBAOTRITON_DIR}/aotriton.images"
AKS_DST="lib/aotriton.images"
DEPS_AUX_SRCLIST+=(${AKS_FILES[@]/#/${AKS_SRC}/})
DEPS_AUX_DSTLIST+=(${AKS_FILES[@]/#/${AKS_DST}/})
fi
echo "PYTORCH_ROCM_ARCH: ${PYTORCH_ROCM_ARCH}"
SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"
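
The changed conditions above restrict the extra libhipsolver dependencies to a half-open version window, `60100 <= ROCM_INT < 60300`. The diff does not show how `ROCM_INT` is computed, so the sketch below assumes a `major*10000 + minor*100 + patch` encoding; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: encode "major.minor[.patch]" as an integer.
# Assumes at most three numeric components; 10# forces base 10 so "08" is not read as octal.
rocm_version_to_int() {
    local major minor patch
    IFS=. read -r major minor patch <<< "$1"
    echo $(( 10#$major * 10000 + 10#${minor:-0} * 100 + 10#${patch:-0} ))
}

# Hypothetical predicate mirroring the range check in the diff.
needs_hipsolver_deps() {
    local v
    v=$(rocm_version_to_int "$1")
    [[ $v -ge 60100 && $v -lt 60300 ]]
}

for version in 6.0.2 6.1.0 6.2.4 6.3 6.4.1; do
    if needs_hipsolver_deps "$version"; then
        echo "$version: bundle extra libhipsolver dependencies"
    else
        echo "$version: skip"
    fi
done
```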


@ -20,7 +20,11 @@ fi
source /opt/intel/oneapi/compiler/latest/env/vars.sh
source /opt/intel/oneapi/pti/latest/env/vars.sh
source /opt/intel/oneapi/umf/latest/env/vars.sh
source /opt/intel/oneapi/ccl/latest/env/vars.sh
source /opt/intel/oneapi/mpi/latest/env/vars.sh
export USE_STATIC_MKL=1
export USE_ONEMKL=1
export USE_XCCL=1
WHEELHOUSE_DIR="wheelhousexpu"
LIBTORCH_HOUSE_DIR="libtorch_housexpu"


@ -10,5 +10,3 @@ example: `py2-cuda9.0-cudnn7-ubuntu16.04`. The Docker images that are
built on Jenkins and are used in triggered builds already have this
environment variable set in their manifest. Also see
`./docker/jenkins/*/Dockerfile` and search for `BUILD_ENVIRONMENT`.
Our Jenkins installation is located at https://ci.pytorch.org/jenkins/.


@ -35,7 +35,7 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
fi
if [[ "$BUILD_ENVIRONMENT" == *cuda11* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *cuda11.3* && "$BUILD_ENVIRONMENT" != *clang* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *clang* ]]; then
# TODO: there is a linking issue when building with UCC using clang,
# disable it for now, to be fixed later.
# TODO: disable UCC temporarily to enable CUDA 12.1 in CI
@ -99,30 +99,6 @@ if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then
export ACL_ROOT_DIR=/ComputeLibrary
fi
if [[ "$BUILD_ENVIRONMENT" == *libtorch* ]]; then
POSSIBLE_JAVA_HOMES=()
POSSIBLE_JAVA_HOMES+=(/usr/local)
POSSIBLE_JAVA_HOMES+=(/usr/lib/jvm/java-8-openjdk-amd64)
POSSIBLE_JAVA_HOMES+=(/Library/Java/JavaVirtualMachines/*.jdk/Contents/Home)
# Add the Windows-specific JNI
POSSIBLE_JAVA_HOMES+=("$PWD/.circleci/windows-jni/")
for JH in "${POSSIBLE_JAVA_HOMES[@]}" ; do
if [[ -e "$JH/include/jni.h" ]] ; then
# Skip if we're not on Windows but haven't found a JAVA_HOME
if [[ "$JH" == "$PWD/.circleci/windows-jni/" && "$OSTYPE" != "msys" ]] ; then
break
fi
echo "Found jni.h under $JH"
export JAVA_HOME="$JH"
export BUILD_JNI=ON
break
fi
done
if [ -z "$JAVA_HOME" ]; then
echo "Did not find jni.h"
fi
fi
# Use special scripts for Android builds
if [[ "${BUILD_ENVIRONMENT}" == *-android* ]]; then
export ANDROID_NDK=/opt/ndk
@ -171,8 +147,15 @@ fi
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# shellcheck disable=SC1091
source /opt/intel/oneapi/compiler/latest/env/vars.sh
# shellcheck disable=SC1091
source /opt/intel/oneapi/ccl/latest/env/vars.sh
# shellcheck disable=SC1091
source /opt/intel/oneapi/mpi/latest/env/vars.sh
# Enable XCCL build
export USE_XCCL=1
# XPU kineto feature dependencies are not fully ready, disable kineto build as temp WA
export USE_KINETO=0
export TORCH_XPU_ARCH_LIST=pvc
fi
# sccache will fail for CUDA builds if all cores are used for compiling
@ -191,7 +174,7 @@ fi
# We only build FlashAttention files for CUDA 8.0+, and they require large amounts of
# memory to build and will OOM
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ 1 -eq $(echo "${TORCH_CUDA_ARCH_LIST} >= 8.0" | bc) ]]; then
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ 1 -eq $(echo "${TORCH_CUDA_ARCH_LIST} >= 8.0" | bc) ]] && [ -z "$MAX_JOBS_OVERRIDE" ]; then
echo "WARNING: FlashAttention files require large amounts of memory to build and will OOM"
echo "Setting MAX_JOBS=(nproc-2)/3 to reduce memory usage"
export MAX_JOBS="$(( $(nproc --ignore=2) / 3 ))"
@ -228,7 +211,7 @@ if [[ "$BUILD_ENVIRONMENT" == *-debug* ]]; then
export CMAKE_BUILD_TYPE=RelWithAssert
fi
# Do not change workspace permissions for ROCm CI jobs
# Do not change workspace permissions for ROCm and s390x CI jobs
# as it can leave workspace with bad permissions for cancelled jobs
if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* && -d /var/lib/jenkins/workspace ]]; then
# Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)
@ -247,7 +230,7 @@ if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* && -d /v
fi
if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then
set -e
set -e -o pipefail
get_bazel
@ -276,10 +259,8 @@ else
# or building non-XLA tests.
if [[ "$BUILD_ENVIRONMENT" != *rocm* &&
"$BUILD_ENVIRONMENT" != *xla* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *py3.8* ]]; then
# Install numpy-2.0.2 for builds which are backward compatible with 1.X
python -mpip install --pre numpy==2.0.2
fi
# Install numpy-2.0.2 for builds which are backward compatible with 1.X
python -mpip install numpy==2.0.2
WERROR=1 python setup.py clean
@ -302,6 +283,18 @@ else
fi
pip_install_whl "$(echo dist/*.whl)"
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
echo "Checking that xpu is compiled"
pushd dist/
if python -c 'import torch; exit(0 if torch.xpu._is_compiled() else 1)'; then
echo "XPU support is compiled in."
else
echo "XPU support is NOT compiled in."
exit 1
fi
popd
fi
# TODO: I'm not sure why, but somehow we lose verbose commands
set -x
@ -377,8 +370,10 @@ else
# This is an attempt to mitigate flaky libtorch build OOM error. By default, the build parallelization
# is set to be the number of CPU minus 2. So, let's try a more conservative value here. A 4xlarge has
# 16 CPUs
MAX_JOBS=$(nproc --ignore=4)
export MAX_JOBS
if [ -z "$MAX_JOBS_OVERRIDE" ]; then
MAX_JOBS=$(nproc --ignore=4)
export MAX_JOBS
fi
# NB: Install outside of source directory (at the same level as the root
# pytorch folder) so that it doesn't get cleaned away prior to docker push.
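
Both `MAX_JOBS` changes in this file follow one pattern: the script picks its conservative default only when `MAX_JOBS_OVERRIDE` is empty, and otherwise leaves whatever `MAX_JOBS` the caller provided. A minimal sketch, assuming a Linux host with `nproc`; the function name `set_default_max_jobs` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: apply the script's default parallelism unless the caller opted out.
# $1: number of cores to leave idle.
set_default_max_jobs() {
    if [ -z "${MAX_JOBS_OVERRIDE:-}" ]; then
        MAX_JOBS=$(nproc --ignore="$1")
        export MAX_JOBS
    fi
}

# Caller pinned a value and set the override: the default is not applied.
MAX_JOBS=7
MAX_JOBS_OVERRIDE=1
set_default_max_jobs 4
echo "with override: MAX_JOBS=$MAX_JOBS"

# No override: the script computes its own value.
unset MAX_JOBS_OVERRIDE
set_default_max_jobs 4
echo "without override: MAX_JOBS=$MAX_JOBS"
```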


@ -59,78 +59,16 @@ else
export install_root="$(dirname $(which python))/../lib/python${py_dot}/site-packages/torch/"
fi
###############################################################################
# Setup XPU ENV
###############################################################################
if [[ "$DESIRED_CUDA" == 'xpu' ]]; then
set +u
# Refer https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html
source /opt/intel/oneapi/compiler/latest/env/vars.sh
source /opt/intel/oneapi/pti/latest/env/vars.sh
fi
###############################################################################
# Check GCC ABI
###############################################################################
# NOTE [ Building libtorch with old vs. new gcc ABI ]
#
# Packages built with one version of ABI could not be linked against by client
# C++ libraries that were compiled using the other version of ABI. Since both
# gcc ABIs are still common in the wild, we need to support both ABIs. Currently:
#
# - All the nightlies built on CentOS 7 + devtoolset7 use the old gcc ABI.
# - All the nightlies built on Ubuntu 16.04 + gcc 5.4 use the new gcc ABI.
# NOTE: As of https://github.com/pytorch/pytorch/issues/126551 we only produce
# wheels with cxx11-abi
echo "Checking that the gcc ABI is what we expect"
if [[ "$(uname)" != 'Darwin' ]]; then
function is_expected() {
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* || "$DESIRED_CUDA" == *"rocm"* ]]; then
if [[ "$1" -gt 0 || "$1" == "ON " ]]; then
echo 1
fi
else
if [[ -z "$1" || "$1" == 0 || "$1" == "OFF" ]]; then
echo 1
fi
fi
}
# First we check that the env var in TorchConfig.cmake is correct
# We search for D_GLIBCXX_USE_CXX11_ABI=1 in torch/TorchConfig.cmake
torch_config="${install_root}/share/cmake/Torch/TorchConfig.cmake"
if [[ ! -f "$torch_config" ]]; then
echo "No TorchConfig.cmake found!"
ls -lah "$install_root/share/cmake/Torch"
exit 1
fi
echo "Checking the TorchConfig.cmake"
cat "$torch_config"
# The sed call below is
# don't print lines by default (only print the line we want)
# -n
# execute the following expression
# e
# replace lines that match with the first capture group and print
# s/.*D_GLIBCXX_USE_CXX11_ABI=\(.\)".*/\1/p
# any characters, D_GLIBCXX_USE_CXX11_ABI=, exactly one any character, a
# quote, any characters
# Note the exactly one single character after the '='. In the case that the
# variable is not set the '=' will be followed by a '"' immediately and the
# line will fail the match and nothing will be printed; this is what we
# want. Otherwise it will capture the 0 or 1 after the '='.
# /.*D_GLIBCXX_USE_CXX11_ABI=\(.\)".*/
# replace the matched line with the capture group and print
# /\1/p
actual_gcc_abi="$(sed -ne 's/.*D_GLIBCXX_USE_CXX11_ABI=\(.\)".*/\1/p' < "$torch_config")"
if [[ "$(is_expected "$actual_gcc_abi")" != 1 ]]; then
echo "gcc ABI $actual_gcc_abi not as expected."
exit 1
fi
# We also check that there are [not] cxx11 symbols in libtorch
# We also check that there are cxx11 symbols in libtorch
#
echo "Checking that symbols in libtorch.so have the right gcc abi"
python3 "$(dirname ${BASH_SOURCE[0]})/smoke_test/check_binary_symbols.py"
@ -208,35 +146,11 @@ setup_link_flags () {
TEST_CODE_DIR="$(dirname $(realpath ${BASH_SOURCE[0]}))/test_example_code"
build_and_run_example_cpp () {
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
GLIBCXX_USE_CXX11_ABI=1
else
GLIBCXX_USE_CXX11_ABI=0
fi
setup_link_flags
g++ ${TEST_CODE_DIR}/$1.cpp -I${install_root}/include -I${install_root}/include/torch/csrc/api/include -D_GLIBCXX_USE_CXX11_ABI=$GLIBCXX_USE_CXX11_ABI -std=gnu++17 -L${install_root}/lib ${REF_LIB} ${ADDITIONAL_LINKER_FLAGS} -ltorch $TORCH_CPU_LINK_FLAGS $TORCH_CUDA_LINK_FLAGS $C10_LINK_FLAGS -o $1
g++ ${TEST_CODE_DIR}/$1.cpp -I${install_root}/include -I${install_root}/include/torch/csrc/api/include -std=gnu++17 -L${install_root}/lib ${REF_LIB} ${ADDITIONAL_LINKER_FLAGS} -ltorch $TORCH_CPU_LINK_FLAGS $TORCH_CUDA_LINK_FLAGS $C10_LINK_FLAGS -o $1
./$1
}
build_example_cpp_with_incorrect_abi () {
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
GLIBCXX_USE_CXX11_ABI=0
else
GLIBCXX_USE_CXX11_ABI=1
fi
set +e
setup_link_flags
g++ ${TEST_CODE_DIR}/$1.cpp -I${install_root}/include -I${install_root}/include/torch/csrc/api/include -D_GLIBCXX_USE_CXX11_ABI=$GLIBCXX_USE_CXX11_ABI -std=gnu++17 -L${install_root}/lib ${REF_LIB} ${ADDITIONAL_LINKER_FLAGS} -ltorch $TORCH_CPU_LINK_FLAGS $TORCH_CUDA_LINK_FLAGS $C10_LINK_FLAGS -o $1
ERRCODE=$?
set -e
if [ "$ERRCODE" -eq "0" ]; then
echo "Building example with incorrect ABI didn't throw error. Aborting."
exit 1
else
echo "Building example with incorrect ABI throws expected error. Proceeding."
fi
}
###############################################################################
# Check simple Python/C++ calls
###############################################################################
@ -246,11 +160,6 @@ if [[ "$PACKAGE_TYPE" == 'libtorch' ]]; then
export LD_LIBRARY_PATH=/usr/local/cuda/lib64
fi
build_and_run_example_cpp simple-torch-test
# `_GLIBCXX_USE_CXX11_ABI` is always ignored by gcc in devtoolset7, so we test
# the expected failure case for Ubuntu 16.04 + gcc 5.4 only.
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
build_example_cpp_with_incorrect_abi simple-torch-test
fi
else
pushd /tmp
python -c 'import torch'
@ -307,6 +216,14 @@ else
fi
fi
###############################################################################
# Check XPU configured correctly
###############################################################################
if [[ "$DESIRED_CUDA" == 'xpu' && "$PACKAGE_TYPE" != 'libtorch' ]]; then
echo "Checking that xpu is compiled"
python -c 'import torch; exit(0 if torch.xpu._is_compiled() else 1)'
fi
###############################################################################
# Check CUDA configured correctly
###############################################################################
@ -385,10 +302,22 @@ except RuntimeError as e:
fi
###############################################################################
# Check for C++ ABI compatibility between gcc7 and gcc9 compiled binaries
# Check for C++ ABI compatibility with GCC 11 through GCC 13
###############################################################################
if [[ "$(uname)" == 'Linux' && ("$PACKAGE_TYPE" == 'conda' || "$PACKAGE_TYPE" == 'manywheel')]]; then
if [[ "$(uname)" == 'Linux' && "$PACKAGE_TYPE" == 'manywheel' ]]; then
pushd /tmp
python -c "import torch; exit(0 if torch.compiled_with_cxx11_abi() else (0 if torch._C._PYBIND11_BUILD_ABI == '_cxxabi1011' else 1))"
# Per https://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Dialect-Options.html
# gcc-11 is ABI16, gcc-13 is ABI18, gcc-14 is ABI19
# gcc 11 - CUDA 11.8, xpu, rocm
# gcc 13 - CUDA 12.6, 12.8 and cpu
# Please see issue for reference: https://github.com/pytorch/pytorch/issues/152426
if [[ "$(uname -m)" == "s390x" ]]; then
cxx_abi="19"
elif [[ "$DESIRED_CUDA" != 'cu118' && "$DESIRED_CUDA" != 'xpu' && "$DESIRED_CUDA" != 'rocm'* ]]; then
cxx_abi="18"
else
cxx_abi="16"
fi
python -c "import torch; exit(0 if torch._C._PYBIND11_BUILD_ABI == '_cxxabi10${cxx_abi}' else 1)"
popd
fi
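
The new check derives the expected pybind11 build ABI suffix from the machine architecture and `DESIRED_CUDA`. The same branching can be sketched as a pure function, which makes it easy to try against a few configurations without importing torch. The function name `expected_cxx_abi` is hypothetical, and the sample `DESIRED_CUDA` values are illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the branch order in the diff.
# $1: output of `uname -m`, $2: DESIRED_CUDA
expected_cxx_abi() {
    local machine=$1 desired_cuda=$2
    if [[ "$machine" == "s390x" ]]; then
        echo 19   # gcc-14
    elif [[ "$desired_cuda" != 'cu118' && "$desired_cuda" != 'xpu' && "$desired_cuda" != 'rocm'* ]]; then
        echo 18   # gcc-13
    else
        echo 16   # gcc-11
    fi
}

for cfg in "x86_64 cpu" "x86_64 cu118" "x86_64 cu126" "x86_64 rocm6.3" "x86_64 xpu" "s390x cpu"; do
    # Intentional word splitting: each entry is "<machine> <desired_cuda>".
    # shellcheck disable=SC2086
    echo "$cfg -> _cxxabi10$(expected_cxx_abi $cfg)"
done
```

Note that the `s390x` branch is tested first, so the architecture takes precedence over `DESIRED_CUDA`.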


@ -3,7 +3,7 @@
# Common setup for all Jenkins scripts
# shellcheck source=./common_utils.sh
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
set -ex
set -ex -o pipefail
# Required environment variables:
# $BUILD_ENVIRONMENT (should be set by your Docker image)
@ -13,10 +13,6 @@ if [[ "${BUILD_ENVIRONMENT}" == *rocm* ]]; then
# HIP_PLATFORM is auto-detected by hipcc; unset to avoid build errors
unset HIP_PLATFORM
export PYTORCH_TEST_WITH_ROCM=1
# temporary to locate some kernel issues on the CI nodes
export HSAKMT_DEBUG_LEVEL=4
# improve rccl performance for distributed tests
export HSA_FORCE_FINE_GRAIN_PCIE=1
fi
# TODO: Re-enable libtorch testing for MacOS, see https://github.com/pytorch/pytorch/issues/62598


@ -160,7 +160,7 @@ function install_torchvision() {
}
function install_tlparse() {
pip_install --user "tlparse==0.3.25"
pip_install --user "tlparse==0.3.30"
PATH="$(python -m site --user-base)/bin:$PATH"
}
@ -169,24 +169,34 @@ function install_torchrec_and_fbgemm() {
torchrec_commit=$(get_pinned_commit torchrec)
local fbgemm_commit
fbgemm_commit=$(get_pinned_commit fbgemm)
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]] ; then
fbgemm_commit=$(get_pinned_commit fbgemm_rocm)
fi
pip_uninstall torchrec-nightly
pip_uninstall fbgemm-gpu-nightly
pip_install setuptools-git-versioning scikit-build pyre-extensions
# TODO (huydhn): I still have no clue on why sccache doesn't work with only fbgemm_gpu here, but it
# seems to be an sccache-related issue
if [[ "$IS_A100_RUNNER" == "1" ]]; then
unset CMAKE_CUDA_COMPILER_LAUNCHER
sudo mv /opt/cache/bin /opt/cache/bin-backup
fi
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]] ; then
# install torchrec first because it installs fbgemm nightly on top of rocm fbgemm
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
pip_uninstall fbgemm-gpu-nightly
# See https://github.com/pytorch/pytorch/issues/106971
CUDA_PATH=/usr/local/cuda-12.1 pip_install --no-use-pep517 --user "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#egg=fbgemm-gpu&subdirectory=fbgemm_gpu"
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
if [[ "$IS_A100_RUNNER" == "1" ]]; then
export CMAKE_CUDA_COMPILER_LAUNCHER=/opt/cache/bin/sccache
sudo mv /opt/cache/bin-backup /opt/cache/bin
pip_install tabulate # needed for newer fbgemm
pip_install patchelf # needed for rocm fbgemm
git clone --recursive https://github.com/pytorch/fbgemm
pushd fbgemm/fbgemm_gpu
git checkout "${fbgemm_commit}"
python setup.py install \
--package_variant=rocm \
-DHIP_ROOT_DIR="${ROCM_PATH}" \
-DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \
-DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"
popd
rm -rf fbgemm
else
# See https://github.com/pytorch/pytorch/issues/106971
CUDA_PATH=/usr/local/cuda-12.1 pip_install --no-use-pep517 --user "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#egg=fbgemm-gpu&subdirectory=fbgemm_gpu"
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
fi
}
@ -216,6 +226,11 @@ function checkout_install_torchbench() {
# to install and test other models
python install.py --continue_on_fail
fi
# TODO (huydhn): transformers-4.44.2 added by https://github.com/pytorch/benchmark/pull/2488
# is regressing speedup metric. This needs to be investigated further
pip install transformers==4.38.1
echo "Print all dependencies after TorchBench is installed"
python -mpip freeze
popd


@ -40,7 +40,7 @@ echo "Building PyTorch C++ API docs..."
rm -rf cppdocs
git clone https://github.com/pytorch/cppdocs
set -ex
set -ex -o pipefail
# Generate ATen files
pushd "${pt_checkout}"


@ -5,7 +5,7 @@ pt_checkout="/var/lib/jenkins/workspace"
source "$pt_checkout/.ci/pytorch/common_utils.sh"
echo "functorch_doc_push_script.sh: Invoked with $*"
set -ex
set -ex -o pipefail
version=${DOCS_VERSION:-nightly}
echo "version: $version"
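
Several scripts in this comparison switch from `set -ex` to `set -ex -o pipefail`. The difference matters for pipelines: without `pipefail`, a pipeline's exit status is that of its last command, so an earlier failure goes unnoticed by `set -e`. A minimal demonstration, run in child shells so the options do not affect the caller:

```shell
#!/usr/bin/env bash
# Without pipefail the failing `false` is masked by the succeeding `true`.
without=$(bash -c 'false | true; echo $?')

# With pipefail the pipeline reports the failure of `false`.
with=$(bash -c 'set -o pipefail; false | true; echo $?')

echo "exit status without pipefail: $without"
echo "exit status with pipefail:    $with"
```

Combined with `-e`, this means a failing command on the left side of a pipe now aborts the script instead of being silently ignored.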


@ -1,31 +1,50 @@
#!/bin/bash
# Script for installing sccache on the xla build job, which uses xla's docker
# image and doesn't have sccache installed on it. This is mostly copied from
# .ci/docker/install_cache.sh. Changes are: removing checks that will always
# return the same thing, ex checks for for rocm, CUDA, and changing the path
# where sccache is installed, and not changing /etc/environment.
# image, which has sccache installed but doesn't write the stubs. This is
# mostly copied from .ci/docker/install_cache.sh. Changes are: removing checks
# that will always return the same thing, e.g. checks for rocm, CUDA, changing
# the path where sccache is installed, not changing /etc/environment, and not
# installing/downloading sccache as it is already in the docker image.
set -ex
install_binary() {
echo "Downloading sccache binary from S3 repo"
curl --retry 3 https://s3.amazonaws.com/ossci-linux/sccache -o /tmp/cache/bin/sccache
}
set -ex -o pipefail
mkdir -p /tmp/cache/bin
mkdir -p /tmp/cache/lib
export PATH="/tmp/cache/bin:$PATH"
install_binary
chmod a+x /tmp/cache/bin/sccache
function write_sccache_stub() {
# Unset LD_PRELOAD for ps because of asan + ps issues
# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90589
# shellcheck disable=SC2086
# shellcheck disable=SC2059
printf "#!/bin/sh\nif [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then\n exec sccache $(which $1) \"\$@\"\nelse\n exec $(which $1) \"\$@\"\nfi" > "/tmp/cache/bin/$1"
if [ "$1" == "gcc" ]; then
# Do not call sccache recursively when dumping preprocessor argument
# For some reason it's very important for the first cached nvcc invocation
cat >"/tmp/cache/bin/$1" <<EOF
#!/bin/sh
# sccache does not support -E flag, so we need to call the original compiler directly in order to avoid calling this wrapper recursively
for arg in "\$@"; do
if [ "\$arg" = "-E" ]; then
exec $(which "$1") "\$@"
fi
done
if [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then
exec sccache $(which "$1") "\$@"
else
exec $(which "$1") "\$@"
fi
EOF
else
cat >"/tmp/cache/bin/$1" <<EOF
#!/bin/sh
if [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then
exec sccache $(which "$1") "\$@"
else
exec $(which "$1") "\$@"
fi
EOF
fi
chmod a+x "/tmp/cache/bin/$1"
}
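
The new `gcc` stub bypasses sccache whenever `-E` appears among the arguments, because sccache does not handle preprocess-only invocations and the wrapper would otherwise call itself recursively. The argument scan at the heart of that stub can be sketched on its own; `has_preprocess_only_flag` is a hypothetical name, not part of the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if any argument is exactly "-E".
has_preprocess_only_flag() {
    local arg
    for arg in "$@"; do
        if [ "$arg" = "-E" ]; then
            return 0
        fi
    done
    return 1
}

if has_preprocess_only_flag -x c -E -dM /dev/null; then
    echo "preprocess only: call the real compiler directly"
fi
if ! has_preprocess_only_flag -c foo.c -o foo.o; then
    echo "normal compile: route through sccache"
fi
```

In the generated stub the same loop is written with `\$@` and `\$arg`, so those expansions happen when the stub runs, while `$(which "$1")` is left unescaped and is resolved once, when the stub is written.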


@ -33,56 +33,15 @@ if which sccache > /dev/null; then
export PATH="${tmp_dir}:$PATH"
fi
cross_compile_arm64() {
# Cross compilation for arm64
print_cmake_info
if [[ ${BUILD_ENVIRONMENT} == *"distributed"* ]]; then
# Needed for inductor benchmarks, as lots of HF networks make `torch.distribtued` calls
USE_DISTRIBUTED=1 USE_OPENMP=1 WERROR=1 python setup.py bdist_wheel
else
# Explicitly set USE_DISTRIBUTED=0 to align with the default build config on mac. This also serves as the sole CI config that tests
# that building with USE_DISTRIBUTED=0 works at all. See https://github.com/pytorch/pytorch/issues/86448
USE_DISTRIBUTED=0 CMAKE_OSX_ARCHITECTURES=arm64 MACOSX_DEPLOYMENT_TARGET=11.0 USE_MKLDNN=OFF USE_QNNPACK=OFF WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel
}
compile_arm64() {
# Compilation for arm64
# TODO: Compile with OpenMP support (but this causes CI regressions as cross-compilation were done with OpenMP disabled)
USE_DISTRIBUTED=0 USE_OPENMP=1 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel
}
compile_x86_64() {
USE_DISTRIBUTED=0 WERROR=1 python setup.py bdist_wheel --plat-name=macosx_10_9_x86_64
}
build_lite_interpreter() {
echo "Testing libtorch (lite interpreter)."
CPP_BUILD="$(pwd)/../cpp_build"
# Ensure the removal of the tmp directory
trap 'rm -rfv ${CPP_BUILD}' EXIT
rm -rf "${CPP_BUILD}"
mkdir -p "${CPP_BUILD}/caffe2"
# It looks libtorch need to be built in "${CPP_BUILD}/caffe2 folder.
BUILD_LIBTORCH_PY=$PWD/tools/build_libtorch.py
pushd "${CPP_BUILD}/caffe2" || exit
VERBOSE=1 DEBUG=1 python "${BUILD_LIBTORCH_PY}"
popd || exit
"${CPP_BUILD}/caffe2/build/bin/test_lite_interpreter_runtime"
}
print_cmake_info
if [[ ${BUILD_ENVIRONMENT} = *arm64* ]]; then
if [[ $(uname -m) == "arm64" ]]; then
compile_arm64
else
cross_compile_arm64
fi
elif [[ ${BUILD_ENVIRONMENT} = *lite-interpreter* ]]; then
export BUILD_LITE_INTERPRETER=1
build_lite_interpreter
else
compile_x86_64
fi
if which sccache > /dev/null; then
print_sccache_stats
fi


@ -18,6 +18,9 @@ if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available(
fi
popd
# enable debug asserts in serialization
export TORCH_SERIALIZATION_DEBUG=1
setup_test_python() {
# The CircleCI worker hostname doesn't resolve to an address.
# This environment variable makes ProcessGroupGloo default to
@ -39,6 +42,16 @@ test_python_all() {
assert_git_not_dirty
}
test_python_mps() {
setup_test_python
time python test/run_test.py --verbose --mps
MTL_CAPTURE_ENABLED=1 ${CONDA_RUN} python3 test/test_mps.py --verbose -k test_metal_capture
assert_git_not_dirty
}
test_python_shard() {
if [[ -z "$NUM_TEST_SHARDS" ]]; then
echo "NUM_TEST_SHARDS must be defined to run a Python test shard"
@ -218,25 +231,55 @@ test_torchbench_smoketest() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
local backend=eager
local dtype=notset
local device=mps
local models=(hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152 sam pytorch_unet stable_diffusion_text_encoder speech_transformer Super_SloMo)
local hf_models=(GoogleFnet YituTechConvBert)
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv"
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
for backend in eager inductor; do
echo "Setup complete, launching torchbench training performance run"
for model in hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --training --devices "$device" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv"
done
for dtype in notset float16 bfloat16; do
echo "Launching torchbench inference performance run for backend ${backend} and dtype ${dtype}"
local dtype_arg="--${dtype}"
if [ "$dtype" == notset ]; then
dtype_arg="--float32"
fi
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
for model in "${models[@]}"; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv" || true
if [ "$backend" == "inductor" ]; then
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--accuracy --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_accuracy.csv" || true
fi
done
for model in "${hf_models[@]}"; do
if [ "$backend" == "inductor" ]; then
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \
--performance --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv" || true
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \
--accuracy --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_accuracy.csv" || true
fi
done
done
for dtype in notset amp; do
echo "Launching torchbench training performance run for backend ${backend} and dtype ${dtype}"
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv"
local dtype_arg="--${dtype}"
if [ "$dtype" == notset ]; then
dtype_arg="--float32"
fi
for model in "${models[@]}"; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --training --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv" || true
done
done
echo "Launching torchbench inference performance run"
for model in hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --inference --devices "$device" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
done
echo "Pytorch benchmark on mps device completed"
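
In the reworked smoke test, the dtype name is used twice: verbatim in the CSV file name, and translated into a command-line flag, where `notset` becomes `--float32`. A minimal sketch of that mapping, with `dtype_to_flag` as a hypothetical helper and a shortened output path:

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the dtype label used in report names to a benchmark flag.
dtype_to_flag() {
    if [ "$1" = notset ]; then
        echo "--float32"
    else
        echo "--$1"
    fi
}

device=mps
for backend in eager inductor; do
    for dtype in notset float16 bfloat16; do
        echo "inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv uses $(dtype_to_flag "$dtype")"
    done
done
```

Keeping `notset` in the file name while passing `--float32` preserves the existing report naming even though the precision is now requested explicitly.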
@ -288,6 +331,8 @@ elif [[ $TEST_CONFIG == *"perf_timm"* ]]; then
test_timm_perf
elif [[ $TEST_CONFIG == *"perf_smoketest"* ]]; then
test_torchbench_smoketest
elif [[ $TEST_CONFIG == *"mps"* ]]; then
test_python_mps
elif [[ $NUM_TEST_SHARDS -gt 1 ]]; then
test_python_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then


@ -8,55 +8,62 @@
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
echo "Testing pytorch"
time python test/run_test.py --include test_cuda_multigpu test_cuda_primary_ctx --verbose
# When adding more tests, please use HUD to see which shard is shorter
if [[ "${SHARD_NUMBER:-1}" == "1" ]]; then
# FSDP tests
for f in test/distributed/fsdp/*.py ; do time python test/run_test.py --verbose -i "${f#*/}" ; done
fi
# Disabling tests to see if they solve timeout issues; see https://github.com/pytorch/pytorch/issues/70015
# python tools/download_mnist.py --quiet -d test/cpp/api/mnist
# OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="test/cpp/api/mnist" build/bin/test_api
time python test/run_test.py --verbose -i distributed/test_c10d_common
time python test/run_test.py --verbose -i distributed/test_c10d_gloo
time python test/run_test.py --verbose -i distributed/test_c10d_nccl
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_gloo
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_nccl
time python test/run_test.py --verbose -i distributed/test_compute_comm_reordering
time python test/run_test.py --verbose -i distributed/test_store
time python test/run_test.py --verbose -i distributed/test_symmetric_memory
time python test/run_test.py --verbose -i distributed/test_pg_wrapper
time python test/run_test.py --verbose -i distributed/rpc/cuda/test_tensorpipe_agent
# FSDP tests
for f in test/distributed/fsdp/*.py ; do time python test/run_test.py --verbose -i "${f#*/}" ; done
# ShardedTensor tests
time python test/run_test.py --verbose -i distributed/checkpoint/test_checkpoint
time python test/run_test.py --verbose -i distributed/checkpoint/test_file_system_checkpoint
time python test/run_test.py --verbose -i distributed/_shard/sharding_spec/test_sharding_spec
time python test/run_test.py --verbose -i distributed/_shard/sharding_plan/test_sharding_plan
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/test_sharded_tensor
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/test_sharded_tensor_reshard
if [[ "${SHARD_NUMBER:-2}" == "2" ]]; then
time python test/run_test.py --include test_cuda_multigpu test_cuda_primary_ctx --verbose
# functional collective tests
time python test/run_test.py --verbose -i distributed/test_functional_api
# Disabling tests to see if they solve timeout issues; see https://github.com/pytorch/pytorch/issues/70015
# python tools/download_mnist.py --quiet -d test/cpp/api/mnist
# OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="test/cpp/api/mnist" build/bin/test_api
time python test/run_test.py --verbose -i distributed/test_c10d_common
time python test/run_test.py --verbose -i distributed/test_c10d_gloo
time python test/run_test.py --verbose -i distributed/test_c10d_nccl
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_gloo
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_nccl
time python test/run_test.py --verbose -i distributed/test_compute_comm_reordering
time python test/run_test.py --verbose -i distributed/test_store
time python test/run_test.py --verbose -i distributed/test_symmetric_memory
time python test/run_test.py --verbose -i distributed/test_pg_wrapper
time python test/run_test.py --verbose -i distributed/rpc/cuda/test_tensorpipe_agent
# DTensor tests
time python test/run_test.py --verbose -i distributed/_tensor/test_random_ops
time python test/run_test.py --verbose -i distributed/_tensor/test_dtensor_compile
# ShardedTensor tests
time python test/run_test.py --verbose -i distributed/checkpoint/test_checkpoint
time python test/run_test.py --verbose -i distributed/checkpoint/test_file_system_checkpoint
time python test/run_test.py --verbose -i distributed/_shard/sharding_spec/test_sharding_spec
time python test/run_test.py --verbose -i distributed/_shard/sharding_plan/test_sharding_plan
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/test_sharded_tensor
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/test_sharded_tensor_reshard
# DeviceMesh test
time python test/run_test.py --verbose -i distributed/test_device_mesh
# functional collective tests
time python test/run_test.py --verbose -i distributed/test_functional_api
# DTensor/TP tests
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_examples
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_random_state
# DTensor tests
time python test/run_test.py --verbose -i distributed/tensor/test_random_ops
time python test/run_test.py --verbose -i distributed/tensor/test_dtensor_compile
# FSDP2 tests
time python test/run_test.py --verbose -i distributed/_composable/fsdp/test_fully_shard_training -- -k test_2d_mlp_with_nd_mesh
# DeviceMesh test
time python test/run_test.py --verbose -i distributed/test_device_mesh
# ND composability tests
time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_2d_composability
time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_pp_composability
# DTensor/TP tests
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_examples
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_random_state
# Other tests
time python test/run_test.py --verbose -i test_cuda_primary_ctx
time python test/run_test.py --verbose -i test_optim -- -k test_forloop_goes_right_direction_multigpu
time python test/run_test.py --verbose -i test_optim -- -k test_mixed_device_dtype
time python test/run_test.py --verbose -i test_foreach -- -k test_tensors_grouping
# FSDP2 tests
time python test/run_test.py --verbose -i distributed/_composable/fsdp/test_fully_shard_training -- -k test_2d_mlp_with_nd_mesh
# ND composability tests
time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_2d_composability
time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_pp_composability
# Other tests
time python test/run_test.py --verbose -i test_cuda_primary_ctx
time python test/run_test.py --verbose -i test_optim -- -k test_forloop_goes_right_direction_multigpu
time python test/run_test.py --verbose -i test_optim -- -k test_mixed_device_dtype
time python test/run_test.py --verbose -i test_foreach -- -k test_tensors_grouping
fi
assert_git_not_dirty

@ -1,22 +0,0 @@
#!/bin/bash
set -e
run_test () {
rm -rf test_tmp/ && mkdir test_tmp/ && cd test_tmp/
"$@"
cd .. && rm -rf test_tmp/
}
get_runtime_of_command () {
TIMEFORMAT=%R
# runtime=$( { time ($@ &> /dev/null); } 2>&1 1>/dev/null)
runtime=$( { time "$@"; } 2>&1 1>/dev/null)
if [[ $runtime == *"Error"* ]]; then
exit 1
fi
runtime=${runtime#+++ $@}
runtime=$(python -c "print($runtime)")
echo "$runtime"
}

@ -1,91 +0,0 @@
import argparse
import json
import math
import sys
parser = argparse.ArgumentParser()
parser.add_argument(
"--test-name", dest="test_name", action="store", required=True, help="test name"
)
parser.add_argument(
"--sample-stats",
dest="sample_stats",
action="store",
required=True,
help="stats from sample",
)
parser.add_argument(
"--update",
action="store_true",
help="whether to update baseline using stats from sample",
)
args = parser.parse_args()
test_name = args.test_name
if "cpu" in test_name:
    backend = "cpu"
elif "gpu" in test_name:
    backend = "gpu"
data_file_path = f"../{backend}_runtime.json"
with open(data_file_path) as data_file:
    data = json.load(data_file)
if test_name in data:
    mean = float(data[test_name]["mean"])
    sigma = float(data[test_name]["sigma"])
else:
    # Let the test pass if baseline number doesn't exist
    mean = sys.maxsize
    sigma = 0.001
print("population mean: ", mean)
print("population sigma: ", sigma)
# Let the test pass if baseline number is NaN (which happened in
# the past when we didn't have logic for catching NaN numbers)
if math.isnan(mean) or math.isnan(sigma):
    mean = sys.maxsize
    sigma = 0.001
sample_stats_data = json.loads(args.sample_stats)
sample_mean = float(sample_stats_data["mean"])
sample_sigma = float(sample_stats_data["sigma"])
print("sample mean: ", sample_mean)
print("sample sigma: ", sample_sigma)
if math.isnan(sample_mean):
    raise Exception("""Error: sample mean is NaN""")  # noqa: TRY002
elif math.isnan(sample_sigma):
    raise Exception("""Error: sample sigma is NaN""")  # noqa: TRY002
z_value = (sample_mean - mean) / sigma
print("z-value: ", z_value)
if z_value >= 3:
    raise Exception(  # noqa: TRY002
        f"""\n
z-value >= 3, there is high chance of perf regression.\n
To reproduce this regression, run
`cd .ci/pytorch/perf_test/ && bash {test_name}.sh` on your local machine
and compare the runtime before/after your code change.
"""
    )
else:
    print("z-value < 3, no perf regression detected.")
    if args.update:
        print("We will use these numbers as new baseline.")
        new_data_file_path = f"../new_{backend}_runtime.json"
        with open(new_data_file_path) as new_data_file:
            new_data = json.load(new_data_file)
        new_data[test_name] = {}
        new_data[test_name]["mean"] = sample_mean
        new_data[test_name]["sigma"] = max(sample_sigma, sample_mean * 0.1)
        with open(new_data_file_path, "w") as new_data_file:
            json.dump(new_data, new_data_file, indent=4)

@ -1,18 +0,0 @@
import json
import sys
import numpy
sample_data_list = sys.argv[1:]
sample_data_list = [float(v.strip()) for v in sample_data_list]
sample_mean = numpy.mean(sample_data_list)
sample_sigma = numpy.std(sample_data_list)
data = {
"mean": sample_mean,
"sigma": sample_sigma,
}
print(json.dumps(data))

@ -1,43 +0,0 @@
#!/bin/bash
set -e
. ./common.sh
test_cpu_speed_mini_sequence_labeler () {
echo "Testing: mini sequence labeler, CPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/pytorch/benchmark.git
cd benchmark/
git checkout 726567a455edbfda6199445922a8cfee82535664
cd scripts/mini_sequence_labeler
SAMPLE_ARRAY=()
NUM_RUNS=$1
for (( i=1; i<=NUM_RUNS; i++ )) do
runtime=$(get_runtime_of_command python main.py)
SAMPLE_ARRAY+=("${runtime}")
done
cd ../../..
stats=$(python ../get_stats.py "${SAMPLE_ARRAY[@]}")
echo "Runtime stats in seconds:"
echo "$stats"
if [ "$2" == "compare_with_baseline" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}"
elif [ "$2" == "compare_and_update" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}" --update
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_cpu_speed_mini_sequence_labeler "$@"
fi

@ -1,45 +0,0 @@
#!/bin/bash
set -e
. ./common.sh
test_cpu_speed_mnist () {
echo "Testing: MNIST, CPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/pytorch/examples.git -b perftests
cd examples/mnist
conda install -c pytorch torchvision-cpu
# Download data
python main.py --epochs 0
SAMPLE_ARRAY=()
NUM_RUNS=$1
for (( i=1; i<=NUM_RUNS; i++ )) do
runtime=$(get_runtime_of_command python main.py --epochs 1 --no-log)
echo "$runtime"
SAMPLE_ARRAY+=("${runtime}")
done
cd ../..
stats=$(python ../get_stats.py "${SAMPLE_ARRAY[@]}")
echo "Runtime stats in seconds:"
echo "$stats"
if [ "$2" == "compare_with_baseline" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}"
elif [ "$2" == "compare_and_update" ]; then
python ../compare_with_baseline.py --test-name "${FUNCNAME[0]}" --sample-stats "${stats}" --update
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_cpu_speed_mnist "$@"
fi

@ -1,29 +0,0 @@
#!/bin/bash
. ./common.sh
test_cpu_speed_torch () {
echo "Testing: torch.*, CPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/yf225/perf-tests.git
if [ "$1" == "compare_with_baseline" ]; then
export ARGS=(--compare ../cpu_runtime.json)
elif [ "$1" == "compare_and_update" ]; then
export ARGS=(--compare ../cpu_runtime.json --update ../new_cpu_runtime.json)
elif [ "$1" == "update_only" ]; then
export ARGS=(--update ../new_cpu_runtime.json)
fi
if ! python perf-tests/modules/test_cpu_torch.py "${ARGS[@]}"; then
echo "To reproduce this regression, run \`cd .ci/pytorch/perf_test/ && bash ${FUNCNAME[0]}.sh\` on your local machine and compare the runtime before/after your code change."
exit 1
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_cpu_speed_torch "$@"
fi

@ -1,29 +0,0 @@
#!/bin/bash
. ./common.sh
test_cpu_speed_torch_tensor () {
echo "Testing: torch.Tensor.*, CPU"
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
git clone https://github.com/yf225/perf-tests.git
if [ "$1" == "compare_with_baseline" ]; then
export ARGS=(--compare ../cpu_runtime.json)
elif [ "$1" == "compare_and_update" ]; then
export ARGS=(--compare ../cpu_runtime.json --update ../new_cpu_runtime.json)
elif [ "$1" == "update_only" ]; then
export ARGS=(--update ../new_cpu_runtime.json)
fi
if ! python perf-tests/modules/test_cpu_torch_tensor.py "${ARGS[@]}"; then
echo "To reproduce this regression, run \`cd .ci/pytorch/perf_test/ && bash ${FUNCNAME[0]}.sh\` on your local machine and compare the runtime before/after your code change."
exit 1
fi
}
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
run_test test_cpu_speed_torch_tensor "$@"
fi
