Commit Graph

1730 Commits

Author SHA1 Message Date
99f2491af9 Revert "Use absolute path path.resolve() -> path.absolute() (#129409)"
This reverts commit 45411d1fc9a2b6d2f891b6ab0ae16409719e09fc.

Reverted https://github.com/pytorch/pytorch/pull/129409 on behalf of https://github.com/jeanschmidt due to Breaking internal CI, @albanD please help get this PR merged ([comment](https://github.com/pytorch/pytorch/pull/129409#issuecomment-2571316444))
2025-01-04 14:17:20 +00:00
417d9c3522 [Inductor/Triton] Upcast FP16/BF16 math reductions to FP32 (#141052)
Summary:
The Triton compiler does not automatically promote fp16/bf16 reductions to fp32 accumulation. This can result in significant accuracy issues.

This diff will upcast the input to FP32 for all math reductions `["welford_reduce", "welford_combine", "prod", "sum", "xor_sum"]`
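
As a hedged illustration (not part of the diff), the snippet below emulates low-precision accumulation to show why the upcast matters; the exact drift depends on the backend and reduction order.

```python
import torch

x = torch.full((4096,), 0.01, dtype=torch.float16)

# Emulate an fp16 accumulator: every partial sum is rounded back to fp16.
acc = torch.tensor(0.0, dtype=torch.float16)
for v in x:
    acc = acc + v

print(acc)                        # drifts far from the true sum (~40.9)
print(x.to(torch.float32).sum())  # fp32 accumulation stays close to ~40.9
```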

Test Plan:
CI
```
python test/inductor/test_torchinductor.py TritonCodeGenTests.test_low_precision_reduction
```

Differential Revision: D65965032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141052
Approved by: https://github.com/blaine-rister
2025-01-04 07:57:10 +00:00
b5b1e9456a [MPSInductor] Add masked implementation (#144084)
More or less borrowed from
22580f160e/torch/_inductor/codegen/halide.py (L549-L563)

`pytest test/inductor/test_torchinductor.py -k _mps` score is 408 failed, 347 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144084
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #144167, #144162, #144083
2025-01-04 04:30:07 +00:00
479d6f2199 [mps/inductor] Add support for log(). (#144169)
Tested via:

```
 % pytest test/inductor/test_mps_basic.py
 ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144169
Approved by: https://github.com/jansel, https://github.com/malfet
2025-01-04 03:07:56 +00:00
464b50dbd7 [MPSInductor] Add floor_div and index_expr implementation (#144083)
Simply copy-n-pasted from CPPInductor

`pytest test/inductor/test_torchinductor.py -k _mps` score is 418 failed, 337 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144083
Approved by: https://github.com/jansel
ghstack dependencies: #144167, #144162
2025-01-04 01:10:01 +00:00
6d25938540 [MPSInductor] Add remainder op (#144162)
For it to return the correct result for half-precision types, the inputs must be
upcast to float.
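
A minimal sketch of the upcast-compute-downcast pattern, assuming a hypothetical helper (not the actual MPSInductor codegen):

```python
import torch

def remainder_half(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Compute the remainder in float32, then cast back to the original half dtype.
    return torch.remainder(a.float(), b.float()).to(a.dtype)
```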

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144162
Approved by: https://github.com/jansel
ghstack dependencies: #144167
2025-01-04 00:47:40 +00:00
f8e1eacf2f [MPSInductor] Extend constant to bool type (#144167)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144167
Approved by: https://github.com/jansel
2025-01-04 00:47:40 +00:00
d41134f7e5 [Inductor] Fix torch.polygamma() when n == 0 (#144058)
Fixes #143648

aten:

dec1a6d0f0/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp (L436-L447)

compiled kernel code:

```
cpp_fused_polygamma_0 = async_compile.cpp_pybinding(['const float*', 'float*'], '''
#include "/tmp/torchinductor_devuser/tmpi1d9ksww/db/cdb7hyptwxpzukwd42x4ajfjlgrpum4a4htdd6lhb65apclsmno4.h"
extern "C"  void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        {
            {
                auto tmp0 = in_ptr0[static_cast<int64_t>(0L)];
                auto tmp1 = static_cast<float>(0.0);
                auto tmp2 = tmp1 == 0 ? calc_digamma(tmp0) : calc_polygamma(tmp0, tmp1);
                out_ptr0[static_cast<int64_t>(0L)] = tmp2;
            }
        }
    }
}
''')
```
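
For reference, polygamma of order 0 is the digamma function, which is the path the ternary in the generated kernel now takes; a quick eager-mode check (not part of the PR's test plan):

```python
import torch

x = torch.tensor([1.5, 2.5, 3.5])
print(torch.polygamma(0, x))  # order-0 polygamma
print(torch.digamma(x))       # expected to match the line above
```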

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144058
Approved by: https://github.com/jansel
2025-01-04 00:22:10 +00:00
45411d1fc9 Use absolute path path.resolve() -> path.absolute() (#129409)
Changes:

1. Always explicit `.absolute()`: `Path(__file__)` -> `Path(__file__).absolute()`
2. Replace `path.resolve()` with `path.absolute()` if the code is resolving the PyTorch repo root directory (see the sketch below).
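
A hedged illustration of the difference (paths below are made up): `absolute()` only prepends the current working directory, while `resolve()` additionally collapses `..` and follows symlinks, which can point outside the checkout when the repo root is reached through a symlink.

```python
from pathlib import Path

p = Path("tools/../setup.py")  # hypothetical relative path

# absolute(): prefix with the CWD, keep ".." and symlinks untouched.
print(p.absolute())  # e.g. /home/user/pytorch/tools/../setup.py

# resolve(): also normalizes ".." and resolves symlinks.
print(p.resolve())   # e.g. /home/user/pytorch/setup.py (or the symlink target)
```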

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129409
Approved by: https://github.com/albanD
2025-01-03 20:03:40 +00:00
ad09395674 [MPSInductor] Fix multi rangevar kernel invocation (#144050)
By changing the `thread_position_in_grid` type to uint{n} and passing
dimensions during the kernel call

`pytest test/inductor/test_torchinductor.py -k _mps` score is 445 failed, 309 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144050
Approved by: https://github.com/jansel
ghstack dependencies: #144055, #144051, #144122, #144105, #144156
2025-01-03 19:32:43 +00:00
52e107a7ca [MPSInductor] Add constant, isinf and isnan ops (#144156)
Per Table 6.5 of [Metal Language Specification](https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf) infinity is `HUGE_VALF`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144156
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #144055, #144051, #144122, #144105
2025-01-03 19:32:43 +00:00
56f6289f6a [mps/inductor] Add support for atanh(). (#144121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144121
Approved by: https://github.com/jansel, https://github.com/malfet
2025-01-03 18:55:05 +00:00
a7b61c5b49 [MPSInductor] Add signbit op support (#144105)
By mapping it to `metal::signbit`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144105
Approved by: https://github.com/jansel, https://github.com/Skylion007
ghstack dependencies: #144055, #144051, #144122
2025-01-03 18:34:46 +00:00
60fe8a65af [Inductor] Generalize tiling algorithm to handle fused reductions (#144041)
# Issue

This PR cleans up an edge case that wasn't handled by https://github.com/pytorch/pytorch/pull/137243. The existing tiling code assumes that `node.get_ranges()` is a reliable source of pointwise and reduction numels. This is true for pointwise kernels, but the situation is more complicated with reductions. Since reductions change the number of elements in a tensor, not all ops within a reduction kernel will have the same number of iterations. For example, `var_mean` fuses pointwise division with the output of reduction sum, and the division lacks the corresponding reduction ranges.

# Fix

Instead of getting numels from `node.get_ranges()`, explicitly pass the global pointwise and reduction numels to the relevant tiling functions. In `SIMDKernel.complete_partial_tiling`, we solve for the missing numel by dividing the global numel by the partial tiling's numel. This ensures all tilings have the correct global numel.
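
A hedged sketch of the idea with hypothetical names (the real `SIMDKernel.complete_partial_tiling` differs): the missing split is whatever remains after dividing the global numel by the splits the partial tiling already accounts for.

```python
def complete_partial_tiling(partial, pointwise_numel, reduction_numel):
    # Hypothetical sketch, not Inductor's implementation.
    known = 1
    for split in partial.values():
        known *= split
    missing = (pointwise_numel * reduction_numel) // known
    missing_dim = "r0_" if "x" in partial else "x"
    return {**partial, missing_dim: missing}

# {"x": 8} with pointwise_numel=8, reduction_numel=8  ->  {"x": 8, "r0_": 8}
```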

Also, in `SIMDKernel.is_compatible`, add the global reduction numel to node ranges that are missing it. For example, `{"x": 8, "r0_": 8}` is compatible with  a node of ranges `([8], [])` when we have `reduction_numel=8`.

Finally, this PR generalizes some of the existing codegen to handle multiple reduction dims. We already had code to ignore reduction splits for pointwise kernels, but it only worked for 1D reductions. Now it can handle ND.

# Test plan

This PR parametrizes the existing CI test for `var_mean` to also run with tiled reductions. It also adds a new test checking that `var_mean` generates 2D tilings (with tiled reduction enabled). These new tests would fail on the current main branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144041
Approved by: https://github.com/jansel
2025-01-03 18:16:27 +00:00
e93f625d00 [AOTI] don't codegen autotune_at_compile_time for non-Triton kernels (#143990)
`autotune_at_compile_time` is a separate codegen file specifically for autotuning Triton kernels. We can skip it for non-Triton kernels (like CUTLASS).

This test (test_aoti_workspace_ptr) checks that `workspace_0.data_ptr()` is codegen-ed correctly in AOTI.

```
// in AOTI codegen
kernels.cuda_fused_0(
  (const half*)arg0_1.data_ptr(), (const half*)arg1_1.data_ptr(), (half*)buf0.data_ptr(),
  (int)200, (int)5216, (int)10432, (int)10432, (int)5216, (int)0, (int)5216,
  (size_t*)nullptr, (uint8_t*)workspace_0.data_ptr(), stream);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143990
Approved by: https://github.com/henrylhtsang, https://github.com/chenyang78, https://github.com/desertfire
2025-01-03 18:01:12 +00:00
e88d06f54e ir.ExternKernel: correctly handle kwarg default arguments (#141371)
Additionally, enable torchinductor opinfo tests exercising all
previously fixed bugs in this stack.

Note: I've manually sharded the cpp_wrapper CI checks into 2 shards.
Once all OpInfo tests are enabled we should switch back to automatic
sharding, but until then the pipeline doesn't have appropriate timing
stats.  More shards would be helpful given the compilation slowdown
associated with cpp_wrapper, but 2 will do for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141371
Approved by: https://github.com/desertfire
2025-01-03 16:05:31 +00:00
f7644efa79 [MPSInductor][EZ] Fix logical_[or|end] ops (#144122)
For boolean operands it does not really matter whether `&` or `&&` is
used, but if one were ever to rely on operator precedence, note that bitwise
ops have higher precedence than logical ones.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144122
Approved by: https://github.com/huydhn
ghstack dependencies: #144055, #144051
2025-01-03 15:28:07 +00:00
b336d72dae [MPSInductor] Preserve dtype during load (#144051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144051
Approved by: https://github.com/Skylion007
ghstack dependencies: #144055
2025-01-03 15:17:33 +00:00
c09bf71bd6 [Inductor][CPU] Fix C++ compile error of torch.max on bool type (#143848)
Fix https://github.com/pytorch/pytorch/issues/143568
Before:
![image](https://github.com/user-attachments/assets/3e1e869e-7ae7-45c0-a334-8a663028e003)
After:
![image](https://github.com/user-attachments/assets/91f72920-64bd-449a-a6c6-6048409c1450)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143848
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2025-01-03 09:00:43 +00:00
a8c98ce175 [cutlass-3] Update third-party/cutlass-3 from 3.4 to 3.5.1 (#143515)
# Summary:

This also makes updates to different repositories throughout FB code to roll any updates needed for this new release.

I was not able to get AsyncMM.cu to build (still trying) Yfiu suggested that I just skip it for now

Test Plan:
Have run various build commands to try and expose errors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143515
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-01-02 18:45:11 +00:00
eec30916e7 Revert "Update low prec codegen for div/mod (#142350)"
This reverts commit 135a2d44830b2de1ed6714f52cc6a612406adb6d.

Reverted https://github.com/pytorch/pytorch/pull/142350 on behalf of https://github.com/jeanschmidt due to breaking internal signals ([comment](https://github.com/pytorch/pytorch/pull/142350#issuecomment-2566615835))
2024-12-31 17:35:32 +00:00
5ef0de7615 [MPSInductor] Fix multiple kernel generation (#143998)
At the moment this is done by generating multiple MetalLibraries.

`pytest test/inductor/test_torchinductor.py -k _mps` score is 434 failed, 317 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143998
Approved by: https://github.com/jansel, https://github.com/ruidazeng
ghstack dependencies: #143948, #143949, #143973, #143977
2024-12-31 13:51:50 +00:00
f0f09bb3c2 [MPSInductor] Implement minimum and maximum ops (#143977)
By calling `metal::min` and `metal::max` respectively, with arguments
typecast to a common type to avoid ambiguous-call errors

TODO: Implement NaN propagation for both eager and compile, see https://github.com/pytorch/pytorch/issues/143976

`pytest test/inductor/test_torchinductor.py -k _mps` score is 460 failed, 291 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143977
Approved by: https://github.com/jansel
ghstack dependencies: #143948, #143949, #143973
2024-12-31 13:51:50 +00:00
01034e963c [AOTI] Not use AOTI_TORCH_CHECK in non AOTI mode. (#143970)
Fix #143967

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143970
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-12-31 06:28:32 +00:00
a2753e376b [Inductor] Support tiling reduction dimensions (#137243)
Fixes #134277 and https://github.com/pytorch/pytorch/issues/142317.

Sub-PRs containing refactors from this one:
 - https://github.com/pytorch/pytorch/pull/141733
 - https://github.com/pytorch/pytorch/pull/141738
 - https://github.com/pytorch/pytorch/pull/141751 (based off the former)
 - https://github.com/pytorch/pytorch/pull/142249
 - https://github.com/pytorch/pytorch/pull/142020
 - https://github.com/pytorch/pytorch/pull/143135

 These refactor PRs should land before the main one.

# Feature

*Note: to minimize risk, multi-dimensional reductions are gated by the flag `config.triton.tile_reductions`, which defaults to False.*

Instead of having a single reduction dimension called `"r"`, we can now support 2D reductions with `"r0_"` and `"r1_"` dimensions. 2D reductions generate two nested loops, with different block pointer advancements in each loop body. Most of the implementation is generic to ND reductions, but for now the tiling algorithm sets a hard limit at 2D.

Here's an example of a 2D persistent reduction kernel:
```
@triton.jit
def triton_per_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, r1_numel, XBLOCK : tl.constexpr):
    xnumel = 1
    r0_numel = 15
    R0_BLOCK: tl.constexpr = 16
    r1_numel = 15
    R1_BLOCK: tl.constexpr = 16
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None]
    xmask = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], True, tl.int1)
    r0_index = tl.arange(0, R0_BLOCK)[None, :, None]
    r0_offset = 0
    r0_mask = r0_index < r0_numel
    r1_index = tl.arange(0, R1_BLOCK)[None, None, :]
    r1_offset = 0
    r1_mask = r1_index < r1_numel
    rnumel = r0_numel * r1_numel
    RBLOCK: tl.constexpr = R0_BLOCK*R1_BLOCK
    roffset = r1_offset + (r0_offset*r1_numel)
    rindex = r1_index + (r0_index*r1_numel)
    r0_0 = r0_index
    r1_1 = r1_index
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[15, 15], strides=[30, 1], block_shape=[R0_BLOCK, R1_BLOCK], order=[1, 0], offsets=[r0_offset, r1_offset]), boundary_check=[0, 1], padding_option='zero')[None, :, :]
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK, R1_BLOCK])
    tmp3 = tl.where(r0_mask & r1_mask, tmp1, 0)
    tmp4 = tl.reshape(tmp3, [XBLOCK, RBLOCK])
    tmp5 = tl.sum(tmp4, 1)[:, None, None]
    tl.store(out_ptr0 + (tl.full([XBLOCK, 1, 1], 0, tl.int32)), tmp5, None)
''', device_str='cuda')
```

There are a few main differences between this kernel and what Inductor would generate without this PR.
 - Instead of an `r`/`RBLOCK` dimension, we have two reduction dimensions: `r0_`/`R0_BLOCK` and `r1_`/`R1_BLOCK`.
 - There are special size and indexing variables for reductions, which don't directly correspond to any kernel dimension. (`rindex`, `rnumel`, `RBLOCK`, and `roffset`.) These collapse N-D reduction sizes and indices into 1D. This simplifies the codegen for reductions, which sometimes want to access linear indices instead of N-dimensional ones. Doing things this way allows us to generate N-D loads and stores, but access this data as if it were 1D, minimizing the blast radius of this PR. Although this makes the code more verbose, it shouldn't have a perf impact because the Triton compiler eliminates dead code.
 - We generate the line `tmp4 = tl.reshape(tmp3, [XBLOCK, RBLOCK])` before performing the actual reduction. This reshapes N reduction dimensions into 1D. This allows us to reduce over all N dimensions at once, simplifying the codegen and allowing the Triton compiler to decide the order of processing under the hood.

Here's an example of a looped reduction:
```
@triton.jit
def triton_red_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, r1_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr, R1_BLOCK : tl.constexpr):
    xnumel = 3
    r0_numel = 43
    r1_numel = 129
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None]
    xmask = xindex < xnumel
    r0_base = tl.arange(0, R0_BLOCK)[None, :, None]
    r1_base = tl.arange(0, R1_BLOCK)[None, None, :]
    rnumel = r0_numel * r1_numel
    RBLOCK: tl.constexpr = R0_BLOCK*R1_BLOCK
    rbase = r1_base + (r0_base*r1_numel)
    x0 = xindex
    block_ptr0 = tl.make_block_ptr(in_ptr0, shape=[3, 43, 129], strides=[11094, 258, 1], block_shape=[XBLOCK, R0_BLOCK, R1_BLOCK], order=[2, 1, 0], offsets=[xoffset, 0, 0])
    _tmp2 = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], 0, tl.float32)
    for r0_offset in range(0, r0_numel, R0_BLOCK):
        r0_index = r0_offset + r0_base
        r0_mask = r0_index < r0_numel
        for r1_offset in range(0, r1_numel, R1_BLOCK):
            r1_index = r1_offset + r1_base
            r1_mask = r1_index < r1_numel
            roffset = r1_offset + (r0_offset*r1_numel)
            rindex = r1_index + (r0_index*r1_numel)
            r0_1 = r0_index
            r1_2 = r1_index
            tmp0 = tl.load(block_ptr0, boundary_check=[0, 1, 2], padding_option='zero', eviction_policy='evict_first')
            tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK, R1_BLOCK])
            tmp3 = _tmp2 + tmp1
            _tmp2 = tl.where(r0_mask & r1_mask & xmask, tmp3, _tmp2)
            block_ptr0 = tl.advance(block_ptr0, [0, 0, R1_BLOCK])
        block_ptr0 = tl.advance(block_ptr0, [0, R0_BLOCK, (-1)*R1_BLOCK*((128 + R1_BLOCK) // R1_BLOCK)])
    tmp4 = tl.reshape(_tmp2, [XBLOCK, RBLOCK])
    tmp2 = tl.sum(tmp4, 1)[:, None, None]
    tl.store(tl.make_block_ptr(out_ptr0, shape=[3], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.reshape(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```

In addition to the aforementioned changes to the persistent reduction, multidimensional looped reductions have a few more lines of code:
 - They calculate indices inside the loop using `r0_base` and `r1_base`. For compatibility with existing codegen, these are collapsed to the 1D variant `rbase`.
 - Block pointer advancements are more nuanced for multidimensional loops. At the end of each loop body, we emit a `tl.advance` line which not only increments the pointer in its own dimension, but also undoes the cumulative increments of the previous loop level. This is equivalent to the usual practice in nested loops of starting with a fresh iteration variable at each level. Implementing this required refactoring the way we generate pointer advancements into a new `self.pointer_advancements` field of the kernel, which categorizes advancements by dimension.

The biggest difficulty in implementing this feature was that we represented tiling with a tuple like `(5,2)`. In the existing codebase, the compiler can infer that the reduction dimension of `(5,2)` is `2`, since reductions are always the last dimension. This became cumbersome now that we have to support multiple reduction dimensions, so I refactored tiling into a dict like `{"x": 5, "r0_": 2, "r1_": 4}`. This required quite a few code changes, but I don't think it makes the underlying logic much more complex. This will also make it easier to eventually support simultaneous pointwise and reduction tiling, like `{"x": 5, "y": 5, "r0_": 2, "r1_": 4}`. (This is not supported today, but we might want to do it eventually.)

The existing tiling algorithm generalized naturally to support reductions. For pointwise kernels, we tile the pointwise dimensions (`"x"`, `"y"`) as is. For reduction kernels, we never tile the `"x"` dimension, and only tile the reduction dimensions (`"r0_"`, `"r1_"`). Thus we only ever tile pointwise OR reduction dimensions, but not both. In principle it seems possible to support both, but it would likely require changes to the kernel fusion and autotuning logic. I thought it best to keep this PR as minimal as possible since it already touched a lot of different files.

Unfortunately, these changes weren't enough to get block pointers in some seemingly simple test cases. In some tests for `argmax` and `var_mean`, we already collapse reduction dimensions into 1D and generate modular indexing expressions, prior to tiling. So it's not trivial to figure out how to expand the collapsed reduction dimension back to a shape that would simplify the indexing.

To address these cases, this PR adds a new feature to the `config.prefer_nd_tiling` option, which analyzes reads and writes in the kernel, using the same mod-div pattern matching logic that generates block pointers later on. By matching this pattern, we can solve for the tiling splits which *would* simplify the indexing expression, and then use that tiling to eliminate the modular indexing and emit a block pointer. This tiling mode is still off by default, but it's important for certain applications where we need to get as many block pointers as possible.

# Test plan

This touches pretty much anything that uses the Triton and Halide backends, so the existing CI provides good coverage. However, 2D reductions are gated behind a few feature flags like `config.prefer_nd_tiling` and `config.tile_reductions`, so this really only checks that the PR doesn't break 1D reductions.

In addition to existing CI tests, this PR also adds some new tests that specifically stress 2D reductions:

- `test_2d_reduction_odd_shapes`: test 2D reductions with a variety of ops and sizes. This covers the typical persistent and looped reductions.
-  `test_2d_reduce_no_x_dim`: test 2D reductions with no x dimension.
-  `test_2d_welford_reduction`: test 2D welford reductions with block pointers.
- `test_welford_non_block_pointer`: test a 2D welford reduction when block pointer analysis fails.
- `test_reduction_multiple_discontiguous_dims`: test reducing over more than one discontiguous dimension. We won't get a block pointer for this case, since that would require 3D tiling, but we're currently limited to 2D.
- `test_2d_reduction_multi_kernel`: test multi kernel autotuning on a 2D softmax kernel.
- `test_enable_tiled_reductions`: test that `config.triton.tile_reductions` enables/disables this feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137243
Approved by: https://github.com/jansel

Co-authored-by: Yueming Hao <yhao@meta.com>
Co-authored-by: Jason Ansel <jansel@meta.com>
2024-12-31 05:06:46 +00:00
f3e5078c27 [Inductor] Relax size constraints for re-inplacing (#143884)
Current reinplacing requires the input buffer and output buffer to have exactly the same storage size. However, matmul padding may increase the tensor size slightly for better performance, which prevents reinplacing.

This PR changes the size constraints to the following (see the sketch after the list):
- input and output buffer have exactly the same symbolic expression for storage size (i.e., sympy str).
- it's statically known that 0.99 * input_size <= output_size <= input_size
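
A minimal sketch of the relaxed check, assuming byte sizes are statically known integers (hypothetical helper, not the actual Inductor code):

```python
def can_reinplace(input_size: int, output_size: int) -> bool:
    # Reuse the (possibly padded) input buffer when the output fits inside it
    # and is at most 1% smaller, so matmul padding no longer blocks reinplacing.
    return 0.99 * input_size <= output_size <= input_size
```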

### Apply on llm.c
See the reuse of `buf1`.
Before relaxing size requirements on re-inplacing: ([P1703512078](https://www.internalfb.com/phabricator/paste/view/P1703512078))
![1](https://github.com/user-attachments/assets/1472f550-6eb8-4d5c-9965-49bbb20d81a9)

After relaxing size requirements on re-inplacing: ([P1703513053](https://www.internalfb.com/phabricator/paste/view/P1703513053))
![2](https://github.com/user-attachments/assets/416294dd-30eb-4e12-a36c-1aebf9af530b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143884
Approved by: https://github.com/eellison
2024-12-31 03:52:47 +00:00
11bb94b7ea [MPSInductor] Fix index generation for transpose (#143973)
Alas, PythonPrinter would not work here, nor would CppPrinter, so start building a MetalPrinter.

`pytest test/inductor/test_torchinductor.py -k _mps` score is 474 failed, 277 passed, 32 skipped
Before this change:
`pytest test/inductor/test_torchinductor.py -k _mps` reported 506 failed, 245 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143973
Approved by: https://github.com/jansel
ghstack dependencies: #143948, #143949
2024-12-31 02:04:50 +00:00
934eaa503f [Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266)
This PR aims to add functional support for max-autotune on XPU. The current Triton templates and configurations are not well optimized for XPU, so the performance is not ready yet. Also, the `mm_plus_mm` template has accuracy issues in some cases. We will address these issues in the next PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143266
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-12-30 23:51:17 +00:00
2da7fb5320 [inductor] Make generated kernels deterministic (#143951)
`"compile_id"` had slipped into our generated Triton code (in the
metadata), which will defeat caching because the same kernels generated
in a different order would not cache hit with each other.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143951
Approved by: https://github.com/oulgen
2024-12-30 23:35:11 +00:00
d260bc4476 cpp_wrapper: minimize pybind11 dependency (#143772)
Only include the parts of `pybind11` that handle GIL management within `cpp_wrapper`. This dramatically improves compilation times by reducing the number of headers we compile. Improvements on my local system are on the order of 2x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143772
Approved by: https://github.com/Skylion007
2024-12-30 20:41:02 +00:00
1b0d19a2cb Revert "[inductor] Make generated kernels deterministic (#143951)"
This reverts commit 79b354ee37b7d8a06a48ca8cc4e19a3fd006b433.

Reverted https://github.com/pytorch/pytorch/pull/143951 on behalf of https://github.com/wdvr due to failing tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/143951#issuecomment-2564952267))
2024-12-30 02:06:38 +00:00
79b354ee37 [inductor] Make generated kernels deterministic (#143951)
`"compile_id"` had slipped into our generated Triton code (in the
metadata), which will defeat caching because the same kernels generated
in a different order would not cache hit with each other.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143951
Approved by: https://github.com/oulgen
2024-12-29 19:53:33 +00:00
d8c3900d80 [Inductor] Implement primitive Metal compiler (#143893)
Still a work in progress; it only works for element-wise operations. The current implementation can turn something like
```python
def f(x):
  return x[:,::2].sin() + x[:, 1::2].cos()
```
into the following shader
```python
# Topologically Sorted Source Nodes: [sin, cos, add], Original ATen: [aten.sin, aten.cos, aten.add]
# Source node to ATen node mapping:
#   add => add
#   cos => cos
#   sin => sin
# Graph fragment:
#   %sin : [num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%slice_2,), kwargs = {})
#   %cos : [num_users=1] = call_function[target=torch.ops.aten.cos.default](args = (%slice_4,), kwargs = {})
#   %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%sin, %cos), kwargs = {})
mps_lib = torch.mps._compile_shader("""
    kernel void kernel_0(
        device float* out_ptr0,
        constant float* in_ptr0,
        uint xindex [[thread_position_in_grid]]
    ) {
        int x0 = xindex;
        auto tmp0 = in_ptr0[2*x0];
        auto tmp1 = metal::precise::sin(tmp0);
        auto tmp2 = in_ptr0[2*x0 + 1];
        auto tmp3 = metal::precise::cos(tmp2);
        auto tmp4 = tmp1 + tmp3;
        out_ptr0[x0] = static_cast<float>(tmp4);
    }
""")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143893
Approved by: https://github.com/jansel
ghstack dependencies: #143891, #143892
2024-12-28 06:58:32 +00:00
74028cfd0c [Inductor][CPP] Fix Data Type issue of frexp (#143746)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/143729. `frexp` has 1 input but 2 output tensors with different data types; the current `deduce_dtype_for_cpp_cse_variable` can't deduce the data type for each output correctly because the output index is missing. In this PR, we set the data type of the cse var in the codegen of `frexp` and avoid it being overridden in the following flow.
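
For context, `frexp`'s two outputs indeed have different dtypes, which is what trips up per-variable dtype deduction; a quick eager check:

```python
import torch

mantissa, exponent = torch.frexp(torch.tensor([4.0, 0.5, 6.0]))
print(mantissa.dtype)  # torch.float32
print(exponent.dtype)  # torch.int32
```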

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_frexp
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143746
Approved by: https://github.com/jgong5
2024-12-28 06:00:13 +00:00
4a7cf0dbff [Inductor] Add MPS device op overrides (#143892)
Mostly a dummy interface, as the MPS backend is currently limited to a single device

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143892
Approved by: https://github.com/jansel
ghstack dependencies: #143891
2024-12-28 02:11:45 +00:00
b54620f40f [CUTLASS] fix bugs: extra data_ptr() call, wrong size symbol name, bias symbol not added (#143528)
A few small things in this PR:
- fixed a bug where `workspace.data_ptr().data_ptr()` showed up
- for SM80 CUTLASS kernels, the symbol size for W.size(1) was never created
- for addmm kernels, the ldc bias symbol never showed up

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143528
Approved by: https://github.com/henrylhtsang
2024-12-27 23:38:18 +00:00
c17d767686 remove allow-untyped-defs from _inductor/codegen/rocm/rocm_template_buffer.py (#143870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143870
Approved by: https://github.com/aorenste, https://github.com/Skylion007
2024-12-27 23:28:51 +00:00
63d6e1f743 remove allow-untyped-defs from _inductor/codegen/aoti_hipify_utils.py (#143916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143916
Approved by: https://github.com/Skylion007
2024-12-27 23:25:37 +00:00
969415885d [inductor][invoke_subgraph] Support None/int as input/output of invoke_subgraph (#139373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139373
Approved by: https://github.com/eellison
2024-12-27 06:46:09 +00:00
d60282c177 remove allow-untyped-defs from _inductor/codegen/cpu_device_op_overrides.py (#143881)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143881
Approved by: https://github.com/aorenste
2024-12-27 04:10:47 +00:00
cc4e70b7c3 Revert "Use absolute path path.resolve() -> path.absolute() (#129409)"
This reverts commit 135c7db99d646b8bd9603bf969d47d3dec5987b1.

Reverted https://github.com/pytorch/pytorch/pull/129409 on behalf of https://github.com/malfet due to need to revert to as dependency of https://github.com/pytorch/pytorch/pull/129374 ([comment](https://github.com/pytorch/pytorch/pull/129409#issuecomment-2562969825))
2024-12-26 17:26:06 +00:00
844e6108f6 Revert "[Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266)"
This reverts commit ad750ae32079020f51f9b7d01237f3ecfa83b6ff.

Reverted https://github.com/pytorch/pytorch/pull/143266 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing some tests in trunk ([comment](https://github.com/pytorch/pytorch/pull/143266#issuecomment-2561303786))
2024-12-24 17:22:57 +00:00
b77406a9ec [BE][CI] bump ruff to 0.8.4 (#143753)
Changes:

1. Bump `ruff` from 0.7.4 to 0.8.4
2. Change `%`-formatted strings to f-string
3. Change parameters with the `__` prefix to positional-only parameters using the `/` separator in function signatures (see the sketch below).
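
An illustrative before/after for change 3 (not taken from the PR's diff); the `__` prefix is only a convention, while the `/` separator (PEP 570) makes the parameters positional-only at the language level:

```python
# before: double-underscore prefix marks parameters positional-only by convention
def clamp(__value, __low, __high):
    return max(__low, min(__value, __high))

# after: the "/" separator enforces positional-only parameters
def clamp(value, low, high, /):
    return max(low, min(value, high))
```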

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143753
Approved by: https://github.com/Skylion007
2024-12-24 12:24:10 +00:00
135c7db99d Use absolute path path.resolve() -> path.absolute() (#129409)
Changes:

1. Always explicit `.absolute()`: `Path(__file__)` -> `Path(__file__).absolute()`
2. Replace `path.resolve()` with `path.absolute()` if the code is resolving the PyTorch repo root directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129409
Approved by: https://github.com/albanD
2024-12-24 08:33:08 +00:00
ad750ae320 [Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266)
This PR aims to add functional support for max-autotune on XPU. The current Triton templates and configurations are not well optimized for XPU, so the performance is not ready yet. Also, the `mm_plus_mm` template has accuracy issues in some cases. We will address these issues in the next PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143266
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-12-24 05:42:36 +00:00
cyy 1feae27ed6 [16/N] Fix extra warnings brought by clang-tidy-17 (#143714)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143714
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-12-24 03:29:38 +00:00
434e0c2104 Inductor Cutlass backend: Eliminate unused code. (#143723)
Summary: Eliminates an unused file and some smaller unused code fragments from the inductor cutlass codebase.

Test Plan: CI

Differential Revision: D67579837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143723
Approved by: https://github.com/ColinPeppler
2024-12-23 09:35:03 +00:00
cyy 09c950cc87 Remove unused <ATen/core/Array.h> inclusion (#143701)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143701
Approved by: https://github.com/albanD
2024-12-22 14:30:11 +00:00
f1cbf4b1b5 Enable ruff's unused variable checking everywhere in pytorch (#136965)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136965
Approved by: https://github.com/cyyever, https://github.com/albanD
2024-12-22 02:33:11 +00:00
607884c9af [Inductor][CPP] Fix bitwise shift with corner inputs (#143635)
**Summary**
Fix issues https://github.com/pytorch/pytorch/issues/143555 and https://github.com/pytorch/pytorch/issues/143566; we align the implementation with Eager (29b586bbad/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp (L501)) at these corner inputs.

**Test Plan**
```
python test/inductor/test_cpu_repro.py -k test_bitwise_shift_corner_inputs
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143635
Approved by: https://github.com/jgong5
2024-12-20 13:47:40 +00:00