Commit Graph

73 Commits

Author SHA1 Message Date
9aedb3cd87 [AOTI-FX] Support registering custom FX backends (#162317)
# Feature
Currently, `torch._inductor.compile_aot` always uses the `WrapperFxCodegen` class. In contrast, Python and C++ codegen allow users to register custom backends. This PR brings that feature to FX codegen.
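
As a rough illustration, a custom FX backend could look something like the sketch below. The module path, the `compile_graph` override point, and the registration call are assumptions for illustration (mirroring how Python/C++ backends register via `register_backend_for_device`), not necessarily the exact API added here.

```python
# Hedged sketch only: module path, override point, and registration hook are assumptions.
from torch._inductor.codegen.wrapper_fxir import WrapperFxCodegen


class MyWrapperFxCodegen(WrapperFxCodegen):
    """Hypothetical custom FX backend that inspects the generated wrapper GraphModule."""

    def compile_graph(self, gm):
        gm.graph.print_tabular()  # e.g. dump the wrapper FX graph before compiling it
        return gm.forward         # hand back a callable


# Hypothetical registration, by analogy with the Python/C++ codegen backends:
# from torch._inductor.codegen.common import register_backend_for_device
# register_backend_for_device("cuda", ..., device_wrapper_codegen=MyWrapperFxCodegen)
```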

# Test plan
Added a CI test registering a custom FX backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162317
Approved by: https://github.com/jansel
2025-09-06 07:32:03 +00:00
30edac5da6 Updates to CuTe DSL template renderer (#161117)
# Summary
This adds a few more render functions available to template writers, specifically `get_output` and `modification`. The reasons why will become clearer in the next PR in this stack.

<img width="1645" height="364" alt="Screenshot 2025-08-21 at 1 48 50 PM" src="https://github.com/user-attachments/assets/2d508fda-4273-43ef-9edf-086e592e9249" />

The majority of the new code is in the OpOverrides for the CuTe DSL. It is a lot to test, and most of the actual testing I have been doing is via score_mods passed to the flash_attention template at the next layer of this stack.

Below is a bunch of score mods that Claude and I came up with, which exercise the actual ops.
```python
import math

import torch

def causal_mask(score, b, h, q_idx, kv_idx):
    """Causal attention mask."""
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

def relative_bias(score, b, h, token_q, token_kv):
    """Relative position bias."""
    return score + torch.abs(token_q - token_kv)

def relative_bias_v2(score, b, h, token_q, token_kv):
    """Relative position bias with factor of 2."""
    return score + 2 * torch.abs(token_q - token_kv)

def times_two(score, b, h, q_idx, kv_idx):
    """Simple score modification that doubles the score."""
    return score * 2

def alibi_bias(score, b, h, q_idx, kv_idx):
    """ALiBi (Attention with Linear Biases) - used in some modern models."""
    # Different slopes for different heads
    slope = 2 ** (-8 * (h + 1) / 8)  # Simplified version
    return score - slope * torch.abs(q_idx - kv_idx)

def sliding_window(score, b, h, q_idx, kv_idx, window_size=256):
    """Sliding window attention - only attend to nearby tokens."""
    return torch.where(
        torch.abs(q_idx - kv_idx) <= window_size,
        score,
        float("-inf")
    )

def block_diagonal(score, b, h, q_idx, kv_idx, block_size=64):
    """Block diagonal attention pattern."""
    q_block = q_idx // block_size
    kv_block = kv_idx // block_size
    return torch.where(q_block == kv_block, score, float("-inf"))

def additive_bias(score, b, h, q_idx, kv_idx):
    """Test simple addition with position-based bias."""
    return score + (q_idx + kv_idx) * 0.01

def multiplicative_decay(score, b, h, q_idx, kv_idx):
    """Test multiplication with distance-based decay."""
    distance = torch.abs(q_idx - kv_idx)
    return score * torch.exp(-0.1 * distance)

def sine_wave_bias(score, b, h, q_idx, kv_idx):
    """Test trigonometric functions."""
    return score + 0.1 * torch.sin(2 * math.pi * (q_idx - kv_idx) / 64)

def log_distance_penalty(score, b, h, q_idx, kv_idx):
    """Test logarithmic operations."""
    distance = torch.abs(q_idx - kv_idx).float()
    return score - torch.log(1 + distance)

def alternating_mask(score, b, h, q_idx, kv_idx):
    """Test with alternating pattern - good for branch prediction."""
    return torch.where((q_idx + kv_idx) % 2 == 0, score, float("-inf"))

def head_specific_pattern(score, b, h, q_idx, kv_idx):
    """Different behavior per attention head."""
    even_head = h % 2 == 0
    causal = q_idx >= kv_idx
    return torch.where(even_head & causal, score, float("-inf"))

def sparse_strided(score, b, h, q_idx, kv_idx, stride=4):
    """Sparse attention with strided pattern."""
    return torch.where(
        (kv_idx % stride == 0) | (q_idx == kv_idx),
        score,
        float("-inf")
    )

def causal_with_global(score, b, h, q_idx, kv_idx):
    """Causal mask but first few tokens are globally attended."""
    is_causal = q_idx >= kv_idx
    is_global = kv_idx < 4
    return torch.where(is_causal | is_global, score, float("-inf"))

def dilated_attention(score, b, h, q_idx, kv_idx, dilation_rate=2):
    """Dilated attention pattern - exponentially increasing gaps."""
    distance = torch.abs(q_idx - kv_idx)
    is_attended = (distance == 0) | ((distance > 0) & ((distance & (distance - 1)) == 0))
    return torch.where(is_attended, score, float("-inf"))

```

Example outputs:
```
[Test Suite]
Config: batch=4, heads=32, seq_q=8192, seq_kv=8192, dim=128

[Test 1: none]
[No score_mod, flash='enabled'] Found flash_attncute: True
[No score_mod, flash='disabled'] Found flash_attncute: False
✓ Outputs match between flash enabled/disabled
✓ Output matches eager SDPA (rtol=0.001, atol=0.001)

[Test 2: causal]
[With score_mod, flash='enabled'] Found flash_attncute: True
[With score_mod, flash='disabled'] Found flash_attncute: False
✗ Outputs differ between flash modes: Tensor-likes are not close!

Mismatched elements: 17879 / 134217728 (0.0%)
Greatest absolute difference: 0.0078125 at index (0, 15, 15, 60) (up to 0.001 allowed)
Greatest relative difference: 2.5 at index (3, 22, 153, 126) (up to 0.001 allowed)

[Test 3: rel_bias]
[With score_mod, flash='enabled'] Found flash_attncute: True
[With score_mod, flash='disabled'] Found flash_attncute: False
✗ Outputs differ between flash modes: Tensor-likes are not close!

Mismatched elements: 12836 / 134217728 (0.0%)
Greatest absolute difference: 0.015625 at index (0, 3, 2775, 84) (up to 0.001 allowed)
Greatest relative difference: 11.8125 at index (3, 28, 4095, 76) (up to 0.001 allowed)

[Test 4: rel_bias_v2]
```

These runs are in bfloat16, and there are no major differences. The list of pointwise ops here isn't exhaustive, but it gives fairly good coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161117
Approved by: https://github.com/mlazos
2025-08-27 23:01:31 +00:00
014b98dd09 Revert "Add inductor backend to device interface; make minifier_tests more device agnostic (#151314)"
This reverts commit 77bc959fe122bfd131e339ca36cab445a1860806.

Reverted https://github.com/pytorch/pytorch/pull/151314 on behalf of https://github.com/atalman due to sorry change is faling internally ([comment](https://github.com/pytorch/pytorch/pull/151314#issuecomment-3229774015))
2025-08-27 21:21:19 +00:00
38ed57d446 Revert "Updates to CuTe DSL template renderer (#161117)"
This reverts commit 1750cc80374a9dd22fc26701c0602ae11a62baf0.

Reverted https://github.com/pytorch/pytorch/pull/161117 on behalf of https://github.com/atalman due to will need to revert to unblock revert of https://github.com/pytorch/pytorch/pull/151314 ([comment](https://github.com/pytorch/pytorch/pull/161117#issuecomment-3229754295))
2025-08-27 21:17:25 +00:00
1750cc8037 Updates to CuTe DSL template renderer (#161117)
# Summary
This adds a few more render functions available to template writers, specifically `get_output` and `modification`. The reasons why will become clearer in the next PR in this stack.

<img width="1645" height="364" alt="Screenshot 2025-08-21 at 1 48 50 PM" src="https://github.com/user-attachments/assets/2d508fda-4273-43ef-9edf-086e592e9249" />

The majority of the new code is in the OpOverrides for the CuTe DSL. It is a lot to test, and most of the actual testing I have been doing is via score_mods passed to the flash_attention template at the next layer of this stack.

Below is a bunch of score mods that Claude and I came up with, which exercise the actual ops.
```python
import math

import torch

def causal_mask(score, b, h, q_idx, kv_idx):
    """Causal attention mask."""
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

def relative_bias(score, b, h, token_q, token_kv):
    """Relative position bias."""
    return score + torch.abs(token_q - token_kv)

def relative_bias_v2(score, b, h, token_q, token_kv):
    """Relative position bias with factor of 2."""
    return score + 2 * torch.abs(token_q - token_kv)

def times_two(score, b, h, q_idx, kv_idx):
    """Simple score modification that doubles the score."""
    return score * 2

def alibi_bias(score, b, h, q_idx, kv_idx):
    """ALiBi (Attention with Linear Biases) - used in some modern models."""
    # Different slopes for different heads
    slope = 2 ** (-8 * (h + 1) / 8)  # Simplified version
    return score - slope * torch.abs(q_idx - kv_idx)

def sliding_window(score, b, h, q_idx, kv_idx, window_size=256):
    """Sliding window attention - only attend to nearby tokens."""
    return torch.where(
        torch.abs(q_idx - kv_idx) <= window_size,
        score,
        float("-inf")
    )

def block_diagonal(score, b, h, q_idx, kv_idx, block_size=64):
    """Block diagonal attention pattern."""
    q_block = q_idx // block_size
    kv_block = kv_idx // block_size
    return torch.where(q_block == kv_block, score, float("-inf"))

def additive_bias(score, b, h, q_idx, kv_idx):
    """Test simple addition with position-based bias."""
    return score + (q_idx + kv_idx) * 0.01

def multiplicative_decay(score, b, h, q_idx, kv_idx):
    """Test multiplication with distance-based decay."""
    distance = torch.abs(q_idx - kv_idx)
    return score * torch.exp(-0.1 * distance)

def sine_wave_bias(score, b, h, q_idx, kv_idx):
    """Test trigonometric functions."""
    return score + 0.1 * torch.sin(2 * math.pi * (q_idx - kv_idx) / 64)

def log_distance_penalty(score, b, h, q_idx, kv_idx):
    """Test logarithmic operations."""
    distance = torch.abs(q_idx - kv_idx).float()
    return score - torch.log(1 + distance)

def alternating_mask(score, b, h, q_idx, kv_idx):
    """Test with alternating pattern - good for branch prediction."""
    return torch.where((q_idx + kv_idx) % 2 == 0, score, float("-inf"))

def head_specific_pattern(score, b, h, q_idx, kv_idx):
    """Different behavior per attention head."""
    even_head = h % 2 == 0
    causal = q_idx >= kv_idx
    return torch.where(even_head & causal, score, float("-inf"))

def sparse_strided(score, b, h, q_idx, kv_idx, stride=4):
    """Sparse attention with strided pattern."""
    return torch.where(
        (kv_idx % stride == 0) | (q_idx == kv_idx),
        score,
        float("-inf")
    )

def causal_with_global(score, b, h, q_idx, kv_idx):
    """Causal mask but first few tokens are globally attended."""
    is_causal = q_idx >= kv_idx
    is_global = kv_idx < 4
    return torch.where(is_causal | is_global, score, float("-inf"))

def dilated_attention(score, b, h, q_idx, kv_idx, dilation_rate=2):
    """Dilated attention pattern - exponentially increasing gaps."""
    distance = torch.abs(q_idx - kv_idx)
    is_attended = (distance == 0) | ((distance > 0) & ((distance & (distance - 1)) == 0))
    return torch.where(is_attended, score, float("-inf"))

```

Example outputs:
```
[Test Suite]
Config: batch=4, heads=32, seq_q=8192, seq_kv=8192, dim=128

[Test 1: none]
[No score_mod, flash='enabled'] Found flash_attncute: True
[No score_mod, flash='disabled'] Found flash_attncute: False
✓ Outputs match between flash enabled/disabled
✓ Output matches eager SDPA (rtol=0.001, atol=0.001)

[Test 2: causal]
[With score_mod, flash='enabled'] Found flash_attncute: True
[With score_mod, flash='disabled'] Found flash_attncute: False
✗ Outputs differ between flash modes: Tensor-likes are not close!

Mismatched elements: 17879 / 134217728 (0.0%)
Greatest absolute difference: 0.0078125 at index (0, 15, 15, 60) (up to 0.001 allowed)
Greatest relative difference: 2.5 at index (3, 22, 153, 126) (up to 0.001 allowed)

[Test 3: rel_bias]
[With score_mod, flash='enabled'] Found flash_attncute: True
[With score_mod, flash='disabled'] Found flash_attncute: False
✗ Outputs differ between flash modes: Tensor-likes are not close!

Mismatched elements: 12836 / 134217728 (0.0%)
Greatest absolute difference: 0.015625 at index (0, 3, 2775, 84) (up to 0.001 allowed)
Greatest relative difference: 11.8125 at index (3, 28, 4095, 76) (up to 0.001 allowed)

[Test 4: rel_bias_v2]
```

These runs are in bfloat16, and there are no major differences. The list of pointwise ops here isn't exhaustive, but it gives fairly good coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161117
Approved by: https://github.com/mlazos
2025-08-27 18:39:09 +00:00
77bc959fe1 Add inductor backend to device interface; make minifier_tests more device agnostic (#151314)
This tries to decouple the long-standing assumption that CPU always means C++ codegen and CUDA always means Triton. For now, it keeps things relatively simple by just guarding the relevant code paths more specifically.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151314
Approved by: https://github.com/eellison
2025-08-26 19:40:37 +00:00
50f23ff6f8 rename-HAS_CUDA-to-HAS_CUDA_AND_TRITON (#159883)
Fixes #159399
"Modified torch.testing._internal.inductor_utils and test/inductor"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159883
Approved by: https://github.com/janeyx99
2025-08-08 15:44:52 +00:00
8cb91e20bc Renaming HAS_XPU to HAS_XPU_AND_TRITON (#159908)
This PR follows up on the discussion in #159399 where @Akabbaj and @janeyx99 mentioned renaming HAS_XPU to HAS_XPU_AND_TRITON for consistency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159908
Approved by: https://github.com/janeyx99, https://github.com/guangyey
2025-08-07 11:24:44 +00:00
d3d9bc1c31 [inductor] Allow backends to register their own custom config object (#158254)
An out-of-tree backend can have its own configuration options that the user can set to control Inductor compilation. These config options need to be taken into account when calculating the key used to determine cache hits and misses. This PR allows out-of-tree backends to specify a custom config module of the same type as `torch._inductor.config`, which can be used to control codegen (in addition to the default config) and is included when creating the cache key.
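
For a concrete sense of what such a config module can look like, here is a minimal sketch built with `torch.utils._config_module.install_config_module` (the same helper `torch._inductor.config` uses); the hook that actually hands this module to Inductor is backend-specific and assumed here rather than shown.

```python
# my_backend/config.py -- illustrative out-of-tree backend config (knob names are made up)
import sys

from torch.utils._config_module import install_config_module

# Backend-specific options that should influence Inductor's cache key.
enable_fancy_fusion: bool = True
max_tile_size: int = 128

# Turns this plain module into a ConfigModule (patch/save/load support), so it
# can be hashed into the cache key alongside torch._inductor.config.
install_config_module(sys.modules[__name__])
```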

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158254
Approved by: https://github.com/eellison
2025-07-23 15:56:06 +00:00
cec2977ed2 [BE][6/16] fix typos in torch/ (#156316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156316
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315
2025-06-23 02:57:34 +00:00
3f44fdc03d Revert "[BE][6/16] fix typos in torch/ (#156316)"
This reverts commit b210cf1ea56bcd9f937a2805d9e70d8684d25ee4.

Reverted https://github.com/pytorch/pytorch/pull/156316 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
b210cf1ea5 [BE][6/16] fix typos in torch/ (#156316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156316
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315
2025-06-22 08:43:33 +00:00
ce79056471 Custom FX pass for inductor's backend registration (#154841)
This PR is related to RFC #153532. It extends Inductor's backend registration interface to allow a backend to register its own custom FX passes.
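
As a point of reference, a custom FX pass is just a callable over a `torch.fx.Graph`; the sketch below shows one plugged in through the existing config hook for comparison, while the backend-registration hook this PR adds is analogous (its exact name isn't shown here).

```python
import torch
import torch._inductor.config as inductor_config
import torch.fx as fx


def remove_detach_nodes(graph: fx.Graph) -> None:
    """Illustrative custom FX pass: drop aten.detach nodes from the post-grad graph."""
    for node in list(graph.nodes):
        if node.op == "call_function" and node.target is torch.ops.aten.detach.default:
            node.replace_all_uses_with(node.args[0])
            graph.erase_node(node)
    graph.lint()


# Existing in-tree way to install such a pass, shown only for comparison with
# the backend-level registration this PR enables:
inductor_config.post_grad_custom_post_pass = remove_detach_nodes
```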

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154841
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-06-14 17:29:54 +00:00
0e2013a12d Add helion x pt2 test (#155513)
This kinda just worked out of the box, shockingly. PT2 traced into Helion and emitted it as a user-defined Triton kernel: P1836496774

In the long run, we do not actually want this; rather, we want to create a Helion HOP so we can do fusions, etc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155513
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-06-11 07:08:06 +00:00
79bdafe5b6 Revert "Custom FX pass for inductor's backend registration (#154841)"
This reverts commit e694280d1215caf70f41575f2611bfa26c69ebdb.

Reverted https://github.com/pytorch/pytorch/pull/154841 on behalf of https://github.com/clee2000 due to failing some tests internally D76135706 ([comment](https://github.com/pytorch/pytorch/pull/154841#issuecomment-2956357711))
2025-06-09 16:56:45 +00:00
e694280d12 Custom FX pass for inductor's backend registration (#154841)
This PR is related to RFC #153532. It extends Inductor's backend registration interface to allow a backend to register its own custom FX passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154841
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-06-06 06:49:44 +00:00
423fc671e9 [Cutlass] Support float8_e4m3fn GEMM (#153890)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153890
Approved by: https://github.com/drisspg, https://github.com/eellison
2025-05-22 08:37:33 +00:00
f813d64f54 cpp_wrapper: Fix even more tests (#147225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147225
Approved by: https://github.com/desertfire
ghstack dependencies: #150671, #150672
2025-04-07 14:20:06 +00:00
cf7447ae99 Revert "cpp_wrapper: Fix even more tests (#147225)"
This reverts commit d25acac357ff8663a7787e57e6bc5e69987a8f9a.

Reverted https://github.com/pytorch/pytorch/pull/147225 on behalf of https://github.com/yangw-dev due to broke test internally test/inductor/test_benchmark_fusion ([comment](https://github.com/pytorch/pytorch/pull/147225#issuecomment-2761944564))
2025-03-28 17:07:52 +00:00
d25acac357 cpp_wrapper: Fix even more tests (#147225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147225
Approved by: https://github.com/desertfire
2025-03-27 19:21:03 +00:00
1b373f6cd4 Revert "cpp_wrapper: Fix even more tests (#147225)"
This reverts commit 62d351a35b1bd961afbd09057beec14ff201c41d.

Reverted https://github.com/pytorch/pytorch/pull/147225 on behalf of https://github.com/yangw-dev due to broke [ROCM mi300 test](https://github.com/pytorch/pytorch/actions/runs/14066803692/job/39393110086) in [HUD](https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rocm-mi300%20%2F%20linux-focal-rocm6.3-py3.10%20%2F%20test%20(default%2C%201%2C%206%2C%20linux.rocm.gpu.mi300.2)&mergeLF=true) ([comment](https://github.com/pytorch/pytorch/pull/147225#issuecomment-2752799778))
2025-03-26 00:03:13 +00:00
62d351a35b cpp_wrapper: Fix even more tests (#147225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147225
Approved by: https://github.com/desertfire
ghstack dependencies: #146706
2025-03-25 17:58:40 +00:00
037d7af778 [Inductor UT] Enable PYTORCH_TESTING_DEVICE_ONLY_FOR test case filter for test_torchinductor.py (#149023)
The environment variable PYTORCH_TESTING_DEVICE_ONLY_FOR controls the devices
used in get_desired_device_type_test_bases, so we add RUN_CPU and RUN_GPU gates to
make sure cases are only enabled for the devices specified via PYTORCH_TESTING_DEVICE_ONLY_FOR,
e.g. only enable the GPU cases and skip the CPU cases even when HAS_CPU is true. A rough sketch of the gating follows.
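
A hedged sketch of what that gating amounts to (the real definitions live in the test file and may differ in detail):

```python
import os

_only_for = os.environ.get("PYTORCH_TESTING_DEVICE_ONLY_FOR", "")
_devices = [d for d in _only_for.split(",") if d]

# Illustrative RUN_CPU / RUN_GPU gates keyed off the same env var.
RUN_CPU = not _devices or "cpu" in _devices
RUN_GPU = not _devices or any(d in _devices for d in ("cuda", "xpu"))
```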

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149023
Approved by: https://github.com/jansel, https://github.com/cyyever
2025-03-13 05:15:28 +00:00
9befdf565a [Break XPU][Inductor UT] Set input tensors to corresponding device for test case in test_aot_inductor.py (#145248)
Fix #145247

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145248
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #146762
2025-02-14 01:39:11 +00:00
a1bfb39a31 [Inductor] Expand Identity ops prior to block pattern matching (#146000)
# Feature

Inductor sometimes uses `Identity` functions to group various terms of an expression. While this is convenient in some scenarios, it can frustrate pattern matching. For example, when we're matching an indexing expression to tell if it can be represented as a block pointer, that analysis should be invariant to `Identity` wrappers.

This PR adds a few features to achieve this invariance.
 - Create a new expansion mode `expr.expand(identity=True)`, which removes all `Identity` functions from the expression (see the sketch after this list).
 - Preprocess the expression with this expansion prior to pattern matching.
 - Bonus: create a new test utility function called `dummy_graph()`, which creates a simple `GraphLowering`. This is useful for testing the pattern matcher, as we need to initialize `V.graph` before we can access `V.graph.sizevars`.
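
The sketch below illustrates the expansion-hint mechanism behind `expr.expand(identity=True)`; it defines a local stand-in for Inductor's `Identity` wrapper (the real one lives in `torch.utils._sympy.functions`) so the snippet is self-contained.

```python
import sympy


class Identity(sympy.Function):
    """Local stand-in for Inductor's Identity wrapper."""

    @classmethod
    def eval(cls, arg):
        return None  # stay unevaluated, i.e. keep wrapping the argument

    def _eval_expand_identity(self, **hints):
        # sympy's expand() dispatches custom hints to _eval_expand_<hint>,
        # so expand(identity=True) strips the wrapper.
        return self.args[0]


x = sympy.Symbol("x")
wrapped = 8 * Identity(x) + Identity(x % 4)
print(wrapped.expand(identity=True))  # -> 8*x + Mod(x, 4)
```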

# Test plan
This PR adds a few new unit tests:
 - Added a unit test specifically for `expr.expand(identity=True)`.
 - Added a new unit test module for the block pattern matcher. Tested that we can correctly match some example patterns containing Identity ops.

I originally intended to add an end to end test compiling pointwise cat, and mapping the corresponding memory accesses to block pointers. However, it looks like that will take more work, since the [relevant code path](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/triton.py#L1306) disables block pointer analysis. It might be better to defer that to a future PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146000
Approved by: https://github.com/eellison, https://github.com/jansel
2025-02-08 18:11:53 +00:00
bc0191802f [inductor] add size-asserts for fallback ops (#145904)
Fix https://github.com/pytorch/pytorch/issues/144717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145904
Approved by: https://github.com/jansel
2025-02-07 18:44:32 +00:00
6a985d8b2e Make inductor_utils.requires_gpu accept MPS (#145156)
Not yet ready to set HAS_GPU to true, but we can unskip tests that require a GPU.
(Noticed while running test_mps_basics.py that `test_scalar_cpu_tensor_arg` is getting skipped.)

- Replace `GPU_TYPE` with `self.device` in `test_custom_op_fixed_layout_sequential`, `test_inductor_layout_optimization_input_mutations`, and `test_mutable_custom_op_fixed_layout2`; otherwise the GPU tests just run for the _cpu suffixes.
- Tweak `test_tmp_not_defined_issue3` to work correctly on CPU by defining `test_device` and `test_device_0`.
- Un-XFAIL `test_mutable_custom_op_fixed_layout2_dynamic_shapes`, as it should just work on CPU.
- Add a `skip_if_no_triton` decorator (sketched below) and decorate `test_reduction_config_limit` with it, as the test needs neither CPU nor GPU but rather a Triton backend.
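
A hedged sketch of what such a decorator amounts to (the real one lives in `torch.testing._internal.inductor_utils` and may be implemented differently):

```python
import unittest

from torch.utils._triton import has_triton

# Illustrative: skip unless a Triton backend is available.
skip_if_no_triton = unittest.skipUnless(has_triton(), "requires a Triton backend")


class TestReductionConfigs(unittest.TestCase):
    @skip_if_no_triton
    def test_reduction_config_limit(self):
        ...  # the point here is the gating, not the test body
```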

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145156
Approved by: https://github.com/dcci, https://github.com/Skylion007, https://github.com/jansel
2025-02-06 01:14:36 +00:00
cfbb27462e Revert "[inductor][BE] Enable test_cpu_cpp_wrapper in fbcode (#145373)"
This reverts commit b8087747f5ca7be0d37b1ac85dc0894f6a33e3a3.

Reverted https://github.com/pytorch/pytorch/pull/145373 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145373#issuecomment-2619674197))
2025-01-28 17:46:11 +00:00
b8087747f5 [inductor][BE] Enable test_cpu_cpp_wrapper in fbcode (#145373)
Differential Revision: D68278174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145373
Approved by: https://github.com/Skylion007
2025-01-24 17:59:13 +00:00
b6941d4e42 [inductor] fix autotuning memory usage (#145410)
We use `cpu_tensor.copy_(gpu_tensor)` to clone mutated kernel arguments for autotuning. The purpose is to avoid increasing peak memory due to the clone. But if `gpu_tensor` is not contiguous, this `copy_` needs to allocate a temporary tensor on the GPU to store a contiguous copy of `gpu_tensor`:

6e53588789/aten/src/ATen/native/cuda/Copy.cu (L322-L334)

Here is a standalone script to illustrate this behavior: https://gist.github.com/shunting314/812a848dc67b1d674ae42415a7a462c8 . The script reports 6GB rather than 3GB of peak memory usage.
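
A hedged standalone sketch (not the linked gist) of the behavior described above:

```python
import torch

gpu = torch.randn(16384, 16384, device="cuda")   # ~1 GiB in fp32
non_contig = gpu.t()                              # transposed view, non-contiguous
cpu = torch.empty(non_contig.shape, dtype=non_contig.dtype, pin_memory=True)

torch.cuda.reset_peak_memory_stats()
cpu.copy_(non_contig)   # materializes a contiguous GPU temp, roughly doubling peak for this tensor
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```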

Note that, with all the following efforts
1. donated buffer
2. inplace padding
3. this PR

We save 3GB peak memory (18.6GB -> 15.5GB) for GPT2 model for torch.compile.

The peak memory curve of GPT2 looks like an '...\_M\_...' shape: there are two places where we reach the peak. The donated buffer removes the first peak by computing grad_softmax in place, and inplace padding removes the second peak by not allocating an extra buffer for mm padding.

Before all these optimizations, the peak memory is 18.6GB for GPT2 with torch.compile.
With 1 & 2, the peak memory is
1. 17.7GB with a cold cache
2. 15.5GB with a warm cache (since the autotuning overhead is skipped)

With 1, 2 & 3, we save 3GB of peak memory (18.6GB -> 15.5GB) whether or not autotuning happens.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145410
Approved by: https://github.com/masnesral, https://github.com/jansel
ghstack dependencies: #140249, #145325
2025-01-23 09:34:23 +00:00
e343f46464 [inductor] Refactor is_big_gpu (#142220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142220
Approved by: https://github.com/yanboliang
ghstack dependencies: #142219, #142033, #142222
2024-12-08 18:51:36 +00:00
ad2cc96218 Refactor test_torchinductor_strided_blocks to also support triton CPU (#141587)
This increases test coverage for triton CPU from just test_torchinductor.py to also testing block pointer lowering.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141587
Approved by: https://github.com/jansel
2024-12-05 09:57:08 +00:00
d08bd6d627 Revert "Refactor test_torchinductor_strided_blocks to also support triton CPU (#141587)"
This reverts commit 8a3317cd41d0442d13090932ae5548e7b9fe45bd.

Reverted https://github.com/pytorch/pytorch/pull/141587 on behalf of https://github.com/atalman due to inductor/test_torchinductor_strided_blocks.py::TritonBlockPointerTestGPU::test_expand_broadcast_x_size0_y_size0_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/12072823884/job/33669367764) [HUD commit link](8a3317cd41) ([comment](https://github.com/pytorch/pytorch/pull/141587#issuecomment-2506690095))
2024-11-28 19:41:03 +00:00
8a3317cd41 Refactor test_torchinductor_strided_blocks to also support triton CPU (#141587)
This increases test coverage for triton CPU from just test_torchinductor.py to also testing block pointer lowering.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141587
Approved by: https://github.com/jansel
2024-11-28 16:45:25 +00:00
0dbbcfa7ae [Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 3) (#136947)
[Inductor UT] Generalize Newly introduced inductor UTs for intel GPU
reuse `test/inductor/test_pattern_matcher.py`
reuse `test/inductor/test_snode_runtime.py`
reuse `test/inductor/test_unbacked_symints.py`
fix `test/inductor/test_triton_kernels.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136947
Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/jansel
2024-10-12 13:21:20 +00:00
99eb47fb6d Add CI for Triton CPU backend (#135342)
Where possible, I have marked failing tests (which we intend to fix or triage) as `@xfail_if_triton_cpu`. This will help us track progress of the Triton CPU backend over time. Tests that I don't think we need to address, or that are flaky, have been marked as skips.

Successful CI run: https://github.com/pytorch/pytorch/actions/runs/10822238062/job/30028284549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135342
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/malfet
2024-10-01 20:43:10 +00:00
7fe004f7cf Revert "Add CI for Triton CPU backend (#135342)"
This reverts commit 426580a67db15ec17b2b861a09667bf59927e033.

Reverted https://github.com/pytorch/pytorch/pull/135342 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))
2024-09-16 18:33:33 +00:00
426580a67d Add CI for Triton CPU backend (#135342)
Where possible, I have marked failing tests (which we intend to fix or triage) as `@xfail_if_triton_cpu`. This will help us track progress of the Triton CPU backend over time. Tests that I don't think we need to address, or that are flaky, have been marked as skips.

Successful CI run: https://github.com/pytorch/pytorch/actions/runs/10822238062/job/30028284549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135342
Approved by: https://github.com/jansel
ghstack dependencies: #133408
2024-09-14 21:45:19 +00:00
bb3c2408f4 [inductor][test] in test_unbacked_symints, replace inductor's skipCUDAIf with common device type's skipcudaif (#133936)
Differential Revision: D61506212

Use `skipCUDAIf` from `torch.testing._internal.common_device_type` if we create the test class with `instantiate_device_type_tests`.

`instantiate_device_type_tests` makes sure the class has a device_type attribute, which works with `skipCUDAIf` from `torch.testing._internal.common_device_type`.
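
For reference, the pattern described above looks roughly like this (minimal sketch; the real test file has more going on):

```python
from torch.testing._internal.common_device_type import (
    instantiate_device_type_tests,
    skipCUDAIf,
)
from torch.testing._internal.common_utils import TestCase, run_tests

SOME_CONDITION = False  # illustrative skip condition


class TestUnbackedSymints(TestCase):
    @skipCUDAIf(SOME_CONDITION, "skipped on CUDA when the condition holds")
    def test_something(self, device):
        # instantiate_device_type_tests gives each generated class a
        # device_type attribute, which is what skipCUDAIf keys off.
        self.assertIsInstance(device, str)


instantiate_device_type_tests(TestUnbackedSymints, globals())

if __name__ == "__main__":
    run_tests()
```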

Also skipping test_vertical_pointwise_reduction_fusion for the CPU test class, since the test expects CUDA.

FAILED [0.0026s] test/inductor/test_unbacked_symints.py::TestUnbackedSymintsCPU::test_vertical_pointwise_reduction_fusion_cpu - AttributeError: 'TestUnbackedSymintsCPU' object has no attribute 'device'

repro:
```
CUDA_VISIBLE_DEVICES="" pytest test/inductor/test_unbacked_symints.py -k cpu -v
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133936
Approved by: https://github.com/ColinPeppler, https://github.com/desertfire
2024-09-05 16:40:14 +00:00
c2e2602ecd [Inductor] Move GPU_TYPE (the runtime available gpu type, cuda or xpu) from (#132740)
Move GPU_TYPE (the runtime-available GPU type, cuda or xpu) from `testing/_internal/inductor_utils.py` to `_inductor/utils.py`, so that we can use it in Inductor rather than only in test cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132740
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-08-21 11:18:00 +00:00
7bd0732cbd Fix flaky internal mixed_mm tests (#133015)
This PR fixes flaky internal tests:
- The AutoHeuristic test was sometimes failing because it required autotuning to happen for mixed_mm, which didn't end up happening when there was an FX graph cache hit.
- The tests inside pattern_matcher failed because in some cases pad_mm decided to pad, which made the mixed_mm pattern no longer match (instead of cast -> mm, it was cast -> pad -> mm); the tests also fail when is_big_gpu is false (which I haven't found an explanation for).

Differential Revision: [D60972176](https://our.internmc.facebook.com/intern/diff/D60972176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133015
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-08-08 20:32:12 +00:00
50ef099ad0 Learn a heuristic to decide whether to pad before mm (#128643)
This PR introduces AutoHeuristic, a framework to collect results from autotuning, learn a heuristic as a machine learning model (a regression tree), and then ship the learned heuristic by generating the regression tree to code.

The heuristics have been learned on artificial/random data that has been collected with the `gen_data_pad_mm.py` script. The `gen_pad_mm_a100.sh` scripts can then be used to learn a heuristic and generate it to code.

The best model is decided by doing a grid search over various values for `max_depth` and `min_samples_leaf` and choosing the model with the highest number of correct predictions on the validation set.

The heuristic can return "unsure" which means that it is not sure which choice is the best choice and as a result autotuning will happen.

On A100 only tensors where each dimension is >= 512 are considered. For smaller tensors the heuristics that I learned returned "unsure" too often.

The results for randomly generated data and huggingface look as follows:
`max_wrong_speedup` is max(`wrong_speedups`) where `wrong_speedups` contains all the speedups one could have achieved for those examples where the heuristic made a wrong choice, i.e. a `max_wrong_speedup` of 1.37 means that the heuristic selected a choice, but the other choice would have been 1.37x faster. `gman_wrong_speedup` is the geomean of `wrong_speedups`.

The heuristic is learned as a regression tree, that returns higher values for better choices. The threshold decides how much better the better choice has to be for it to be returned, i.e. on A100 if the better choice is less than 1.702530x better than the other choice, "unsure" will be returned. This threshold is determined using the validation set.
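
To make the shape of a shipped heuristic concrete, here is a purely hypothetical sketch (the splits, scores, and names are made up; only the threshold/"unsure" logic mirrors the description above, and the real heuristics are generated into code by AutoHeuristic):

```python
THRESHOLD = 1.702530  # A100 threshold from the table below


def pad_mm_heuristic(m: int, k: int, n: int, mat1_aligned: bool) -> str:
    """Toy stand-in for a code-generated regression tree: higher score = better choice."""

    def score(pad: bool) -> float:
        # made-up splits standing in for the learned tree
        if k % 64 == 0 and mat1_aligned:
            return 0.4 if pad else 1.0
        if m >= 4096:
            return 1.8 if pad else 1.0
        return 1.2 if pad else 1.0

    pad_score, nopad_score = score(True), score(False)
    better, worse = max(pad_score, nopad_score), min(pad_score, nopad_score)
    if better / worse < THRESHOLD:
        return "unsure"  # fall back to autotuning
    return "pad" if pad_score > nopad_score else "no_pad"


print(pad_mm_heuristic(8192, 8192, 512, mat1_aligned=False))  # -> "pad" under these made-up splits
```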

A100
```
       max_depth  min_samples_leaf dataset  correct  wrong  unsure  total  max_wrong_speedup  gman_wrong_speedup  threshold
15         5.0                10     train     2730      4    3023   5757           1.372220            1.193873   1.702530
16         5.0                10       val      878      0    1042   1920                NaN                 NaN   1.702530
17         5.0                10      test      925      2     993   1920           1.741708            1.354954   1.702530
18         5.0                10  hf-train       14      0      22     36                NaN                 NaN   1.702530
19         5.0                10    hf-inf        7      0       1      8                NaN                 NaN   1.702530
```

The numbers for huggingface only include tensors where each dim is >=512. If all tensors had been included, the number of matmuls where at least one dimension is unaligned would have been:
A100 hf-train: 60
A100 hf-inf: 10

## Results on running huggingface locally
This only includes models where the learned heuristic made at least one decision. For the examples here, it takes around 0.25-0.3 seconds to perform autotuning for the padded and unpadded version, so each decision that the heuristic makes saves around 0.25-0.3 seconds.
#pad_mm_autotuning is the number of times autotuning happened in pad_mm and #heuristic_made_decision is the number of times the heuristic made a decision (i.e. it didn't return "unsure").

I ran huggingface locally, each model 5 times and took the median speedup and compilation_latency.
Results on huggingface training
```
                          name speedup_heuristic speedup_baseline  speedup_diff compilation_latency_heuristic compilation_latency_baseline  compilation_latency_diff  comp_latency_reduction%  #pad_mm_autotuning  #heuristic_made_decision
               BartForCausalLM   1.19 (+/- 0.00)  1.19 (+/- 0.00)         -0.00              40.33 (+/- 1.13)             40.95 (+/- 0.78)                     -0.62                     1.52                   3                         2
  BartForConditionalGeneration   1.53 (+/- 0.06)  1.47 (+/- 0.05)          0.06              81.93 (+/- 5.20)             82.23 (+/- 1.92)                     -0.30                     0.36                   3                         1
    BlenderbotSmallForCausalLM   1.86 (+/- 0.04)  1.86 (+/- 0.00)          0.00              36.76 (+/- 0.49)             37.62 (+/- 1.33)                     -0.87                     2.31                   3                         2
                     CamemBert   2.36 (+/- 0.01)  2.35 (+/- 0.01)          0.01              97.60 (+/- 1.91)             98.69 (+/- 1.35)                     -1.09                     1.11                   2                         1
                   DistillGPT2   2.57 (+/- 0.01)  2.57 (+/- 0.01)          0.00              57.33 (+/- 0.77)             58.26 (+/- 1.41)                     -0.93                     1.59                   3                         2
             PLBartForCausalLM   2.07 (+/- 0.01)  2.06 (+/- 0.01)          0.01              32.54 (+/- 0.83)             34.65 (+/- 0.71)                     -2.11                     6.10                   3                         2
PLBartForConditionalGeneration   1.87 (+/- 0.00)  1.88 (+/- 0.00)         -0.01              58.45 (+/- 1.24)             58.95 (+/- 1.92)                     -0.50                     0.85                   3                         1
            RobertaForCausalLM   2.39 (+/- 0.01)  2.40 (+/- 0.01)         -0.01              97.38 (+/- 1.52)             97.69 (+/- 1.18)                     -0.31                     0.32                   2                         1
              TrOCRForCausalLM   1.70 (+/- 0.00)  1.70 (+/- 0.00)         -0.00              44.79 (+/- 1.33)             45.25 (+/- 1.08)                     -0.46                     1.01                   3                         2

Mean difference in speedup: 0.01
Mean compilation latency saved: -0.80s
Mean compilation latency reduction: 1.68%
```

Results on huggingface inference
```
                          name speedup_heuristic speedup_baseline  speedup_diff compilation_latency_heuristic compilation_latency_baseline  compilation_latency_diff  comp_latency_reduction%  #pad_mm_autotuning  #heuristic_made_decision
               BartForCausalLM   1.11 (+/- 0.00)  1.11 (+/- 0.00)          0.00              19.02 (+/- 0.28)             19.40 (+/- 0.35)                     -0.38                     1.95                   3                         2
  BartForConditionalGeneration   1.26 (+/- 0.01)  1.23 (+/- 0.03)          0.03              36.84 (+/- 0.40)             36.55 (+/- 0.75)                      0.30                    -0.81                   3                         1
    BlenderbotSmallForCausalLM   1.87 (+/- 0.02)  1.87 (+/- 0.01)          0.00              17.53 (+/- 0.31)             18.03 (+/- 0.43)                     -0.49                     2.74                   3                         2
                   DistillGPT2   2.50 (+/- 0.02)  2.50 (+/- 0.01)          0.00              16.16 (+/- 0.29)             16.40 (+/- 0.18)                     -0.24                     1.46                   3                         2
             PLBartForCausalLM   1.93 (+/- 0.01)  1.94 (+/- 0.01)         -0.00              15.30 (+/- 0.22)             16.01 (+/- 0.71)                     -0.71                     4.43                   3                         2
PLBartForConditionalGeneration   1.98 (+/- 0.01)  1.98 (+/- 0.01)          0.00              25.90 (+/- 0.32)             26.58 (+/- 0.62)                     -0.67                     2.53                   3                         1
              TrOCRForCausalLM   1.61 (+/- 0.00)  1.62 (+/- 0.00)         -0.01              21.38 (+/- 0.37)             21.85 (+/- 0.16)                     -0.47                     2.16                   3                         2

Mean difference in speedup: 0.00
Mean compilation latency saved: -0.38s
Mean compilation latency reduction: 2.07%
```

For now, the heuristic can only be applied to decide whether to pad for mm. One could also learn heuristics for bmm and addmm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128643
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-07-15 23:04:06 +00:00
f1ee3589a1 [Inductor] Emit strided block pointer from ModularIndexing and FloorDiv (#127342)
**Summary**

Inductor currently uses modulo and division to compute indices into certain multi-dimensional tensors, such as those arising from row padding. This PR matches on that indexing pattern, replacing it with an N-D block pointer. This should be more efficient than computing indices with division and modulo, and it can easily map to DMAs on non-GPU hardware targets.

Because the 1D block size needs to map to an integer block shape in ND, we need to know that the ND block size evenly divides the size of the iteration range. This PR only generates ND block pointers when it can guarantee that the iteration order and number of elements loaded are unchanged. This means that the number of elements in a slice of the iteration range must either be:
  - Powers of 2. Since Triton block sizes are powers of 2, any integer power of 2 either divides the block size or is greater than it. In the latter case, `CeilDiv(x, y)` rounds up to 1.
  - Multiples of the maximum block size. Since block sizes are powers of 2, the maximum block size is a multiple of every possible block size.

Note that a *slice* of the iteration range does not include the leading dimension. Thus we can support arbitrary leading dimensions like `(5,8)`.
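
A tiny self-contained sketch of the index algebra being matched: for a contiguous `[32, 16, 8]` tensor, the flat index `x` decomposes into the `FloorDiv`/`ModularIndexing` terms `(x // 128, (x // 8) % 16, x % 8)`, which is exactly what an N-D strided block expresses directly.

```python
import torch

t = torch.arange(32 * 16 * 8).reshape(32, 16, 8)
x = torch.arange(t.numel())
# flat indexing == div/mod indexing for a row-major [32, 16, 8] layout
assert torch.equal(t.reshape(-1)[x], t[x // 128, (x // 8) % 16, x % 8])
```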

Feature proposal and discussion: https://github.com/pytorch/pytorch/issues/125077

Example kernel:
```
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4096
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    tmp0 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr0, shape=[32, 16, 8], strides=[1024, 32, 1], block_shape=[32 * (32 <= ((127 + XBLOCK) // 128)) + ((127 + XBLOCK) // 128) * (((127 + XBLOCK) // 128) < 32), 16 * (16 <= ((7 + XBLOCK) // 8)) + ((7 + XBLOCK) // 8) * (((7 + XBLOCK) // 8) < 16), 8 * (8 <= XBLOCK) + XBLOCK * (XBLOCK < 8)], order=[0, 1, 2], offsets=[(xoffset // 128), (xoffset // 8) % 16, xoffset % 8]), boundary_check=[0, 1, 2]), [XBLOCK])
    tmp1 = tmp0 + tmp0
    tl.store(tl.make_block_ptr(out_ptr0, shape=[4096], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp1, [XBLOCK]).to(tl.float32))
''', device_str='cuda')
```

**Test Plan**

This PR adds a new CI test script to cover this feature. The tests can be grouped into a few main categories:
  - Can we generate strided block pointers for the appropriate shapes?
     - Powers of 2
     - Non-power of 2, but multiple of the maximum block size
     - Arbitrary leading dimensions, with power of 2 inner dimensions
     - Weird strides and offsets
     - Reductions
     - Symbolic shapes that are multiples of the maximum block size (wasn't able to trace this through dynamo)
     - Broadcasts (some variables are missing from the indexing expression)
  - Do we still compile other cases correctly, even if we don't expect to be able to generate block pointers?
     - Unsupported static shapes
     - Unsupported symbolic shapes
  - Mixing and matching these cases:
     - Pointwise and reduction in the same kernel
  - Sanity check the test harness
     - Do we raise an exception if the expected number of block pointers and the actual number are different?

**Follow-ups**

There are a few important cases which this PR can't handle. I'm hoping these can be deferred to follow-up PRs:
  - Handle non-divisible shapes
      - Change the tiling algorithm to generate a 2D (X,Y) blocking, if doing so enables block pointers to be emitted.
      - Pad unsupported loads up to the nearest divisible size, then mask/slice out the extra elements? This is probably the best solution, but I'm not yet sure how to go about it in triton.
 - Take advantage of this analysis when `triton.use_block_ptr=False`. I'm guessing we can still avoid `%` and `/` without requiring block pointers. Maybe we could compute block indices with arange and broadcast instead?

Differential Revision: D56739375

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127342
Approved by: https://github.com/jansel, https://github.com/shunting314
2024-06-16 07:35:57 +00:00
63d7ffe121 Retry of D58015187 Move AsyncCompile to a different file (#127691)
Summary:
This is a retry of https://github.com/pytorch/pytorch/pull/127545/files
and
D58015187, fixing the internal test that also imported codecache

Test Plan: Same tests as CI in github, plus sandcastle for internal unit tests should pass now

Differential Revision: D58054611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127691
Approved by: https://github.com/oulgen
2024-06-03 15:29:41 +00:00
22f392ba40 Revert "[easy?] Move AsyncCompile to a different file (#127235)"
This reverts commit f58fc16e8f059232f452a333f32e14ff681e12af.

Reverted https://github.com/pytorch/pytorch/pull/127235 on behalf of https://github.com/izaitsevfb due to breaking internal tests, see [D58015187](https://www.internalfb.com/diff/D58015187) ([comment](https://github.com/pytorch/pytorch/pull/127235#issuecomment-2143518610))
2024-06-01 17:16:16 +00:00
d49dc8f4b8 Revert "Add noqa to prevent lint warnings (#127545)"
This reverts commit f9937afd4f87fbb4844642ae2f587b13b5caa08c.

Reverted https://github.com/pytorch/pytorch/pull/127545 on behalf of https://github.com/izaitsevfb due to reverting to unblock the revert of #127545 ([comment](https://github.com/pytorch/pytorch/pull/127545#issuecomment-2143517711))
2024-06-01 17:12:46 +00:00
f9937afd4f Add noqa to prevent lint warnings (#127545)
This is to prevent the import from being removed as an unused import. What's annoying is that the lint doesn't run consistently: lintrunner doesn't warn me on this PR even without the comment, but it does on other PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127545
Approved by: https://github.com/masnesral
2024-05-30 17:56:49 +00:00
f58fc16e8f [easy?] Move AsyncCompile to a different file (#127235)
By moving AsyncCompile to its own file, we can import codecache without running the side effects of AsyncCompile. This will be important for AOTAutogradCaching, where we want to share some implementation details with codecache.py without spawning new processes.

To conservatively maintain the same behavior elsewhere, every time we import codecache, I've added an import to torch._inductor.async_compile (except in autograd_cache.py, where the explicit goal is to not do this); see the snippet below.
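
The resulting import pattern looks roughly like this (the trailing noqa, added in #127545, keeps linters from stripping the seemingly unused side-effect import):

```python
import torch._inductor.async_compile  # noqa: F401  (side-effect import kept alive by noqa)
from torch._inductor import codecache
```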

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127235
Approved by: https://github.com/aorenste, https://github.com/oulgen, https://github.com/masnesral
2024-05-30 02:43:02 +00:00
51e707650f Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126615
Approved by: https://github.com/yanboliang, https://github.com/drisspg, https://github.com/xmfan
2024-05-22 17:28:46 +00:00
8a4597980c Revert "Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615)"
This reverts commit 831efeeadf5fa8d9e7f973057e634a57e3bcf04b.

Reverted https://github.com/pytorch/pytorch/pull/126615 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126615#issuecomment-2124169157))
2024-05-22 08:23:40 +00:00