Commit Graph

582 Commits

Author SHA1 Message Date
f04cf13bdd Revert "Merge Triton ScaledMM as epilogue to MM template (#150045)"
This reverts commit 981048854da154eae8ff0bd439e72e1256ae00da.

Reverted https://github.com/pytorch/pytorch/pull/150045 on behalf of https://github.com/PaulZhang12 due to Need to add PR 150415 fixes for internal merge ([comment](https://github.com/pytorch/pytorch/pull/150045#issuecomment-2770252452))
2025-04-01 17:54:28 +00:00
981048854d Merge Triton ScaledMM as epilogue to MM template (#150045)
Previously, scaled_mm's (FP8 matmul) Triton lowering for inductor was in a separate template. This PR consolidates that lowering into the mm template, with an added epilogue to deal with multiplying the scales. This paves the way for future scaled variants of BMM, Grouped GEMM in inductor.

Currently, there is still a separate template for TMA+persistent version of scaled_mm. The current mm lowering has a separate template for TMA + Persistent version. Will hopefully consolidate the extra scaled_mm TMA+persistent template when the consolidation for the mm template is done.
TODO: Consolidate TMA+Persistent logic into 1 template and remove separate scaled_mm TMA template

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150045
Approved by: https://github.com/drisspg
2025-03-31 23:20:14 +00:00
266bd22b44 Improve subproc autotuning implementation (#149700)
Summary: The primary change is to update the autotune-in-a-subproc implementation to avoid using multiprocessing spawn. Spawn (re)executes the toplevel script in the subproc, which can be problematic. The approach here is similar to Triton parallel compile: we Popen a subproc on a controlled entry point and communicate over pipes. That change drove a lot of refactoring in the TuningProcess class, so I took the opportunity to simplify some things, rename some methods, etc.

One other notable change is around the timeout / kill approach. After a timeout, we were previously attempting to stop the subproc in three steps (graceful shutdown, sigkill if graceful fails, sigterm if sigkill fails). I'm gonna argue think that's not useful: 1) The graceful shutdown is never going to work unless the subproc happens to have just completed its task and is ready to receive the next command. 2) If we're going to kill the subproc, let's just take the most aggressive approach and move on as quickly as possible to restarting it rather than waiting to see if previous shutdown attempts succeeded. The only downside that I can find find is maybe a little log spew?, e.g., ` ResourceWarning: subprocess 2987680 is still running`

List of changes:
* Use Popen instead of spawn for the autotuning subprocess.
* Introduced a new entry point `__autotune_main__.py`
* Renamed some TuningProcess methods. For example `shutdown` makes more sense than `terminate` because the latter implies a forced kill.
* Simplified the implementation around benchmarking timeout and how we kill the subproc after a timeout.
* Deprecated the unused timeout configs in `_inductor/config.py`
* Moved `get_ld_library_path` helper to a common utils file.
* Added more unit tests for subproc crashes / timeouts / exceptions, etc.

Test plan:
* New unit tests
* Also ran internally with all combinations of: build mode `opt` and `dev-nosan`, and `buck run` vs. executing the `.par` file directly.
* Made sure the functionality to parallelize autotuning across different GPUs is working (it wasn't clear to me this was behaving the way we wanted it to).

Differential Revision: [D71976971](https://our.internmc.facebook.com/intern/diff/D71976971)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149700
Approved by: https://github.com/aorenste, https://github.com/jansel, https://github.com/eellison
2025-03-28 01:06:39 +00:00
c830d750e6 [graph partition] support splitting on custom ops (#149782)
This PR adds support for graph partition on custom ops. Land after #149458.

### API
This PR provides a new API to register/unregister custom ops for graph partition.

```python
def register_custom_op_support_cudagraph(
    operator: torch._library.custom_ops.CustomOpDef,
    is_cudagraphable: bool,
) -> None
```

Example usage:

```python
from torch._inductor.utils import register_custom_op_partition

@torch.library.custom_op("mylib::movement", mutates_args=())
def movement(pic: torch.Tensor) -> torch.Tensor:
    img = pic.cpu()
    cropped_img = (img + 1) * 2
    return cropped_img.cuda() / 255.0

@movement.register_fake
def _(pic):
    return torch.empty_like(pic)

register_custom_op_support_cudagraph(movement, is_cudagraphable=False)
```

### Example
In this example, 1 torch-compiled region has 3 cudagraphs after splitting on 2 custom ops.

![image](https://github.com/user-attachments/assets/6d07355b-6690-4cde-89ef-e4aff6b0079c)

Code to repro:
```python
import torch
from torch._inductor.utils import register_custom_op_support_cudagraph

torch._inductor.config.graph_partition = True

@torch.library.custom_op("mylib::movement", mutates_args=())
def movement(pic: torch.Tensor) -> torch.Tensor:
    img = pic.cpu()
    cropped_img = (img + 1)*2
    return cropped_img.cuda() / 255.

@movement.register_fake
def _(pic):
    return torch.empty_like(pic)

@torch.library.custom_op("mylib::modify", mutates_args=())
def modify(pic: torch.Tensor) -> torch.Tensor:
    pic1 = pic + 1
    pic1_cpu = (pic1.cpu() + 1) * 2
    return pic1_cpu.cuda() + pic

@modify.register_fake
def _(pic):
    return torch.empty_like(pic)

@torch.library.custom_op("mylib::transform", mutates_args=())
def transform(pic: torch.Tensor) -> torch.Tensor:
    return (pic + 1) * 2

@transform.register_fake
def _(pic):
    return torch.empty_like(pic)

register_custom_op_support_cudagraph(movement, is_cudagraphable=False)
register_custom_op_support_cudagraph(modify, is_cudagraphable=False)

img = torch.randn(3, 64, 64, device="cuda")

def f(img):
    x = (img + 10) * 2
    y = movement(x)
    z = y + 1
    u = transform(z)
    v = 2*u + 1
    out = modify(v)
    return out + 1

compiled_f = torch.compile(f, mode="reduce-overhead", fullgraph=True)

eager_out = f(img)

for _ in range(3):
    compiled_out = compiled_f(img)
    assert torch.allclose(eager_out, compiled_out)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149782
Approved by: https://github.com/zou3519
2025-03-27 16:23:07 +00:00
48cff64a54 [pt2_provenance_tracing] add combo kernel nodes post_grad nodes origin info (#149598)
Summary: found it helpful when running prod model with combo_kernel feature enabled

Test Plan: CI

Differential Revision: D71513304

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149598
Approved by: https://github.com/yushangdi
2025-03-27 00:26:24 +00:00
185aaaaf8e Revert "Improve subproc autotuning implementation (#149700)"
This reverts commit 8cd6a133f21821f0713116f0f9a55e5368de8c1c.

Reverted https://github.com/pytorch/pytorch/pull/149700 on behalf of https://github.com/yangw-dev due to This is breaking servicelab_benchmark_pyper_local_runner internally ([comment](https://github.com/pytorch/pytorch/pull/149700#issuecomment-2755975959))
2025-03-26 23:17:01 +00:00
8cd6a133f2 Improve subproc autotuning implementation (#149700)
Summary: The primary change is to update the autotune-in-a-subproc implementation to avoid using multiprocessing spawn. Spawn (re)executes the toplevel script in the subproc, which can be problematic. The approach here is similar to Triton parallel compile: we Popen a subproc on a controlled entry point and communicate over pipes. That change drove a lot of refactoring in the TuningProcess class, so I took the opportunity to simplify some things, rename some methods, etc.

One other notable change is around the timeout / kill approach. After a timeout, we were previously attempting to stop the subproc in three steps (graceful shutdown, sigkill if graceful fails, sigterm if sigkill fails). I'm gonna argue think that's not useful: 1) The graceful shutdown is never going to work unless the subproc happens to have just completed its task and is ready to receive the next command. 2) If we're going to kill the subproc, let's just take the most aggressive approach and move on as quickly as possible to restarting it rather than waiting to see if previous shutdown attempts succeeded. The only downside that I can find find is maybe a little log spew?, e.g., ` ResourceWarning: subprocess 2987680 is still running`

List of changes:
* Use Popen instead of spawn for the autotuning subprocess.
* Introduced a new entry point `__autotune_main__.py`
* Renamed some TuningProcess methods. For example `shutdown` makes more sense than `terminate` because the latter implies a forced kill.
* Simplified the implementation around benchmarking timeout and how we kill the subproc after a timeout.
* Deprecated the unused timeout configs in `_inductor/config.py`
* Moved `get_ld_library_path` helper to a common utils file.
* Added more unit tests for subproc crashes / timeouts / exceptions, etc.

Test plan:
* New unit tests
* Also ran internally with all combinations of: build mode `opt` and `dev-nosan`, and `buck run` vs. executing the `.par` file directly.
* Made sure the functionality to parallelize autotuning across different GPUs is working (it wasn't clear to me this was behaving the way we wanted it to).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149700
Approved by: https://github.com/aorenste, https://github.com/jansel, https://github.com/eellison
2025-03-25 20:07:28 +00:00
f7d1b966c2 [Inductor] Unify the data type propagation between Triton and CPP Backend (#146970)
Fixes #144246

Use `DtypePropagationOpsHandler` for CSE variables of CPP backend. In addition, add static type checking for the generated CPP code similar to the `config.test_configs.runtime_triton_dtype_assert`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146970
Approved by: https://github.com/jgong5, https://github.com/eellison, https://github.com/leslie-fang-intel
2025-03-21 17:52:51 +00:00
7bb9c36784 Hook StaticCudaLauncher up to torch.compile (cold start) (#148890)
This hooks up the previous PR to torch.compile. Will add a config flag to hide this behind in a bit, but for now it's useful for testing purposes to have it on by default.

Inductor will automatically choose to use StaticCudaLauncher to launch triton kernels if:
- The kernel is a cuda kernel and inductor can find a cubin file associated with it
- The kernel takes less than 50 arguments
- The kernel doesn't use any special features (launch hooks, large amounts of shared memory)
- The kernel is not user defined (to be supported in a later PR)

We split CompileResult into TritonCompileResult and StaticTritonCompileResult, but have them share implementations of how they exec a python launcher. StaticTritonCompileResult's python launcher has the benefit of a simpler def_args/call_args setup, since it always filters out all constexprs before running, no matter the triton version.

Some key features of StaticTritonCompileResult:
- It is fully serializable
- It stores the minimum amount of stuff, so that later it can be cached easily
- It does not depend on any triton specific types (though it does have various triton metadata).

For now, both TritonCompileResult and StaticTritonCompileResult still `exec` custom python launchers, and use GridExpr. We can change that in the future to simplify if we'd like. For now though, this custom python codegen is good for flexibility when it comes to supporting removal of constexprs, so using it for static launching is nice to not have to pay the cost of removing constexprs at kernel runtime.

Hooking everything up to torch.compile lets me run every unit test with StaticCudaLauncher to make sure that we still pass (even if we bypass StaticCudaLauncher itself). It also lets me check for compilation/runtime performance with these changes.

Fixes #149448

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148890
Approved by: https://github.com/jansel
2025-03-20 17:32:20 +00:00
a0ac63cbd9 [BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257
Approved by: https://github.com/jansel
2025-03-18 00:46:07 +00:00
24cfeec2c7 Revert "[BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257)"
This reverts commit bfee141666319c80b6c5284394905beef8682515.

Reverted https://github.com/pytorch/pytorch/pull/149257 on behalf of https://github.com/malfet due to Let's see if it helps restore compiler benchmark sanity, see 8bc7bd94a5/1 ([comment](https://github.com/pytorch/pytorch/pull/149257#issuecomment-2731133812))
2025-03-17 22:57:00 +00:00
bfee141666 [BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257
Approved by: https://github.com/jansel
2025-03-16 23:52:58 +00:00
3e605fe46d [CUDAGraph] Graph Partition (#147648)
This PR implements cudagraph partition, following previous PR on inductor graph partition (#147038). Since there are many ops that cudagraph cannot support, this PR focuses on `cpu ops` and will add more partition rules in the next PR.

## Example
```python
import torch

torch._inductor.config.graph_partition = True

def f(x, y):
    x1 = x + 1
    y1 = y + 1
    y_cpu = y1.cpu() + 1
    z = x @ y
    return x1 + y1 + z + y_cpu.cuda()

x, y = [torch.ones(2, 2, device="cuda") for _ in range(2)]
x_cloned, y_cloned = [tmp.clone() for tmp in [x,y]]
eager_out = f(x, y)

f_compiled = torch.compile(f, mode="reduce-overhead")

for _ in range(5):
    compiled_out = f_compiled(x_cloned, y_cloned)
    assert torch.allclose(eager_out, compiled_out)
```

w/o graph partition, we will skip cudagraph:
```
skipping cudagraphs due to skipping cudagraphs due to cpu device (device_put). Found from :
   File "/home/boyuan/playground/cudagraph/graph_partition/graph_partition.py", line 9, in f
    y_cpu = y1.cpu() + 1 # 3
```

w/ graph partition, we can see two cudagraphify under the same torch-compiled region:
![image](https://github.com/user-attachments/assets/4e22d428-2687-433d-b92a-0814a2201b25)

## Design

PR #147038 splits `def call(args)` function into multiple `def partition_id(args)`. In this PR, we use `recursively_apply_fns()` to wrap each `partition_id()` function with `cudagraphify`. One major design point is, `cudagraphify` takes metadata such as static_input_idxs and we need to provide such metadata for each graph partition. However, we previously only have such metadata for the original graph instead of graph partitions.

The [idea](https://github.com/pytorch/pytorch/pull/147038#discussion_r1964124800) is:
- compute a mapping from the partition metadata (e.g., input/output idx) to the graph metadata, stored in `GraphPartitionMap`.
- during post_compile, get the `CudagraphMetadata` for each partition based on the graph-level metadata and `GraphPartitionMap`, via `get_partition_cudagraph_metadata()`.
- finally, in `cudagraph_partition_pos_compile`, we compute the `CudagraphMetadata` and apply cudagraphify for each graph via `recursively_apply_fns`.

#### Q: How does it work with codecache?

While we have multiple graph partitions, we still have 1 file and 1 `call` function for 1 dynamo graph. The major difference is we need to additionally load a `recursively_apply_fns()` for graph partition. We also add `partition_maps: Optional[list[GraphPartitionMap]]` to `CompiledFxGraph` so it will be serialized and could be deserialized later.

## Edge Case 1
PyTorch has an assumption on input/output orders. For example, backward inputs take saved tensors first and then tangents. In graph partition, we respect such orders via `graph_partition_signature_reorder`.

## Edge Case 2
Cudagraphifying `call` function gives 2 cudagraph managed tensors `buf0` and `primals_1`. However, cudagraphifying `partition_0` gives only 1 cudagraph managed tensor `buf0`. This leads to a semantic difference between cudagraph w/ and w/o graph partition. [full code comparison](https://www.internalfb.com/intern/diffing/?paste_number=1747654420)

![image](https://github.com/user-attachments/assets/03d08ce0-f1d1-4d1d-8432-805a07e1dd40)

To achieve the same semantic, we returns an input tensor as output if it is not freed in a graph partition. This allows more cudagraph managed tensors and is important for handling saved tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147648
Approved by: https://github.com/eellison
2025-03-13 16:00:21 +00:00
c916a8efc5 Revert "Use the device interface for detecting Triton availability (#139171)"
This reverts commit 940b60db974f08a31c746eec2f9c399fc8a861ee.

Reverted https://github.com/pytorch/pytorch/pull/139171 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @jansel can you please help get these changes working? See D70946254 for more details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/139171#issuecomment-2715392451))
2025-03-11 18:49:21 +00:00
940b60db97 Use the device interface for detecting Triton availability (#139171)
This allows for each device type to check current devices for Triton compatibility and ensure their Triton backend is present.

This PR replaces the `has_triton()` global method which was previously used for this task, and moves the initial check for each Inductor backend on to their associated `BaseScheduler` subclass. This means that other backends, such as Halide, can also implement their own availability checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139171
Approved by: https://github.com/jansel
2025-03-11 03:56:11 +00:00
b215841ebb [MM] Add sm carevout to lowerings (#148793)
# Summary

See https://github.com/pytorch/pytorch/issues/145115 for more details. I have been using
the following to verify, need to figure out how to do proper guarding

This does  do the correct thing if we compile w/ sm carvout already set but since we dont guard on it just yet we dont recompile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148793
Approved by: https://github.com/lw, https://github.com/eellison
2025-03-10 23:49:26 +00:00
4c13a859e5 Workaround no triton float8_e8m0fnu support in inductor (#148722)
Triton doesn't support actual float8_e8m0fnu yet, so we can't currently codegen any arithmetic on them. But we can support bitcasting, and view/memory operators and treat them as uint8 for now. Fix for https://github.com/pytorch/pytorch/issues/147873.

The one question i'm not sure of is whether or not we need to explicitly disable triton template fusion since it would fuse in these dtypes as uint8..

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148722
Approved by: https://github.com/vkuzo
ghstack dependencies: #148450
2025-03-10 17:37:39 +00:00
50c9f6d83b [Windows][Inductor][XPU] Unload triton pyd files to be able to remove them on Windows. (#148323)
In `fresh_inductor_cache` remove pyd files will raise permission error
on Windows because they are still used by the process.
So we clear the references to the loaded pyd libray obj and unload them
from the process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148323
Approved by: https://github.com/jansel
ghstack dependencies: #148534, #148538, #147727
2025-03-07 17:19:59 +00:00
262411e48b [inductor] online softmax (#127011)
Softmax need do some preparation work that access the input tensor in two passes
- compute amax of each row
- compute (x - amax).exp.sum for each row

When the row size is large, cache can not hold all the active data and accessing the input multiple passes increases execution time since the kernel is membw bounded.

Online softmax uses a customized reduction to compute max and sum at the same time by accessing the data in one pass. Check this paper for more details ( https://arxiv.org/abs/1805.02867 ).

Also here is an online softmax kernel generated by inductor as a reference: https://gist.github.com/shunting314/67ae4fffd45d4f2753c781780332fa54

## Microbenchmark

- `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=0 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax` : without online softmax
  - eager_ms=6.671296119689941
  - opt_ms=8.06931209564209
- `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=1 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax`: with online softmax
  - eager_ms=6.634047985076904
  - opt_ms=6.230591773986816

Ideally, online softmax should save about 2ms here. We saves about 1.84ms in practice.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127011
Approved by: https://github.com/jansel
2025-03-06 21:07:18 +00:00
d6d670ab4d [AOTI] build CPU CPP kernels at O3, and all other code at O1 (#148587)
In the future, we may also want to add LTO linking to further optimize the results (while still hopefully netting compile time benefits).

Differential Revision: [D70641543](https://our.internmc.facebook.com/intern/diff/D70641543)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148587
Approved by: https://github.com/desertfire
2025-03-05 22:47:46 +00:00
ab81ca5053 [Inductor][CPU] Add GEMM templates for _weight_int4pack_mm_for_cpu with AVX512 (#146756)
**Summary**
It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU.

This PR adds GEMM templates for `torch.ops.aten_weight_int4pack_mm_for_cpu`. The micro kernel used for the templates is based on AVX512 and it's a copy of the ATen implementation of `torch.ops.aten_weight_int4pack_mm_for_cpu` with minor changes.

Due to better blocking and loop schedule, the GEMM template based implementation outperforms the ATen implementation in all cases we tested.

**Test plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_avx512
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146756
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-03 00:56:29 +00:00
1cb4e2df65 [BE][PYFMT] migrate PYFMT for torch._inductor to ruff format (#144550)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144550
Approved by: https://github.com/jansel
2025-02-28 13:33:19 +00:00
eb9c127341 [dynamo][optimizers] Install ID_GUARDED tensors into the Fx graph (#147824)
Earlier, with inline flag we were lifting id-guarded tensors to the inputs to the Fx graph. But this offers no benefit. Main idea behind lifting parameters as inputs was to reuse the compilation units across many instances of the nn-module. However, if we are guarding on the `id`, we are explicitly specializing the compiled artifact to the parameter.

This PR installs the parameters back into the graph. The benefit is removal of all pre-graph bytecode to extract the id-guarded tensors from locals/globals. This increases speedup from 1.67x to 1.75x for an internal model that has large number of optimizer parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147824
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@meta.com>
2025-02-28 03:22:11 +00:00
481a57bc37 Support torch.compile rng selective activation checkpointing with cudagraph (#146878)
TODO:
- [x]  Add handling for when forward is invoked multiple times without invoking backward, so that the fwd/backward states are out of sync
- [x] Update rng state initialization to take from correct device
- [x]  Tests
- [x] handling of retain_graph
- [x] respect fallback random

Fix for https://github.com/pytorch/pytorch/issues/130123.

Updates the aot_eager and cudagraph compilation of `run_and_save_rng_state` to use the new mechanism added by https://github.com/pytorch/pytorch/pull/114068 for CUDAGraph safe rng states.

We have a pair of rng states for the fwd and backward respectively. In both forward and backward the rng op will get run with `graphsafe_run_with_rng_state` which takes in RNG state and it hooks onto the current RNG generator before running the operator. The rng states for fwd/backward are initialized with the same value. We ensure that for any given run of the forward, the corresponding backward run will have the same rng states for the op as was observed in the forward.

```
 ===== Forward graph 1 =====
 /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", fwd_rng_state_0):
        sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = fwd_rng_state_0);  fwd_rng_state_0 = None
        ...

 ===== Backward graph 1 =====
    def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", tangents_1: "f32[4, 4][4, 1]cuda:0", bwd_rng_state_0):
        sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = bwd_rng_state_0);  bwd_rng_state_0 = None
```

There is some extra complication when a user either calls backward with retain_graph, or calls the backward in a different order as they called the forward. If a user has state fwd_rng_state0, bwd_rng_state0 and calls:
- fwd0: fwd_rng_state0 -> fwd_rng_state1
- fwd1: fwd_rng_state1 -> fwd_rng_state2
- bwd1
- bwd0

Then naively, when bwd1 is invoked the bwd rng states would not be equal to the same states that were observed in fwd1. I added handling of this in the aot runtime wrappers to detect pending backward invocations, and the current position of the bwd rng states, and to update when necesssary.

Other notes:

Because nodes which appear later in the forward appear earlier in the backward, we need a separate rng state for each operator. If we reused the rng across ops, the forward and backward would be run with different rng states. I.e., not applied in the same order.

Questions for reviewers:

This does change numerics, bc the rng of the op is now taken from the input rng state instead of whatever the rng would be midway through running the graph. Technically, we only need this for cuda graph. But, I'd prefer to not have a rng divergence just for cudagraph. I am making it respect `fallback_random`.

Edit: decided to apply to non cudagraphs as well, so long as fallback_random is not set

I'm initializing the rng states by cloning the current state. If you had something like 5 different rands in the model with the same shape, theyd all get the same value. This doesn't seem great. I could use some other initialization scheme like taking seed from graph position, or etc etc. Not sure. Let me know thoughts.

Edit: updated to be taken from randint()

Update: initializing rng states from torch.randint..

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146878
Approved by: https://github.com/anijain2305, https://github.com/bdhirsh
2025-02-28 00:47:03 +00:00
17358ce778 Revert "Support torch.compile rng selective activation checkpointing with cudagraph (#146878)"
This reverts commit ad0c879e2203145f6d56df0b95af36822220ab8f.

Reverted https://github.com/pytorch/pytorch/pull/146878 on behalf of https://github.com/wdvr due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/146878#issuecomment-2686767956))
2025-02-27 03:36:16 +00:00
ad0c879e22 Support torch.compile rng selective activation checkpointing with cudagraph (#146878)
TODO:
- [x]  Add handling for when forward is invoked multiple times without invoking backward, so that the fwd/backward states are out of sync
- [x] Update rng state initialization to take from correct device
- [x]  Tests
- [x] handling of retain_graph
- [x] respect fallback random

Fix for https://github.com/pytorch/pytorch/issues/130123.

Updates the aot_eager and cudagraph compilation of `run_and_save_rng_state` to use the new mechanism added by https://github.com/pytorch/pytorch/pull/114068 for CUDAGraph safe rng states.

We have a pair of rng states for the fwd and backward respectively. In both forward and backward the rng op will get run with `graphsafe_run_with_rng_state` which takes in RNG state and it hooks onto the current RNG generator before running the operator. The rng states for fwd/backward are initialized with the same value. We ensure that for any given run of the forward, the corresponding backward run will have the same rng states for the op as was observed in the forward.

```
 ===== Forward graph 1 =====
 /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", fwd_rng_state_0):
        sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = fwd_rng_state_0);  fwd_rng_state_0 = None
        ...

 ===== Backward graph 1 =====
    def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", tangents_1: "f32[4, 4][4, 1]cuda:0", bwd_rng_state_0):
        sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = bwd_rng_state_0);  bwd_rng_state_0 = None
```

There is some extra complication when a user either calls backward with retain_graph, or calls the backward in a different order as they called the forward. If a user has state fwd_rng_state0, bwd_rng_state0 and calls:
- fwd0: fwd_rng_state0 -> fwd_rng_state1
- fwd1: fwd_rng_state1 -> fwd_rng_state2
- bwd1
- bwd0

Then naively, when bwd1 is invoked the bwd rng states would not be equal to the same states that were observed in fwd1. I added handling of this in the aot runtime wrappers to detect pending backward invocations, and the current position of the bwd rng states, and to update when necesssary.

Other notes:

Because nodes which appear later in the forward appear earlier in the backward, we need a separate rng state for each operator. If we reused the rng across ops, the forward and backward would be run with different rng states. I.e., not applied in the same order.

Questions for reviewers:

This does change numerics, bc the rng of the op is now taken from the input rng state instead of whatever the rng would be midway through running the graph. Technically, we only need this for cuda graph. But, I'd prefer to not have a rng divergence just for cudagraph. I am making it respect `fallback_random`.

Edit: decided to apply to non cudagraphs as well, so long as fallback_random is not set

I'm initializing the rng states by cloning the current state. If you had something like 5 different rands in the model with the same shape, theyd all get the same value. This doesn't seem great. I could use some other initialization scheme like taking seed from graph position, or etc etc. Not sure. Let me know thoughts.

Edit: updated to be taken from randint()

Update: initializing rng states from torch.randint..

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146878
Approved by: https://github.com/anijain2305, https://github.com/bdhirsh
2025-02-27 02:08:29 +00:00
452315c84f Fix RuntimeError: value cannot be converted to type int64_t without overflow (#147492)
The exact call is coming from here:

78a94c9114/torch/_inductor/memory.py (L161)

I have no idea why this error is being thrown and what mode/modes might be failing for this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147492
Approved by: https://github.com/eellison
2025-02-20 08:00:26 +00:00
db4ce78d46 PEP585: More UP006 fixes (#146392)
This should be the final PR before we can enable RUFF UP006.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146392
Approved by: https://github.com/justinchuby, https://github.com/albanD, https://github.com/Skylion007
2025-02-20 06:18:13 +00:00
44ee9ca593 [inductor] Add type annotations to _inductor/utils.py (#144108)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144108
Approved by: https://github.com/eellison
2025-02-15 23:13:41 +00:00
579b9f2ed9 [inductor] Better exception error messages for cache_on_self (#146652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146652
Approved by: https://github.com/yanboliang
2025-02-07 21:22:21 +00:00
e9f6e273e7 [inductor] Add typing to common.CSE (#145993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145993
Approved by: https://github.com/yanboliang
ghstack dependencies: #145916
2025-02-04 16:05:39 +00:00
d3c7e4bb9c Revert "[inductor] Add typing to common.CSE (#145993)"
This reverts commit 8c657ae4be55c6133307ad278c1740af5db133a7.

Reverted https://github.com/pytorch/pytorch/pull/145993 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/145916 ([comment](https://github.com/pytorch/pytorch/pull/145993#issuecomment-2632712384))
2025-02-04 03:04:01 +00:00
8c657ae4be [inductor] Add typing to common.CSE (#145993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145993
Approved by: https://github.com/yanboliang
ghstack dependencies: #145913, #145914, #145915, #145916
2025-02-01 16:34:18 +00:00
57d8278ab9 pickler for GraphModule (#141659)
Pickling GraphModule needs some special handling for wrapping things that normally can't be pickled - but async compile needs to pass them across a wire so we need to be able to serialize it - add some helpers to enable that.

Differential Revision: [D68921318](https://our.internmc.facebook.com/intern/diff/D68921318)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141659
Approved by: https://github.com/jamesjwu
2025-01-31 05:34:28 +00:00
2de53b3b65 Revert "pickler for GraphModule (#141659)"
This reverts commit c6ad08357bf8e766b5220bfb5cbbfdb2a4ec0ca5.

Reverted https://github.com/pytorch/pytorch/pull/141659 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, please take a look at D68694181 for more details. ([comment](https://github.com/pytorch/pytorch/pull/141659#issuecomment-2617045120))
2025-01-27 22:39:30 +00:00
c6ad08357b pickler for GraphModule (#141659)
Pickling GraphModule needs some special handling for wrapping things that normally can't be pickled - but async compile needs to pass them across a wire so we need to be able to serialize it - add some helpers to enable that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141659
Approved by: https://github.com/jamesjwu
2025-01-26 19:29:13 +00:00
a40ead1fd6 Don't fail if fresh_inductor_cache fails to clean up its tmp dir. (#145513)
Summary: I see we have a test failure due to an error removing the tmp dir: https://github.com/pytorch/pytorch/issues/141761. Seems like we should not raise an exception for this case in general. Also, let's clean up the exception handling related to windows. The comment makes it sound like we want to specifically ignore failures cleaning up, but the current impl is swallowing all exceptions.

Fixes #141761

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145513
Approved by: https://github.com/eellison
2025-01-24 03:17:03 +00:00
b963ab5325 [inductor][1/N] triton support post-#5512, main components (#145051)
Triton commit 5220 adds tuple support in Triton (changing the indexing format in AttrsDescriptor) and commit 5512 replaces AttrsDescriptor with raw tuples. This is an initial PR to add support for Triton versions after commit 5512 landed.

The main changes in 5220 and 5512 that need to be supported:
* AttrsDescriptor() gets replaced with a raw dict. The raw dict has the format `{(TUPLES): [["tt.divisibility", 16]]}`, where `(TUPLES)` is a tuple of indices, e.g. `((0,), (1,), (3,))` to indicate that args 0, 1, and 3 are divisible by 16. These indices are, themselves, represented as tuples to support nested inputs (e.g. an argument that's a tuple), but support for tuples is not implemented right now.
* "signature" changes: the signature now contains _all_ args, including constexpr and constant args.
* ASTSource now takes "constexprs" instead of "constants" - for example, equal-to-1 args are constants but not constexprs so we don't need to pass these args as "constants".

What this PR supports:
* Triton versions before Dec 9, 2024, and (partial support for) Triton versions after Jan 1, 2025
* (triton jan 1+) typical inductor-generated triton: updated AttrsDescriptor, signatures, constexpr/constant handling.

What this PR doesn't support (TODO in follow-up PRs):
* Triton versions between Dec 9, 2024 and before Jan 1, 2025
* (triton jan 1+) user-defined triton kernel support (this is implemented already in @anmyachev's patch)
* (triton jan 1+) triton_helper support (failing in triton codegen - needs investigation)
* (triton jan 1+) AOTI / cpp wrapper

thanks to @anmyachev for patches in https://github.com/intel/intel-xpu-backend-for-triton/blob/main/scripts/pytorch.patch, which contains most of these changes already

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145051
Approved by: https://github.com/jansel
2025-01-24 00:34:01 +00:00
46851022ff [Inductor][CPU] Add auto-tuning support for da8w8 sym act sym wgt GEMM (#143187)
## Summary

Templated `int8xint8->int32` GEMM that uses AMX ISA (present on Intel Xeon Gen 4 & above). Any epilogues such as weight scale, activation scale, and bias are applied per output block in a fused manner .
Performs well for large values of `M` dimension (assuming canonical dimensions [`M, K`] and [`K, N`] for the activation & weight matrices'/tensors' sizes) when the activation is quantized per-token.
Also supports SmoothQuant GEMM pattern when activation is quantized per-tensor (scalar scale) or per-token (vector scale is applied as an epilogue in this case).

Also increased coverage of GEMM template for uint8 activation, int8 weight GEMM UTs for when the activation zero point is a 1D tensor (the existing implementation only accepted 0D tensors). However, some of such UTs would have to be explicitly enabled with `max-autotune` Inductor config.

## Performance data

The templated codegened fused GEMM with M=32, K=4096, N=14336 used in LLaMA3 exhibits more than 2x perf-gain compared to oneDNN qlinear + mul (for activation's scale) with 48 cores of one socket of Xeon SP 4th gen Platinum 8468 when per-token quantization is used.

For M=1, K=4096, N=14336, regardless of whether per-tensor quantization was used for activation or per-token, the perf gain was more than 3x.

Intel OpenMP & libtcmalloc had been preloaded. All cores used by the workload corresponded to distinct physical cores.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143187
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel, https://github.com/jgong5

Co-authored-by: Leslie Fang <leslie.fang@intel.com>
2025-01-22 02:27:53 +00:00
bac62341eb PEP585 update - torch/_inductor (#145198)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145198
Approved by: https://github.com/bobrenjc93
2025-01-21 21:04:33 +00:00
9d98b66e7b [Inductor][CPP] Enable Epilogue Fusion for Grouped GEMM Template (#143897)
**Summary**
In this PR, we enable the epilogues fusion and code generation for Grouped GEMM. Here are the high-level description of how we implement it.

**Fusion**

- The Grouped GEMM Template produces a `Template Buffer` with a `MultiOutputLayout` and a set of `MultiOutput Buffers`, where each buffer corresponds to a specific GEMM.
- During the initial round of fusion, the `Template Buffer` and all associated `MultiOutput Buffers` are fused into a `FusedSchedulerNode` by extending the existing fusion design.
- In subsequent fusion rounds, this `FusedSchedulerNode` can further fuse with its epilogues, following the original fusion design principles.

**Code Gen**
We maintain a list of epilogues and codegen it one by one.

- If any of the GEMM has bias, we create  a extra `bias_add` epilogue and prepend it at first of the epilogue list.
- If any of the GEMM has no epilogue, we create a `to_bf16` copy epilogue and append it at last of the epilogue list.

**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear_epilogue
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143897
Approved by: https://github.com/jansel, https://github.com/jgong5
ghstack dependencies: #143796
2025-01-14 06:07:50 +00:00
a3ab27b8e0 Migrate from Tuple -> tuple in torch/_inductor (#144264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144264
Approved by: https://github.com/eellison
2025-01-07 03:27:27 +00:00
934eaa503f [Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266)
This PR aims to add the functionality support of max-autotune for XPU. The current triton templates and configurations are not well optimized for XPU, so the performance is not ready yet. Also the `mm_plus_mm` template have accuracy issues in some cases. We will address these issues in the next PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143266
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-12-30 23:51:17 +00:00
2da7fb5320 [inductor] Make generated kernels deterministic (#143951)
`"compile_id"` had slipped into our generated Triton code (in the
metadata), which will defeat caching because the same kernels generated
in a different order would not cache hit with eachother.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143951
Approved by: https://github.com/oulgen
2024-12-30 23:35:11 +00:00
1b0d19a2cb Revert "[inductor] Make generated kernels deterministic (#143951)"
This reverts commit 79b354ee37b7d8a06a48ca8cc4e19a3fd006b433.

Reverted https://github.com/pytorch/pytorch/pull/143951 on behalf of https://github.com/wdvr due to failing tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/143951#issuecomment-2564952267))
2024-12-30 02:06:38 +00:00
79b354ee37 [inductor] Make generated kernels deterministic (#143951)
`"compile_id"` had slipped into our generated Triton code (in the
metadata), which will defeat caching because the same kernels generated
in a different order would not cache hit with eachother.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143951
Approved by: https://github.com/oulgen
2024-12-29 19:53:33 +00:00
844e6108f6 Revert "[Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266)"
This reverts commit ad750ae32079020f51f9b7d01237f3ecfa83b6ff.

Reverted https://github.com/pytorch/pytorch/pull/143266 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing some tests in trunk ([comment](https://github.com/pytorch/pytorch/pull/143266#issuecomment-2561303786))
2024-12-24 17:22:57 +00:00
ad750ae320 [Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266)
This PR aims to add the functionality support of max-autotune for XPU. The current triton templates and configurations are not well optimized for XPU, so the performance is not ready yet. Also the `mm_plus_mm` template have accuracy issues in some cases. We will address these issues in the next PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143266
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-12-24 05:42:36 +00:00
a316a4581d Add mps to GPU_TYPES (#143634)
Because it is a GPU, but don't require a triton, as it does not need one

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143634
Approved by: https://github.com/jansel
2024-12-22 18:37:35 +00:00
af0e159740 [Inductor XPU] Add XPU check for is_big_gpu(). (#143491)
Fix #143472

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143491
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/EikanWang
2024-12-21 02:27:04 +00:00