Commit Graph

502 Commits

d3f196909d [inductor] let inplace-padding support cpp-wrapper (#145325)
Some context: inplace padding is an optimization that performs padding in place. E.g., suppose a tensor has size [2048, 2047] and stride [2048, 1]. When we need to pad one extra element onto the end of each row (e.g., during mm padding), we can reuse the original tensor and do the padding in place. This saves memory and bandwidth. One caveat: PyTorch does not allocate 2048 elements for the last row of the original tensor; it only allocates 2047. So assuming the last row has enough space for 2048 elements can be wrong and cause out-of-bounds (OOB) memory access (although I have never seen this happen, perhaps due to overallocation in the CUDACachingAllocator, it should still be fixed).

The fix is that, when we allocate the tensor, instead of doing something like:
```
  buf0 = randn_strided([2048, 2047], [2048, 1])
```
we do a small overallocation:
```
  buf0 = randn_strided([2048, 2048], [2048, 1]).as_strided([2048, 2047], [2048, 1])
```
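
As a sanity check, here is a minimal sketch using stock PyTorch APIs (not the PR's code) that verifies the storage-size caveat and the overallocation trick:
```
import torch

# A [2048, 2047] tensor with stride [2048, 1] is allocated one element
# short of a full [2048, 2048] block, so padding the last row in place
# would touch memory outside the allocation.
t = torch.empty_strided((2048, 2047), (2048, 1))
print(t.untyped_storage().nbytes() // t.element_size())  # 4194303, not 4194304

# The overallocation trick: allocate the full block, then view it back.
buf0 = torch.randn(2048, 2048).as_strided((2048, 2047), (2048, 1))
print(buf0.untyped_storage().nbytes() // buf0.element_size())  # 4194304
```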

cpp_wrapper needs special handling since memory allocation goes through a different code path than the Python wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145325
Approved by: https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #140249
2025-01-23 09:22:38 +00:00
3a58512613 [Inductor] inplace padding (#140249)
https://github.com/pytorch/pytorch/issues/139865

This PR may change the semantics of constant_pad_nd from 'clone' to 'view'. I tried a few tests that perform in-place updates; thanks to functionalization, this appears to work fine.
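
For illustration, a hedged sketch of the pattern this targets (plain PyTorch, not the PR's code):
```
import torch
import torch.nn.functional as F

# F.pad with the default constant mode lowers to aten.constant_pad_nd.
# Under this PR, the padded result may be a view into (overallocated)
# input storage rather than a fresh clone; functionalization keeps the
# user-visible mutation semantics unchanged either way.
x = torch.randn(2048, 2047)
padded = F.pad(x, (0, 1))  # pad one element at the end of each row
assert padded.shape == (2048, 2048)
```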

Perf for `test_linear_and_cel`:
```
# TORCHINDUCTOR_INPLACE_PADDING=0 DO_PERF_TEST=1 python test/inductor/test_inplace_padding.py -k test_linear_and_cel
inductor_config.inplace_padding=False ms=83.311

# TORCHINDUCTOR_INPLACE_PADDING=1 DO_PERF_TEST=1 python test/inductor/test_inplace_padding.py -k test_linear_and_cel
inductor_config.inplace_padding=True ms=79.827
```

The saving is about 4ms (slightly less in practice, since we need to fill 0s in the padding area). Similar savings for llm.c:
- Without the feature: 182.151ms per batch, 180.9K tokens/s
- With the feature: 178.278ms per batch, 183.9K tokens/s, a 3K tokens/s increase

Perf tests show a compilation-time regression. I'm not sure if that's real and will debug more. The good news is that there is no accuracy failure: [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Nov%202024%2020%3A23%3A22%20GMT&stopTime=Mon%2C%2011%20Nov%202024%2020%3A23%3A22%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/shunting314/186/head&lCommit=03fd924ff382958daf5055dc8425d279e4e10a1e&rBranch=main&rCommit=c03324de2dfbbf0006818c86b88c92a3378f46b7).

UPDATE: the perf-test regression appears not to be real. Here is a rerun: [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2007%20Nov%202024%2001%3A29%3A55%20GMT&stopTime=Thu%2C%2021%20Nov%202024%2001%3A29%3A55%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/shunting314/186/head&lCommit=7e2c8e5d9256ac06205e7cd5e740c9e20ce804d0&rBranch=main&rCommit=565a7942eee1ddc23067cdbae597443d0f2290a0). Our dashboard has been unreliable recently due to the AWS migration.

Differential Revision: [D68340248](https://our.internmc.facebook.com/intern/diff/D68340248)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140249
Approved by: https://github.com/jansel, https://github.com/eellison
2025-01-22 03:37:06 +00:00
2bf772d1ba PEP585 update - torch/_inductor/codegen (#145106)
See #145101 for details.
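
For reference, an illustrative PEP 585 change (not an actual diff from this PR): builtin generics replace their typing-module aliases.
```
# Before:
#   from typing import Dict, List
#   def buckets(xs: List[str]) -> Dict[str, List[str]]: ...
# After (PEP 585 style):
def buckets(xs: list[str]) -> dict[str, list[str]]:
    out: dict[str, list[str]] = {}
    for x in xs:
        out.setdefault(x[:1], []).append(x)
    return out
```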

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145106
Approved by: https://github.com/bobrenjc93
2025-01-18 06:56:03 +00:00
94c0f15302 Revert "cpp_wrapper: Move #includes to per-device header files (#143909)"
This reverts commit d62b3979dadfa4928ec1c76e850f874d49803125.

Reverted https://github.com/pytorch/pytorch/pull/143909 on behalf of https://github.com/kit1980 due to breaking internal builds because of removal of torch/_inductor/codegen/aoti_runtime/implementation.cpp ([comment](https://github.com/pytorch/pytorch/pull/143909#issuecomment-2597188669))
2025-01-17 00:36:38 +00:00
d62b3979da cpp_wrapper: Move #includes to per-device header files (#143909)
This prepares us for the next PR in the stack, where we introduce pre-compiled per-device header files to save compilation time.

Differential Revision: [D67938955](https://our.internmc.facebook.com/intern/diff/D67938955)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143909
Approved by: https://github.com/desertfire
2025-01-15 21:14:02 +00:00
25de671ea8 [Inductor][CPP] Enable Grouped GEMM Template (#143796)
**Summary**
Enable the CPP Grouped GEMM fusion, lowering, and Grouped GEMM Template, following the RFC: https://github.com/pytorch/pytorch/issues/144012

- Support a flexible number of GEMMs
- Share the activation across GEMMs
  - The Grouped GEMM Template supports independent activations
  - However, the pattern matcher requires an anchor node, which serves as the shared activation across GEMMs
- Each GEMM can have a unique weight but the same sizes
- Each GEMM can have a unique bias or none
  - The current PR does not yet support biases; this will be addressed in a follow-up epilogue-fusion PR
- Each GEMM has its own epilogues
  - Epilogue fusion is not yet supported in this PR and will be enabled in an upcoming follow-up epilogue-fusion PR

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear_invalid
python -u -m pytest -s -v test/inductor/test_cpu_cpp_wrapper.py -k test_grouped_linear
```

**Example**
Here is the example, with the generated code linked below:
```
import torch

batch_size = 4
in_features = 512
out_features = 1024
bias = False
dtype = torch.bfloat16

class M(torch.nn.Module):
    def __init__(self, bias):
        super().__init__()
        self.linear0 = torch.nn.Linear(in_features, out_features, bias=bias)
        self.linear1 = torch.nn.Linear(in_features, out_features, bias=bias)

    def forward(self, x):
        return self.linear0(x), self.linear1(x)

if __name__ == "__main__":
    with torch.no_grad():
        input = torch.randn(batch_size, in_features, dtype=dtype)
        m = M(bias=bias).to(dtype=dtype).eval()
        cm = torch.compile(m)
        act_res = cm(input)
```

Generated Code:  https://gist.github.com/leslie-fang-intel/ed2e8d23aeb3586eb504feeace692e16#file-grouped-gemm-generated-code-py

**Next Step**

- Support Epilogue fusion

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143796
Approved by: https://github.com/jgong5, https://github.com/jansel
2025-01-14 05:59:07 +00:00
8633845090 Support nanj in inductor (#144064)
Fixes https://github.com/pytorch/pytorch/issues/144029
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144064
Approved by: https://github.com/amjames, https://github.com/eellison
2025-01-13 14:29:38 +00:00
a3ab27b8e0 Migrate from Tuple -> tuple in torch/_inductor (#144264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144264
Approved by: https://github.com/eellison
2025-01-07 03:27:27 +00:00
f3e5078c27 [Inductor] Relax size constraints for re-inplacing (#143884)
Current reinplacing requires that the input buffer and output buffer have exactly the same storage size. However, matmul padding may increase a tensor's size slightly for better performance, which prevents reinplacing.

This PR relaxes the size constraints to (sketched below):
- the input and output buffers have exactly the same symbolic expression for storage size (i.e., the same sympy str), or
- it is statically known that 0.99 * input_size <= output_size <= input_size
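
For concreteness, here is a minimal sketch of the relaxed check, assuming sympy-backed sizes; `statically_known` is a hypothetical stand-in for Inductor's shape-env query, not its actual API:
```
import sympy

def statically_known(pred) -> bool:
    # Hypothetical stand-in: True only if the predicate evaluates to a
    # concrete boolean; symbolic (undecidable) predicates return False.
    try:
        return bool(pred)
    except TypeError:
        return False

def can_reinplace(input_size, output_size) -> bool:
    inp, out = sympy.sympify(input_size), sympy.sympify(output_size)
    if str(inp) == str(out):  # identical symbolic expressions
        return True
    return (statically_known(out >= sympy.Rational(99, 100) * inp)
            and statically_known(out <= inp))
```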

### Apply on llm.c
See the reuse of `buf1`.
Before relaxing size requirements on re-inplacing: ([P1703512078](https://www.internalfb.com/phabricator/paste/view/P1703512078))
![1](https://github.com/user-attachments/assets/1472f550-6eb8-4d5c-9965-49bbb20d81a9)

After relaxing size requirements on re-inplacing: ([P1703513053](https://www.internalfb.com/phabricator/paste/view/P1703513053))
![2](https://github.com/user-attachments/assets/416294dd-30eb-4e12-a36c-1aebf9af530b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143884
Approved by: https://github.com/eellison
2024-12-31 03:52:47 +00:00
969415885d [inductor][invoke_subgraph] Support None/int as input/output of invoke_subgraph (#139373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139373
Approved by: https://github.com/eellison
2024-12-27 06:46:09 +00:00
f1cbf4b1b5 Enable ruff's unused variable checking everywhere in pytorch (#136965)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136965
Approved by: https://github.com/cyyever, https://github.com/albanD
2024-12-22 02:33:11 +00:00
da67a6a7bb [inductor] Replace set by OrderedSet (#138466)
Uses the set_linter from https://github.com/pytorch/pytorch/pull/138454
and considerable manual editing
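
For context, a minimal insertion-ordered set sketch (illustrative; not PyTorch's actual OrderedSet), showing why this matters for codegen:
```
# Plain sets iterate in hash order, which can vary between runs (e.g.
# with string hash randomization) and make generated code
# non-deterministic. Keying a dict preserves insertion order.
class OrderedSet:
    def __init__(self, items=()):
        self._d = dict.fromkeys(items)

    def add(self, item):
        self._d[item] = None

    def __contains__(self, item):
        return item in self._d

    def __iter__(self):
        return iter(self._d)
```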

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138466
Approved by: https://github.com/eellison
2024-12-13 16:08:45 +00:00
dc23f1944a Remove unused Python variables in torch/[_-a]* (#133492)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492
Approved by: https://github.com/albanD
2024-12-12 17:39:14 +00:00
5c97ac9721 Revert "Remove unused Python variables in torch/[_-a]* (#133492)"
This reverts commit fda975a7b3071a20dab8fc2c4e453479e1bb7cf2.

Reverted https://github.com/pytorch/pytorch/pull/133492 on behalf of https://github.com/clee2000 due to Sorry, I need to revert this in order to revert something else.  The only thing you need to do is rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/133492#issuecomment-2536635516))
2024-12-11 17:29:12 +00:00
fda975a7b3 Remove unused Python variables in torch/[_-a]* (#133492)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492
Approved by: https://github.com/albanD
2024-12-10 21:48:44 +00:00
539286a67b Inductor annotations (#130429)
Add NVTX annotations around training phases and buffer computations

RFC/discussion: https://dev-discuss.pytorch.org/t/rfc-performance-profiling-at-scale-with-details-nvtx-annotations/2224

<img width="2160" alt="Screenshot 2024-07-10 at 11 48 04" src="https://github.com/pytorch/pytorch/assets/1175576/9ade139c-d393-473f-9b68-6c25da367dc4">
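
A minimal sketch of the kind of annotation added (assumes a CUDA build; torch.cuda.nvtx is PyTorch's existing NVTX binding, though the phase names here are illustrative):
```
import torch

def training_step(model, x):
    with torch.cuda.nvtx.range("forward"):
        out = model(x)
    with torch.cuda.nvtx.range("backward"):
        out.sum().backward()
    return out
```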

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130429
Approved by: https://github.com/aorenste, https://github.com/eellison, https://github.com/albanD

Co-authored-by: Cedric GESTES <cedric.gestes@flex.ai>
2024-12-10 08:53:39 +00:00
4d43ec2189 [AOTI] Switch GPU codegen to one-pass (#141980)
Summary: With autotune_at_compile_time enabled, AOTI can now perform CUDA codegen in one pass. CUDA-kernel-related code is generated in a deferred way, after autotuning is done. This one-pass implementation eliminates the issues caused by disparities between passes in the previous two-pass implementation (the source of multiple bug reports in the past). It also avoids cloning the mutated inputs that the two-pass implementation required, reducing GPU memory consumption.

Differential Revision: [D66739414](https://our.internmc.facebook.com/intern/diff/D66739414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141980
Approved by: https://github.com/chenyang78
2024-12-09 14:40:34 +00:00
5035ff0796 [AOTI] Refactor codegen_inputs signature (#142133)
Summary: Since codegen_inputs only writes to self.prefix, drop IndentedBuffer from its parameters to make the API consistent with other similar functions.

Differential Revision: [D66881040](https://our.internmc.facebook.com/intern/diff/D66881040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142133
Approved by: https://github.com/chenyang78
2024-12-08 15:05:03 +00:00
0367a31401 [inductor] Minor typing changes (#142219)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142219
Approved by: https://github.com/Skylion007, https://github.com/yanboliang
2024-12-07 17:48:37 +00:00
39482907be [AOTI] Refactor codegen_inputs in wrapper codegen (#141965)
Summary: Fork codegen_inputs for CppWrapperCodegen, because the behavior between Python and cpp needs to diverge: on the Python side, input-backed symbols need to be generated for the autotune block. This prepares for one-pass AOTI CUDA codegen.

Differential Revision: [D66718225](https://our.internmc.facebook.com/intern/diff/D66718225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141965
Approved by: https://github.com/chenyang78
ghstack dependencies: #141388, #141387, #141979
2024-12-05 19:49:34 +00:00
2fd8a7be71 [AOTI] Refactor additional_files generation (#141979)
Summary: https://github.com/pytorch/pytorch/pull/140675 adds logic to collect all the generated cubin file paths into an additional_files list, but the collection should only happen when DeferredGpuKernelLine is materialized. This is to prepare for one-pass AOTI CUDA codegen.

Differential Revision: [D66718227](https://our.internmc.facebook.com/intern/diff/D66718227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141979
Approved by: https://github.com/chenyang78
ghstack dependencies: #141388, #141387
2024-12-05 19:49:02 +00:00
5f28c42746 [AOTI] Remove several overloaded members from WrapperCodegen (#141387)
Summary: Remove several overloaded string members from the WrapperCodegen classes, including open_bracket, closed_bracket, size, and stride. Instead of relying on polymorphism, we explicitly generate different strings for PythonWrapperCodegen and CppWrapperCodegen. This prepares for one-pass AOTI CUDA codegen.

Differential Revision: [D66459991](https://our.internmc.facebook.com/intern/diff/D66459991)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141387
Approved by: https://github.com/chenyang78
ghstack dependencies: #141388
2024-12-05 19:29:38 +00:00
4cc0fc2707 [AOTI] Remove WrapperCodegen.expr_printer (#141388)
Summary: Avoid using expr_printer as an overridden class member of WrapperCodegen. Instead, use pexpr and cexpr explicitly for Python and cpp expression printing, respectively. This prepares for one-pass AOTI CUDA codegen, where PythonWrapperCodegen generates the autotune block and CppWrapperCodegen generates the model code.

Differential Revision: [D66459992](https://our.internmc.facebook.com/intern/diff/D66459992)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141388
Approved by: https://github.com/chenyang78
2024-12-05 19:20:39 +00:00
f16e08042c [user triton] Fix grid codegen for configs with empty kwargs (#141824)
Fixes #141823 by special-casing the codegen of `if <config kwargs>: return <grid>` for configs that have no kwargs.
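
A hedged sketch of the idea (illustrative helper, not Inductor's actual codegen function): with an empty config there is no condition to guard on, so an unconditional return is emitted instead of invalid `if : return ...` code.
```
def emit_grid_branch(config_kwargs: dict, grid: str) -> str:
    # No kwargs in the config: the guard would be empty, so return directly.
    if not config_kwargs:
        return f"return {grid}"
    cond = " and ".join(f"meta[{k!r}] == {v!r}" for k, v in config_kwargs.items())
    return f"if {cond}: return {grid}"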

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141824
Approved by: https://github.com/Chillee
2024-12-02 04:17:21 +00:00
17fd53d8e5 [Inductor] Inplacing with Donated Buffer (#140113)
Currently, Inductor does not in-place update a buffer if it is an input buffer, because we don't know whether the input will be used by other functions.

A donated buffer carries the additional information that an input buffer will not be used by other functions, so we can in-place update donated buffers when possible.
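
For illustration, a sketch of the semantics a donated buffer enables (names are hypothetical; this is not Inductor's API):
```
import torch

# If the caller guarantees `buf` is dead after this call (i.e., the
# buffer is donated), the compiler may write the result into buf's
# storage instead of allocating a new output buffer.
def fused_add_relu(buf: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
    return buf.add_(other).relu_()  # safe only because buf is donated
```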

[Dashboard](https://hud.pytorch.org/benchmark/torchbench/inductor_dynamic?dashboard=torchinductor&startTime=Mon,%2011%20Nov%202024%2018:14:36%20GMT&stopTime=Mon,%2018%20Nov%202024%2018:14:36%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=bf/donated-buffer-inplace&lCommit=5df0769c00e6f9000caeb10fd5cbf0b165f69c2a&rBranch=main&rCommit=2b39a8db7741b816b03677a9c6fec1af05640dee)

![image](https://github.com/user-attachments/assets/f19d961f-7973-418e-9de8-5c2a97950478)
![image](https://github.com/user-attachments/assets/df3bd6a9-58b8-4e8a-8397-9e3b1de9adfe)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140113
Approved by: https://github.com/eellison
2024-11-27 18:51:52 +00:00
65dbd5cc2d Revert "[Inductor] Inplacing with Donated Buffer (#140113)"
This reverts commit eecc8e362c2eb192cbe13322af941d09ca647a6b.

Reverted https://github.com/pytorch/pytorch/pull/140113 on behalf of https://github.com/BoyuanFeng due to break test_donated_buffer_inplace internally since donated_buffer = False if is_fbcode() else True ([comment](https://github.com/pytorch/pytorch/pull/140113#issuecomment-2501954300))
2024-11-26 21:20:59 +00:00
000d4e9d43 [hop][inductor] remove codegen_subgraph_suffix and directly assign call function result to outer outputs (#141181)
Before the PR: P1683356646
After the PR: P1683356585

Relevant changes:
```
@@ -231,7 +421,8 @@
             true_graph_0_args = [true_graph_0_arg0_1, true_graph_0_arg1_1]
             del true_graph_0_arg0_1
             del true_graph_0_arg1_1
-            (true_graph_0_buf0,) = true_graph_0(true_graph_0_args)
-            buf5[0] = true_graph_0_buf0
+            (buf5[0],) = true_graph_0(true_graph_0_args)
         else:
             # subgraph: false_graph_0
             false_graph_0_arg0_1 = buf4
@@ -239,7 +430,8 @@
             false_graph_0_args = [false_graph_0_arg0_1, false_graph_0_arg1_1]
             del false_graph_0_arg0_1
             del false_graph_0_arg1_1
-            (false_graph_0_buf0,) = false_graph_0(false_graph_0_args)
-            buf5[0] = false_graph_0_buf0
+            (buf5[0],) = false_graph_0(false_graph_0_args)
         del arg2_1
         del buf4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141181
Approved by: https://github.com/anijain2305
ghstack dependencies: #140334, #141172
2024-11-26 17:32:51 +00:00
aae581d921 [hop free symbols][inductor] remove un-used add_symbol_graph_inputs (#141172)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141172
Approved by: https://github.com/Chillee
ghstack dependencies: #140334
2024-11-26 17:32:50 +00:00
eecc8e362c [Inductor] Inplacing with Donated Buffer (#140113)
Currently, Inductor does not in-place update a buffer if it is an input buffer, because we don't know whether the input will be used by other functions.

A donated buffer carries the additional information that an input buffer will not be used by other functions, so we can in-place update donated buffers when possible.

[Dashboard](https://hud.pytorch.org/benchmark/torchbench/inductor_dynamic?dashboard=torchinductor&startTime=Mon,%2011%20Nov%202024%2018:14:36%20GMT&stopTime=Mon,%2018%20Nov%202024%2018:14:36%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=bf/donated-buffer-inplace&lCommit=5df0769c00e6f9000caeb10fd5cbf0b165f69c2a&rBranch=main&rCommit=2b39a8db7741b816b03677a9c6fec1af05640dee)

![image](https://github.com/user-attachments/assets/f19d961f-7973-418e-9de8-5c2a97950478)
![image](https://github.com/user-attachments/assets/df3bd6a9-58b8-4e8a-8397-9e3b1de9adfe)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140113
Approved by: https://github.com/eellison
2024-11-26 17:19:50 +00:00
4742080ed9 [AOTI XPU] Enable Cpp wraper for Intel GPU. (#135318)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135318
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire
2024-11-26 11:51:32 +00:00
6eca0aee76 [inductor] Refactor ir.Layout into ir.OutputSpec (#140910)
This separates the concepts of a Layout (size/stride/etc.) and an OutputSpec (which can include multiple outputs), which should make typing easier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140910
Approved by: https://github.com/ezyang
ghstack dependencies: #140895
2024-11-21 20:01:57 +00:00
a847790400 [inductor] reset to zero support for user defined Triton kernels (#140982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140982
Approved by: https://github.com/aakhundov
2024-11-21 18:53:23 +00:00
808f0f656d [inductor] Refactor MutableBox to make IRNode typing easier (#140895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140895
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-11-20 19:50:46 +00:00
4ffce45100 AOTInductor: properly generate cpp_wrapper runtime assertions (#141050)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141050
Approved by: https://github.com/desertfire
ghstack dependencies: #141058
2024-11-20 19:17:47 +00:00
d472a5f680 Revert "[inductor] Refactor MutableBox to make IRNode typing easier (#140895)"
This reverts commit c79e78b5034198f9d6801b4fef710b9b9b0e9193.

Reverted https://github.com/pytorch/pytorch/pull/140895 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think test_torchbind_inductor is failing in trunk after this lands ([comment](https://github.com/pytorch/pytorch/pull/140895#issuecomment-2484679319))
2024-11-19 04:25:41 +00:00
c79e78b503 [inductor] Refactor MutableBox to make IRNode typing easier (#140895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140895
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-11-19 00:24:35 +00:00
819b0ebd94 cpp_wrapper_cpu: Ensure reinterpret_view results in RAIIAtenTensorHandle (#139411)
Fixes segfaults caused by views being implicitly converted to AtenTensorHandle, then being destroyed before use.

Closes #135559.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139411
Approved by: https://github.com/desertfire

Co-authored-by: Bin Bao <binbao@meta.com>
2024-11-17 04:16:59 +00:00
a173186566 [RFC] Implement caching for user defined triton kernels (#140326)
This PR adds caching for user-defined triton kernels by putting the transitive closure of the kernel's source code in node.meta, along with the constant arguments.

One HUGE hack we do here: a node looks like
```
triton_kernel_wrapper_functional_proxy = torch.ops.higher_order.triton_kernel_wrapper_functional(kernel_idx = 0, constant_args_idx = 1, grid = [(1, 1, 1)], tma_descriptor_metadata = {}, kwargs = {'in_ptr0': arg0_1, 'in_ptr1': arg1_1, 'out_ptr': arg0_1}, tensors_to_clone = ['out_ptr']);
```
so we use a regex to remove the `kernel_idx = 0, constant_args_idx = 1` parts, as they are not relevant to the cache hash. This is horrible, and I'd like to eventually stop using pickle as the hashing mechanism, but that is a longer project.
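
A hedged sketch of that normalization (the pattern is illustrative, not the PR's exact regex):
```
import re

# Strip the cache-irrelevant kernel_idx / constant_args_idx fields from
# the node's string representation before hashing it.
def normalize_for_cache_key(node_repr: str) -> str:
    return re.sub(r"(kernel_idx|constant_args_idx) = \d+, ", "", node_repr)
```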

Differential Revision: [D65895744](https://our.internmc.facebook.com/intern/diff/D65895744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140326
Approved by: https://github.com/zou3519
2024-11-16 02:37:16 +00:00
baf756a785 [reland] [aoti] Selectively package AOTI generated files (#140675)
Summary: Reland  https://github.com/pytorch/pytorch/pull/140022

Test Plan: CI

Differential Revision: D65929964

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140675
Approved by: https://github.com/desertfire
2024-11-15 23:48:34 +00:00
222d4b48b1 Revert "cpp_wrapper_cpu: Ensure reinterpret_view results in RAIIAtenTensorHandle (#139411)"
This reverts commit 761b42bc085190e272a930847694e872d92a1255.

Reverted https://github.com/pytorch/pytorch/pull/139411 on behalf of https://github.com/kit1980 due to breaking internal inductor test ([comment](https://github.com/pytorch/pytorch/pull/139411#issuecomment-2477235367))
2024-11-14 19:25:46 +00:00
85deef9ede [AOTI][refactor] Rename generate_extern_kernel_alloc_and_find_schema_if_needed (#140447)
Summary: Rename generate_extern_kernel_alloc_and_find_schema_if_needed to better reflect its meaning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140447
Approved by: https://github.com/chenyang78
2024-11-14 01:40:58 +00:00
26fde110db Refactor user-defined triton kernel source code collection (#140577)
Differential Revision: [D65895743](https://our.internmc.facebook.com/intern/diff/D65895743)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140577
Approved by: https://github.com/zou3519
2024-11-13 22:12:17 +00:00
b4cc5d38b4 Revert "[aoti] Remove dir after packaging (#140022)"
This reverts commit ba136a78ba613d3c7f5d2de53b9fff556e04cfba.

Reverted https://github.com/pytorch/pytorch/pull/140022 on behalf of https://github.com/angelayi due to sorry I realized I need to land from internal ([comment](https://github.com/pytorch/pytorch/pull/140022#issuecomment-2473814720))
2024-11-13 14:43:15 +00:00
ba136a78ba [aoti] Remove dir after packaging (#140022)
Update AOTI to return a list of files that it generates when `aot_inductor.package=True`. Then we will only package the files that are in that list.

This should fix the [caching issue](https://fb.workplace.com/groups/1028545332188949/permalink/1081702043539944/) and hopefully https://github.com/pytorch/pytorch/issues/140053.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140022
Approved by: https://github.com/larryliu0820, https://github.com/desertfire, https://github.com/malfet
2024-11-13 12:17:19 +00:00
d48ea29b9a Revert "[aoti] Remove dir after packaging (#140022)"
This reverts commit 8c6abe5a8c42be3909496d2cd3d1f194a8493460.

Reverted https://github.com/pytorch/pytorch/pull/140022 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the lint failure is legit ([comment](https://github.com/pytorch/pytorch/pull/140022#issuecomment-2471847439))
2024-11-12 23:35:27 +00:00
1f590feaf7 [AOTI][refactor] Update codegen_int_array_var API (#140299)
Summary: codegen_int_array_var and codegen_reinterpret_view need to call different writeline functions depending on which part of the code they are writing. Previously their APIs took a writer and implicitly assigned a default writer if needed, which was not intuitive. Update their APIs to explicitly take a writeline function.

Differential Revision: [D65774584](https://our.internmc.facebook.com/intern/diff/D65774584)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140299
Approved by: https://github.com/frank-wei, https://github.com/chenyang78
2024-11-12 21:39:41 +00:00
8c6abe5a8c [aoti] Remove dir after packaging (#140022)
Update AOTI to return a list of files that it generates when `aot_inductor.package=True`. Then we will only package the files that are in that list.

This should fix the [caching issue](https://fb.workplace.com/groups/1028545332188949/permalink/1081702043539944/) and hopefully https://github.com/pytorch/pytorch/issues/140053.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140022
Approved by: https://github.com/larryliu0820, https://github.com/desertfire, https://github.com/malfet
2024-11-12 21:36:24 +00:00
761b42bc08 cpp_wrapper_cpu: Ensure reinterpret_view results in RAIIAtenTensorHandle (#139411)
Fixes segfaults caused by views being implicitly converted to AtenTensorHandle, then being destroyed before use.

Closes #135559.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139411
Approved by: https://github.com/desertfire
2024-11-12 15:22:38 +00:00
2c77352fe2 [AOTI][refactor] Clean up call chain in wrapper codegen (#136531)
Summary: For cpp wrapper, generate_kernel_call and define_kernel need to handle both CPU and GPU kernels. Refactor the code to remove nested super() calls.

Differential Revision: [D65639095](https://our.internmc.facebook.com/intern/diff/D65639095)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136531
Approved by: https://github.com/frank-wei
2024-11-11 22:00:42 +00:00
838958de94 [inductor] Support autotune restore_value for user-defined Triton kernels (#139851)
This PR adds support for the `restore_value` argument of
`@triton.autotune` for user-defined Triton kernels in PT2.

The `kernel.restore_idx` values are extracted in
`ir.UserDefinedTritonKernel`, and the corresponding arg names are
placed into `triton_meta["restore_value"]`. From there, they are
added to the existing `mutated_arg_names` in the caching-autotuner
infra, which leads to the listed args being cloned.
This achieves an effect equivalent to the native `restore_value`.
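
For illustration, a hedged sketch of a user-defined kernel using `restore_value` (assumes a recent Triton release; the kernel itself is made up):
```
import triton
import triton.language as tl

@triton.autotune(
    configs=[triton.Config({"BLOCK": 64}), triton.Config({"BLOCK": 128})],
    key=["n"],
    # x_ptr is mutated in place, so Triton restores its contents between
    # autotuning runs; this PR makes Inductor honor that by treating
    # x_ptr as a mutated arg to clone during autotuning.
    restore_value=["x_ptr"],
)
@triton.jit
def scale_inplace(x_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(x_ptr + offs, x * 2.0, mask=mask)
```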

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139851
Approved by: https://github.com/oulgen
2024-11-08 14:59:00 +00:00