56 Commits

Author SHA1 Message Date
c73f5080de Migrating some more callsites (#163580)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163580
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #165582
2025-10-19 15:52:17 +00:00
9ccd0f5e31 Fix unbacked symint and memory leak in inductor memory planning (#159839)
Summary:

In memory planning, some allocation sizes involve unbacked symints. Their values are not known until they are computed at run time, so **allocation pools that involve unbacked symints cannot be allocated until we have the values of the unbacked symints**.

So we add a notion of `earliest_available` to Allocation nodes. If an allocation node has an unbacked symint, it is available only when its live range begins.

Then in AllocationPool, if a pool involves an Allocation node that has an earliest available time, we restrict its live range.

If a block's earliest available time is later than the start of a pool's live range, we cannot allocate it from the pool.
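
A minimal sketch of the availability check described above. Only the name `earliest_available` comes from this commit message; `LiveRange`, `Block`, `Pool`, `has_unbacked_symint` and `can_allocate` are hypothetical stand-ins, not the actual Inductor classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LiveRange:
    begin: int
    end: int


@dataclass
class Block:
    """Hypothetical stand-in for an Allocation node."""

    name: str
    live_range: LiveRange
    has_unbacked_symint: bool = False

    @property
    def earliest_available(self) -> Optional[int]:
        # A size that depends on an unbacked symint is only known once the
        # block's live range begins; otherwise there is no restriction.
        return self.live_range.begin if self.has_unbacked_symint else None


@dataclass
class Pool:
    """Hypothetical stand-in for an AllocationPool."""

    live_range: LiveRange

    def can_allocate(self, block: Block) -> bool:
        # The pool is allocated at the start of its live range, so a block
        # whose size becomes known later than that cannot be placed in it.
        earliest = block.earliest_available
        return earliest is None or earliest <= self.live_range.begin
```

For example, a pool that starts at step 2 can hold a statically sized block that lives from step 5 to 8, but not a block with the same live range whose size depends on an unbacked symint.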

We also fix a memory leak caused by allocating a tensor without wrapping it in `RAIIAtenTensorHandle`.

In the Python wrapper for JIT Inductor, `codegen_alloc_from_pool` doesn't actually write the alloc lines to the wrapper; it just returns the allocation string. However, in cpp_wrapper, `codegen_alloc_from_pool` actually writes to the wrapper. Specifically, it writes the following and returns the string `RAIIAtenTensorHandle`.

```
AtenTensorHandle handle_name;
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__alloc_from_pool(....));
```

This is bug-prone. **If you write aoti_torch__alloc_from_pool lines, you must write the RAIIAtenTensorHandle as well**, otherwise you get memory leaks.

We remove the alloc_from_pool call from codegen_create, because this doesn't work for AOTI. In python wrapper, we can generate the same alloc_from_pool variable name for the same block, but cpp_wrapper will generate a different variable name for each call to alloc_from_pool.
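
The rule above can be illustrated with a small Python sketch of a codegen helper that has this side effect. The function names, the `call_args` string and the position of the output pointer are assumptions made for illustration; this is not the actual cpp_wrapper implementation.

```python
from typing import List


def codegen_alloc_from_pool_cpp(lines: List[str], handle_name: str, call_args: str) -> str:
    """Hypothetical cpp_wrapper-style helper: writes lines AND returns a string."""
    # Side effect: the raw handle declaration and the allocation call are
    # written to the wrapper immediately.
    lines.append(f"AtenTensorHandle {handle_name};")
    lines.append(
        "AOTI_TORCH_ERROR_CODE_CHECK("
        f"aoti_torch__alloc_from_pool({call_args}, &{handle_name}));"
    )
    # The returned expression is what takes ownership of the raw handle.
    return f"RAIIAtenTensorHandle({handle_name})"


def codegen_allocation(lines: List[str], buffer_name: str, handle_name: str, call_args: str) -> None:
    """Hypothetical caller that always emits the owning wrapper."""
    owner = codegen_alloc_from_pool_cpp(lines, handle_name, call_args)
    # Discarding `owner` here would leave the raw handle without an owner,
    # which is the leak described above.
    lines.append(f"auto {buffer_name} = {owner};")
```

A caller that invokes the helper only for its return value, as the Python wrapper can safely do, would emit the first two lines without the third.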

Test Plan:
```
 python test/inductor/test_memory_planning.py
```

Rollback Plan:

Differential Revision: D79603119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159839
Approved by: https://github.com/jansel
2025-08-11 17:16:15 +00:00
f5e6e52f25 [BE][PYFMT] migrate PYFMT for test/inductor/ to ruff format (#148186)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148186
Approved by: https://github.com/jansel
2025-06-24 11:12:11 +00:00
c0a0761871 [Inductor] Refactor wrapper codegen to use Wrapper IR. (#150458)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/146942.

# Feature

This PR refactors the existing wrapper codegen into `WrapperLine` subclasses, extending the existing Memory Planning IR into a fully-fledged Wrapper IR. See the diagram below.

![wrapper_ir](https://github.com/user-attachments/assets/a61db21b-caf3-45d2-bfdb-91066ae4ba6b)

The IR currently supports the following ops:
- All existing memory planning IR ops (`AllocateLine`, `FreeIfNotReusedLine`, etc.)
- Reinterpret views (`ReinterpretLine`)
- Kernel definitions (`KernelDefinitionLine`)
- Calls to defined kernels (`KernelCallLine`)
- Calls to extern kernels (`ExternKernelLine`, `ExternKernelAllocLine`)
- Ops with multiple outputs (`MultiOutputLine`)
- Tensor cleanup at the end of a graph (`FreeLine`)
- Leaving comments in code (`CommentLine`)

There are two main motivations for this refactor:
1. Unlike free-form C++ and Python code, Wrapper IR lines provide structured information about what the wrapper code does. This serves as a natural extension point for other types of wrapper codegen. For example, the parent PR generates FX IR from Wrapper IR. Wrapper IR aims to give new backends enough information to generate wrapper code without needing to modify core Inductor files such as `ir.py`.
2. This design will hopefully promote stronger modularity and encapsulation.
   a. Inductor's core compilation passes don't need to worry about whether they're targeting Python, C++, FX or anything else. They can simply focus on generating Wrapper IR, and target-specific code can be refactored into the various backends.
   b. Backends do not need to know about all the details and internal state of `V.graph` IR. For example, they don't need to consider whether a buffer has been removed from the graph when generating code. Wrapper IR will hopefully provide a simpler interface for generating wrapper code, which abstracts away the details of device code.

# Implementation details

The implementation mainly consists of separating direct C++/Python codegen into two phases:
 1. Emit Wrapper IR lines describing what the wrapper code is supposed to do.
 2. Inside the `codegen()` method of each `WrapperLine`, call backend methods which generate pure Python/C++ code using the information stored in the Wrapper IR line. For example, `KernelCallLine` calls `wrapper._generate_kernel_call_helper`, which is overridden by the various Python and C++ backends to generate the final wrapper code.
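
The two phases can be sketched roughly as follows. `KernelCallLine`, `codegen()` and `_generate_kernel_call_helper` are names from this PR; the fields, signatures, backend classes and emitted strings are simplified assumptions, not the real Inductor code.

```python
from dataclasses import dataclass
from typing import List


class PythonBackend:
    """Hypothetical Python wrapper backend."""

    def _generate_kernel_call_helper(self, kernel_name: str, call_args: List[str]) -> str:
        return f"{kernel_name}.run({', '.join(call_args)})"


class CppBackend(PythonBackend):
    """Hypothetical C++ wrapper backend overriding the helper."""

    def _generate_kernel_call_helper(self, kernel_name: str, call_args: List[str]) -> str:
        return f"{kernel_name}({', '.join(call_args)});"


@dataclass
class KernelCallLine:
    """Phase 1 output: structured data describing what the wrapper should do."""

    wrapper: PythonBackend
    kernel_name: str
    call_args: List[str]

    def codegen(self, code: List[str]) -> None:
        # Phase 2: delegate to the backend, which turns the structured
        # information into target-specific source text.
        code.append(
            self.wrapper._generate_kernel_call_helper(self.kernel_name, self.call_args)
        )
```

The same IR line yields `triton_poi_0.run(a, b)` with the Python backend and `triton_poi_0(a, b);` with the C++ one, so the pass that emitted the line does not need to know which target it is serving.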

The main difficulty in implementing this is that we need to be careful that code is generated in the correct order. Wrapper codegen happens in two passes: first we write code into `self.lines` which mainly contains wrapper IR, but can also contain raw Python or C++ lines in some situations. Then, we convert the wrapper IR into the final Python/C++ code in `self.wrapper_call`. Since the same macros may be used in both passes, it's difficult to ensure that code is written to the correct buffer. The easiest solution for this was to implement a context manager overriding the `writeline` method to write to  `self.wrapper_call` after memory planning is finished. This way, `writeline` writes to `self.lines` in the first pass, and `self.wrapper_call` in the second. This obviated the need to pass `code` or `writeline` variables all the way through the call stack, which would have touched most of the existing macros.
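
A minimal sketch of the context-manager idea, assuming a hypothetical `Wrapper` class and a hypothetical `set_writeline` helper; only `writeline`, `self.lines` and `self.wrapper_call` are names taken from the description above.

```python
import contextlib
from typing import Callable, Iterator, List


class Wrapper:
    """Hypothetical stand-in for the wrapper codegen class."""

    def __init__(self) -> None:
        self.lines: List[str] = []         # first pass: mostly Wrapper IR
        self.wrapper_call: List[str] = []  # second pass: final Python/C++ code

    def writeline(self, line: str) -> None:
        # Default target, used during the first pass.
        self.lines.append(line)

    @contextlib.contextmanager
    def set_writeline(self, new: Callable[[str], None]) -> Iterator[None]:
        # Temporarily shadow `writeline` so that helpers shared by both
        # passes write to a different buffer without extra arguments.
        old = self.writeline
        self.writeline = new
        try:
            yield
        finally:
            self.writeline = old


w = Wrapper()
w.writeline("ir_line")                        # goes to w.lines
with w.set_writeline(w.wrapper_call.append):
    w.writeline("final_code")                 # goes to w.wrapper_call
```

Helpers keep calling `self.writeline(...)` unchanged in both passes, which is why no `code` or `writeline` argument has to be threaded through the call stack.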

# Test plan

Since this refactor touches all the existing wrapper codegen classes, the existing CI provides good coverage.

The parent PR introduces new tests for the FX IR backend. Among other things, these tests assert that `self.lines` only contains Wrapper IR lines, and no free-form code. While this would not be true of all programs today, the tests suggest that the IR implemented in this PR is sufficient to cover basic PyTorch usage.

# Future directions

These two goals are only partially realized by this PR. There are several important steps which still undergo direct Python/C++ codegen in core files:
 - User-defined Triton kernels.
 - Reinterpret views on outputs, from `gen_output_refs()`. (In the parent PR, the FX converter has a custom way of handling this. This can eventually be ported into Wrapper IR.)
 - Fallback ops with custom `codegen()` methods, e.g. `ScatterFallback`.
 - Misc. C++ lines emitted by the various cpp backends, e.g. declaring constants.

These cases will gradually be handled in subsequent PRs, as the Inductor->FX converter expands its coverage. Given that these refactors are pretty tricky to do, it seems wiser to execute them in stages, as opposed to porting everything to Wrapper IR at once. Some Python and C++ codegen still lives in core files such as `ir.py`, as described in previous sections. Hopefully, this PR will serve as a starting point which moves the codebase towards a more modular design. Over time, we can gradually refactor the remaining codegen (mainly in `ir.py`) into backend classes.

One limitation of this PR is that codegen still happens in two phases during `PythonWrapperCodegen`. First, we generate Wrapper IR into `self.lines`, and from there we generate Python or C++ code into `self.wrapper_call`, `self.header`, etc. In the long term, it would be cleaner to split wrapper IR into its own class which doesn't deal with Python/C++ codegen at all. (See the diagram at the top.) That would strictly enforce the boundary between Wrapper IR and Python/C++ wrapper code. However, this would probably be a much larger refactor.

Another limitation of the current code is that the helper functions have a lot of call args. It's also possible to clean this up by passing Wrapper IR ops e.g. `KernelCallLine` into helper functions like `_generate_kernel_call_helper`, since they store all the arguments. However, that change would likely be prone to merge conflicts, so I would like to save it for follow-up PRs if possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150458
Approved by: https://github.com/eellison
2025-04-15 17:28:36 +00:00
8157e76b79 Revert "[Inductor] Refactor wrapper codegen to use Wrapper IR. (#150458)"
This reverts commit fe7f425de7b76ef33d308d0a03779b97a914d186.

Reverted https://github.com/pytorch/pytorch/pull/150458 on behalf of https://github.com/clee2000 due to broke a lot of tests internally? D72906459 ([comment](https://github.com/pytorch/pytorch/pull/150458#issuecomment-2799578597))
2025-04-13 03:52:42 +00:00
fe7f425de7 [Inductor] Refactor wrapper codegen to use Wrapper IR. (#150458)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/146942.

# Feature

This PR refactors the existing wrapper codegen into `WrapperLine` subclasses, extending the existing Memory Planning IR into a fully-fledged Wrapper IR. See the diagram below.

![wrapper_ir](https://github.com/user-attachments/assets/a61db21b-caf3-45d2-bfdb-91066ae4ba6b)

The IR currently supports the following ops:
- All existing memory planning IR ops (`AllocateLine`, `FreeIfNotReusedLine`, etc.)
- Reinterpret views (`ReinterpretLine`)
- Kernel definitions (`KernelDefinitionLine`)
- Calls to defined kernels (`KernelCallLine`)
- Calls to extern kernels (`ExternKernelLine`, `ExternKernelAllocLine`)
- Ops with multiple outputs (`MultiOutputLine`)
- Tensor cleanup at the end of a graph (`FreeLine`)
- Leaving comments in code (`CommentLine`)

There are two main motivations for this refactor:
1. Unlike free-form C++ and Python code, Wrapper IR lines provide structured information about what the wrapper code does. This serves as a natural extension point for other types of wrapper codegen. For example, the parent PR generates FX IR from Wrapper IR. Wrapper IR aims to give new backends enough information to generate wrapper code without needing to modify core Inductor files such as `ir.py`.
2. This design will hopefully promote stronger modularity and encapsulation.
   a. Inductor's core compilation passes don't need to worry about whether they're targeting Python, C++, FX or anything else. They can simply focus on generating Wrapper IR, and target-specific code can be refactored into the various backends.
   b. Backends do not need to know about all the details and internal state of `V.graph` IR. For example, they don't need to consider whether a buffer has been removed from the graph when generating code. Wrapper IR will hopefully provide a simpler interface for generating wrapper code, which abstracts away the details of device code.

# Implementation details

The implementation mainly consists of separating direct C++/Python codegen into two phases:
 1. Emit Wrapper IR lines describing what the wrapper code is supposed to do.
 2. Inside the `codegen()` method of each `WrapperLine`, call backend methods which generate pure Python/C++ code using the information stored in the Wrapper IR line. For example, `KernelCallLine` calls `wrapper._generate_kernel_call_helper`, which is overridden by the various Python and C++ backends to generate the final wrapper code.

The main difficulty in implementing this is that we need to be careful that code is generated in the correct order. Wrapper codegen happens in two passes: first we write code into `self.lines` which mainly contains wrapper IR, but can also contain raw Python or C++ lines in some situations. Then, we convert the wrapper IR into the final Python/C++ code in `self.wrapper_call`. Since the same macros may be used in both passes, it's difficult to ensure that code is written to the correct buffer. The easiest solution for this was to implement a context manager overriding the `writeline` method to write to  `self.wrapper_call` after memory planning is finished. This way, `writeline` writes to `self.lines` in the first pass, and `self.wrapper_call` in the second. This obviated the need to pass `code` or `writeline` variables all the way through the call stack, which would have touched most of the existing macros.

# Test plan

Since this refactor touches all the existing wrapper codegen classes, the existing CI provides good coverage.

The parent PR introduces new tests for the FX IR backend. Among other things, these tests assert that `self.lines` only contains Wrapper IR lines, and no free-form code. While this would not be true of all programs today, the tests suggest that the IR implemented in this PR is sufficient to cover basic PyTorch usage.

# Future directions

These two goals are only partially realized by this PR. There are several important steps which still undergo direct Python/C++ codegen in core files:
 - User-defined Triton kernels.
 - Reinterpret views on outputs, from `gen_output_refs()`. (In the parent PR, the FX converter has a custom way of handling this. This can eventually be ported into Wrapper IR.)
 - Fallback ops with custom `codegen()` methods, e.g. `ScatterFallback`.
 - Misc. C++ lines emitted by the various cpp backends, e.g. declaring constants.

These cases will gradually be handled in subsequent PRs, as the Inductor->FX converter expands its coverage. Given that these refactors are pretty tricky to do, it seems wiser to execute them in stages, as opposed to porting everything to Wrapper IR at once. Some Python and C++ codegen still lives in core files such as `ir.py`, as described in previous sections. Hopefully, this PR will serve as a starting point which moves the codebase towards a more modular design. Over time, we can gradually refactor the remaining codegen (mainly in `ir.py`) into backend classes.

One limitation of this PR is that codegen still happens in two phases during `PythonWrapperCodegen`. First, we generate Wrapper IR into `self.lines`, and from there we generate Python or C++ code into `self.wrapper_call`, `self.header`, etc. In the long term, it would be cleaner to split wrapper IR into its own class which doesn't deal with Python/C++ codegen at all. (See the diagram at the top.) That would strictly enforce the boundary between Wrapper IR and Python/C++ wrapper code. However, this would probably be a much larger refactor.

Another limitation of the current code is that the helper functions have a lot of call args. It's also possible to clean this up by passing Wrapper IR ops e.g. `KernelCallLine` into helper functions like `_generate_kernel_call_helper`, since they store all the arguments. However, that change would likely be prone to merge conflicts, so I would like to save it for follow-up PRs if possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150458
Approved by: https://github.com/eellison
2025-04-12 01:15:19 +00:00
f649ee73ce Use source hashing to generate consistent symbolic ids (#149665)
This PR was inspired by internal models that were cache missing due to PGO. At a high level the problem looks as follows

Run 1, Invocation 1: We do static compile, save some example values in PGO/automatic dynamic

Run 1, Invocation 2: We detect varying inputs, do dynamic compile, get a dynamic graph and save to PGO. Crucially what we save to PGO is actually a superset of what is actually dynamic. If we notice an input was varying, we mark it as dynamic in PGO even if later on that value gets specialized. When a value gets specialized, we actually remove the symbol from the graph. This results in an interesting conundrum where although we are producing the same isomorphic graph, PGO makes the second run cache miss. Let's see how....

Run 2, Invocation 1: We fetch the PGO, over-mark things as dynamic, get a fx graph, look it up in the cache and... whoops! cache miss! This is because of the aforementioned behavior where the PGO profile will cause us to over-allocate symbols. In practice this means we end up saving a graph in cache with symbols x:s1, y:s3 and on second attempt we cache miss with x:s1, y:s6 where symbols s3,s4,s5 were all optimistically marked dynamic by PGO and subsequently specialized.

We solve this problem by hashing the source names. This ensures somewhat stable assignment. To prevent catastrophic symbol collisions, we use linear probing to ensure no collisions.
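
A rough sketch of the idea, assuming a hypothetical `SymbolAllocator` class, SHA-256 as the hash, and an arbitrary slot count; the real implementation may differ in all of these.

```python
import hashlib
from typing import Dict


class SymbolAllocator:
    """Hypothetical allocator: derives a symbol id from the source name."""

    def __init__(self, num_slots: int = 1 << 61) -> None:
        self.num_slots = num_slots
        self.used: Dict[int, str] = {}  # slot -> source name that owns it

    def allocate(self, source_name: str) -> str:
        # The starting slot depends only on the source name, not on how many
        # symbols were allocated before it.
        digest = hashlib.sha256(source_name.encode()).digest()
        slot = int.from_bytes(digest[:8], "big") % self.num_slots
        # Linear probing: on a collision, step to the next slot until a free
        # one (or the slot already owned by this source) is found.
        for _ in range(self.num_slots):
            owner = self.used.get(slot)
            if owner is None or owner == source_name:
                self.used[slot] = source_name
                return f"s{slot}"
            slot = (slot + 1) % self.num_slots
        raise RuntimeError("symbol table is full")
```

With a counter-based scheme, extra symbols that PGO marks dynamic and that are later specialized shift the ids of everything allocated after them. Here the id of `y` is the same whether or not those extra symbols were allocated first, except in the rare case where probing is triggered, which is why the assignment is only "somewhat" stable.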

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149665
Approved by: https://github.com/Mingming-Ding, https://github.com/laithsakka
2025-03-28 05:36:32 +00:00
94d761fbf0 [AOTI][reland] Update test runner to use the new APIs (#149412)
Summary: Reland https://github.com/pytorch/pytorch/pull/147105. Switch to the newer aoti_compile_and_package APIs. Some tests still use the legacy APIs; internal test refactoring will follow.

Differential Revision: [D71470265](https://our.internmc.facebook.com/intern/diff/D71470265)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149412
Approved by: https://github.com/yushangdi
2025-03-19 17:56:44 +00:00
405025778d Revert "[AOTI] Update test runner to use the new APIs (#147105)"
This reverts commit 9a78513c3cb21a5f506135e2a56f967cf1fddc60.

Reverted https://github.com/pytorch/pytorch/pull/147105 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147105#issuecomment-2733656413))
2025-03-18 15:25:40 +00:00
9a78513c3c [AOTI] Update test runner to use the new APIs (#147105)
Summary: Switch to the newer aoti_compile_and_package APIs. Some tests still use the legacy APIs; internal test refactoring will follow.

Differential Revision: [D69609685](https://our.internmc.facebook.com/intern/diff/D69609685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147105
Approved by: https://github.com/jingsh
2025-03-18 00:27:09 +00:00
b160dda743 cpp_wrapper: reduce memory usage by removing unneeded temporaries (#147403)
This PR contains a set of interrelated changes, listed below, with the upshot that compiled model memory usage in `cpp_wrapper` mode is now roughly equivalent to the default inductor mode.

Changes:

1. Refactor `reinterpret_view` calls in `cpp_wrapper` to always return a temporary RAII tensor object, rather than saving off a "temporary" tensor handle that persisted through the end of the function. This matches the behavior of the base Python wrapper class, and is responsible for the majority of the memory usage reductions.
2. Eliminate nearly all other cases where a "temporary" tensor handle was saved off (with the exception of one or two places where the tensor would immediately be destroyed by going out-of-scope). This necessitated some ugly-looking code to handle `Optional[Tensor]` and `Optional[Sequence[Any]]`, since `Optional` is passed by pointer into the C-shim functions (making passing temporary objects difficult). This code is justified by the fact that it only appears in controlled circumstances that we auto-generate, so there are minimal user-facing footguns.
3. Delete the list containing the input tensors to the `cpp_wrapper` main function after casting them to `AtenTensorHandle` objects, which have an internal reference count keeping them alive.

The [TorchInductor benchmark](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Sat%2C%2015%20Feb%202025%2018%3A38%3A08%20GMT&stopTime=Sat%2C%2022%20Feb%202025%2018%3A38%3A08%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/73/head&lCommit=4d5edaf67e80ca9ca36d301af1ded13967a04790&rBranch=main&rCommit=e1bf892d9004a4dba0748d0eda5c3b4eced0ea70) I ran shows the increased memory compression.

Differential Revision: [D70648897](https://our.internmc.facebook.com/intern/diff/D70648897)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147403
Approved by: https://github.com/desertfire
2025-03-06 16:08:16 +00:00
577708e6de Unskipped multiple inductor tests for ROCm (#143581)
All of them should be fine to run now after the triton fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143581
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-16 20:46:06 +00:00
7d9f26de05 Revert "Unskipped multiple inductor tests for ROCm (#143581)"
This reverts commit e05d67790ee4a53c310322829631c000f0ac2985.

Reverted https://github.com/pytorch/pytorch/pull/143581 on behalf of https://github.com/huydhn due to There is some tests failing on ROCm jobs in trunk ([comment](https://github.com/pytorch/pytorch/pull/143581#issuecomment-2577163274))
2025-01-08 09:15:14 +00:00
e05d67790e Unskipped multiple inductor tests for ROCm (#143581)
All of them should be fine to run now after the triton fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143581
Approved by: https://github.com/jataylo, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-08 03:55:33 +00:00
44186a0a4e Move Sympy printers to torch/utils/_sympy/printers.py (#140597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-11-26 18:11:00 +00:00
4742080ed9 [AOTI XPU] Enable Cpp wraper for Intel GPU. (#135318)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135318
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire
2024-11-26 11:51:32 +00:00
f23621ec56 Revert "Move Sympy printers to torch/utils/_sympy/printers.py (#140597)"
This reverts commit c25b201583fc28243b87c460a2f18e2531a676e7.

Reverted https://github.com/pytorch/pytorch/pull/140597 on behalf of https://github.com/huydhn due to Trunk is sad again after this lands, this looks like a landrace this time, so please do a rebase ([comment](https://github.com/pytorch/pytorch/pull/140597#issuecomment-2494052978))
2024-11-22 15:43:39 +00:00
c25b201583 Move Sympy printers to torch/utils/_sympy/printers.py (#140597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-11-22 02:04:36 +00:00
cc6c248919 [Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 2) (#136856)
[Inductor UT] Generalize Newly introduced inductor UTs for intel GPU
reuse `test/inductor/test_inductor_freezing.py`
reuse `test/inductor/test_layout_optim.py`
reuse `test/inductor/test_loop_ordering.py`
reuse `test/inductor/test_memory_planning.py`
reuse `test/inductor/test_padding.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136856
Approved by: https://github.com/EikanWang, https://github.com/etaf, https://github.com/jansel
2024-10-18 03:58:00 +00:00
443472b1ca [AOTI] Remove explicit abi_compatible setting in tests (#138016)
Differential Revision: [D64439674](https://our.internmc.facebook.com/intern/diff/D64439674)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138016
Approved by: https://github.com/malfet
ghstack dependencies: #137982
2024-10-16 21:35:46 +00:00
6bc57549f9 [AOTI] Remove non-ABI-compatible tests (#137982)
Summary: Remove non-ABI-compatible mode tests since ABI-compatible has been turned on as default. Also clean up tests that explicitly set ABI-compatible to True.

Differential Revision: [D64439673](https://our.internmc.facebook.com/intern/diff/D64439673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137982
Approved by: https://github.com/malfet
2024-10-16 21:35:46 +00:00
71aac59e93 Add Triton CPU as an Inductor backend (#133408)
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.

Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet
2024-09-30 20:24:52 +00:00
36428f91e9 Revert "Add Triton CPU as an Inductor backend (#133408)"
This reverts commit 31c0467594c7c41c8e8ff1828bf01fa31fc4454f.

Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/int3 due to internal tests failing ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2379692517))
2024-09-27 16:54:27 +00:00
31c0467594 Add Triton CPU as an Inductor backend (#133408)
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.

Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet
2024-09-26 15:35:26 +00:00
7f98781f84 Fix autodeps from D62049222 that pyfmt broke (#136455)
Summary: `arc lint` changed the formatting which then caused autodeps to be confused.

Test Plan:
this passes:
```
arc lint --skip AUTODEPS
fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/test/inductor/test_memory_planning.py
```

Differential Revision: D63277059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136455
Approved by: https://github.com/bobrenjc93, https://github.com/oulgen
2024-09-24 05:06:12 +00:00
d0cebedb31 Revert "Add Triton CPU as an Inductor backend (#133408)"
This reverts commit e498b02b472e45cfd6b7a08db0d6c1babec655c5.

Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))
2024-09-16 18:33:33 +00:00
e498b02b47 Add Triton CPU as an Inductor backend (#133408)
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel
2024-09-14 21:45:19 +00:00
8c356ce3da Fix lint errors in fbcode (#135614)
Summary: Fixed a bunch of fbcode imports that happened to work but confused autodeps.  After this autodeps still suggests "improvements" to TARGETS (which breaks our builds) but at least it can find all the imports.

Test Plan:
```
fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/TARGETS fbcode/caffe2/test/TARGETS
```
Before:
```
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/testing.py:229) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fbur$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export.py:87) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_serdes.py:9) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fb$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_serdes.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_retraceability.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https:$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_retraceability.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See ht$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_nonstrict.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See http$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_nonstrict.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:8) when processing rule "test_export". Please make sure it's listed in the srcs parameter of an$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Found "//python/typeshed_internal:typeshed_internal_library" owner for "cv2" but it is protected by visibility rules: [] (from caffe2/test/test_bundled_images.py:7) when processing rule "test_bundled_$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "caffe2.test.profiler_test_cpp_thread_lib" (from caffe2/test/profiler/test_cpp_thread.py:29) when processing rule "profiler_test_cpp_thread". Please make sure it's listed in t$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_custom_ops.py:23) when processing rule "custom_ops". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_public_bindings.py:13) when processing rule "public_bindings". Please make sure it's listed in the srcs paramete$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.symbolize_tracebacks" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.gather_traceback" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another rule$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for include <torch/csrc/autograd/profiler_kineto.h> (from caffe2/test/profiler/test_cpp_thread.cpp:2) when processing profiler_test_cpp_thread_lib.  Some things to try:
```

Differential Revision: D62049222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135614
Approved by: https://github.com/oulgen, https://github.com/laithsakka
2024-09-13 02:04:34 +00:00
fbf3fc2a30 [inductor] Use int64_t as index type for all platfroms 4 (#133892)
This PR runs in parallel with https://github.com/pytorch/pytorch/pull/133819 and adds changes addressing @jansel's comments.
1. For `torch/_inductor/codegen/cpp_wrapper_cpu.py`, revert to the original code that appends LL on MacOS and Windows: bdc14ad89a
2. For `torch/_inductor/codegen/cpp_utils.py`, append LL on MacOS and Windows for large constants, and fix its UTs: 3a56b76ce0

------------------------------
Another solution for https://github.com/pytorch/pytorch/pull/133615: use `int64_t` as the index type for all platforms.

### Development notes:
The mentioned PR (https://github.com/pytorch/pytorch/pull/133615) fixes the index type not matching the `parse_arg` argument types. During review, @jansel suggested that we unify `INDEX_TYPE` across all platforms.
The current code is cumbersome:
```python
INDEX_TYPE = "int64_t" if _IS_WINDOWS else "long"
```

So I made some attempts to unify `INDEX_TYPE` as either `long` or `int64_t`.
Using `long` as the index type: https://github.com/pytorch/pytorch/pull/133768
Using `int64_t` as the index type: https://github.com/pytorch/pytorch/pull/133782

After that, we were still discussing which type to select as the final solution.
![image](https://github.com/user-attachments/assets/b23fa577-2d40-4bd6-b934-fb7994fe0bb0)

`long` has a different definition and size across OSes and compilers, so @jansel decided that we should use `int64_t` on all platforms. I will therefore continue my work based on https://github.com/pytorch/pytorch/pull/133782.
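
The size difference can be checked from Python with `ctypes`, which reports the widths used by the C ABI of the running interpreter. This is only an illustrative sketch, not part of the PR: on 64-bit Linux and MacOS `long` is typically 8 bytes, while on Windows it is 4 bytes, whereas `int64_t` is 8 bytes everywhere.

```python
import ctypes
import platform

# Width in bytes of each C type under the C ABI of this interpreter.
long_size = ctypes.sizeof(ctypes.c_long)          # platform dependent
longlong_size = ctypes.sizeof(ctypes.c_longlong)  # 8 on mainstream platforms
int64_size = ctypes.sizeof(ctypes.c_int64)        # always 8

print(
    f"{platform.system()}: long={long_size}, "
    f"long long={longlong_size}, int64_t={int64_size}"
)
```

Running this on different OSes shows why a fixed-width type is the safer choice for an index type shared by all platforms.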

https://github.com/pytorch/pytorch/pull/133782 still has two issues:
1. `std::min`/`std::max` could not match function instances by argument types. This was fixed and validated in PR https://github.com/pytorch/pytorch/pull/133812
2. The CUDA `TestMemoryPlanning::test_cpp_wrapper` failure caused by the wrong index type. It is fixed in this PR.

So this PR is the final solution.

### Changes:
**1. Use `int64_t` as the index type on all OSes: `Windows`, `Linux` and `MacOS`.**
**2. Use static_cast<int64_t>(`constant`) to convert constants passed to `div_floor_integer`, whose argument type is `int64_t`.**
**3. Update the `parse_arg` function signature to `int64_t`, following the index type.**
**4. Append a double L (`LL`) to constants on Windows and MacOS, because their `int64_t` is `long long`.**
**5. Fix the `std::min/std::max` type mismatch with a static_cast to `INDEX_TYPE`.**
**6. Fix UTs, including CUDA `TestMemoryPlanning::test_cpp_wrapper` and `test_indexing.py`.**
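
The literal-suffix and `static_cast` rules in the list above can be sketched as two small string helpers. Both function names (`int64_literal`, `min_call`) are hypothetical, and the `L` suffix for Linux is an assumption that the commit does not state; this is not the code in `cpp_utils.py`, only an illustration of the idea.

```python
import sys


def int64_literal(value: int, platform: str = sys.platform) -> str:
    """Hypothetical helper: render an integer as a C++ literal of int64_t width."""
    # On Windows and MacOS int64_t is `long long`, so the suffix is LL.
    # Assumption: elsewhere (e.g. Linux) int64_t is `long`, so the suffix is L.
    suffix = "LL" if platform in ("win32", "darwin") else "L"
    return f"{value}{suffix}"


def min_call(lhs: str, rhs: str, index_type: str = "int64_t") -> str:
    """Hypothetical helper: emit std::min with both arguments cast to one type."""
    # std::min is a template on a single type, so mixed argument types
    # fail to deduce; casting both sides removes the ambiguity.
    return (
        f"std::min(static_cast<{index_type}>({lhs}), "
        f"static_cast<{index_type}>({rhs}))"
    )


print(int64_literal(4096, "win32"))
print(min_call("x0", "16"))
```

Casting both operands keeps the generated expression independent of how the platform defines `int64_t`.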

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133892
Approved by: https://github.com/jansel
2024-08-20 16:54:12 +00:00
ceab3121de [inductor] Fix flaky tests in test_memory_planning.py (#131703)
Summary: Internally, the ABI-compatible mode is [enabled by default](eb54ca7abe/torch/_inductor/config.py (L53)). As a result, tests that assume non-ABI-compatible C++ codegen fail internally when the `abi_compatible: False` flag is not specified explicitly. Here we fix one such test in `test_memory_planning.py`.

Test Plan: Tested internally.

Differential Revision: D60197327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131703
Approved by: https://github.com/eellison
2024-07-25 05:09:08 +00:00
134bc4fc34 [BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763
Approved by: https://github.com/jansel
2024-07-18 07:49:19 +00:00
b732b52f1e Revert "[BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)"
This reverts commit aecc746fccc4495313167e3a7f94210daf457e1d.

Reverted https://github.com/pytorch/pytorch/pull/129763 on behalf of https://github.com/XuehaiPan due to need reland after rerunning lintrunner on main ([comment](https://github.com/pytorch/pytorch/pull/129763#issuecomment-2235736732))
2024-07-18 06:39:58 +00:00
aecc746fcc [BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763
Approved by: https://github.com/jansel
2024-07-18 05:13:41 +00:00
ded469cfbd [issue scrubbing] Fix imports in test_memory_planning.py to work with pytest (#130275)
Summary: I actually don't grok why this pattern works; I guess pytest expects a different import syntax for these relative imports?? But this pattern is used in many other tests here (notably `test_aot_inductor.py`), so it must be right ;)

Test Plan:
Ran both ways:
* `python test/inductor/test_memory_planning.py`
* `pytest test/inductor/test_memory_planning.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130275
Approved by: https://github.com/zou3519
2024-07-09 05:20:56 +00:00
e2610240f9 [ROCm] Enable several inductor UTs (#127761)
Needs https://github.com/pytorch/pytorch/pull/125396

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127761
Approved by: https://github.com/peterbell10, https://github.com/pruthvistony
2024-06-12 22:47:45 +00:00
a4064da8ca Always simplify sympy expressions before printing. (#127543)
This is important because if a replacement has happened during inductor lowering, we may have stale symbols in sympy expressions that we need to replace away.  Do this at the very end.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127543
Approved by: https://github.com/lezcano
2024-06-03 20:36:14 +00:00
82c66bc41a Make 'pytest test/inductor/test_memory_planning.py' work (#126397)
There's still another naughty direct test_* import, I'm out of patience
right now though.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126397
Approved by: https://github.com/peterbell10, https://github.com/int3
2024-05-16 20:28:20 +00:00
3c09c6b91a Fix memory planning compile error (#123867)
Summary:
We should be using CppPrinter in the cpp wrapper codegen, not the ExprPrinter (which prints expressions for Python)

Not really a memory-planning-specific bug, but exposed by mem planning because it tends to emit more complicated expressions
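
Why the choice of printer matters can be shown with a toy expression tree and two printers. Everything here is a hypothetical stand-in (tuple-based nodes, `print_python`, `print_cpp`), not Inductor's `ExprPrinter` or `CppPrinter`; the use of `div_floor_integer` as the C++ floor-division call is likewise an assumption for illustration. The point is that Python's `//` operator starts a comment in C++, so a Python-style printer yields code that does not compile.

```python
def print_python(expr):
    """Render a tuple-based expression tree as Python source."""
    kind = expr[0]
    if kind == "sym":
        return expr[1]
    if kind == "int":
        return str(expr[1])
    lhs, rhs = print_python(expr[1]), print_python(expr[2])
    if kind == "add":
        return f"({lhs} + {rhs})"
    if kind == "floordiv":
        return f"({lhs} // {rhs})"  # valid Python, but `//` is a comment in C++
    raise ValueError(f"unknown node kind: {kind}")


def print_cpp(expr):
    """Render the same tree as C++ source."""
    kind = expr[0]
    if kind == "sym":
        return expr[1]
    if kind == "int":
        return str(expr[1])
    lhs, rhs = print_cpp(expr[1]), print_cpp(expr[2])
    if kind == "add":
        return f"({lhs} + {rhs})"
    if kind == "floordiv":
        # Assumed helper call; C++ has no floor-division operator.
        return f"div_floor_integer({lhs}, {rhs})"
    raise ValueError(f"unknown node kind: {kind}")


# (s0 + 15) floor-divided by 16, the kind of expression memory planning emits.
expr = ("floordiv", ("add", ("sym", "s0"), ("int", 15)), ("int", 16))
print(print_python(expr))
print(print_cpp(expr))
```

Simple expressions such as a bare symbol print identically under both printers, which is consistent with the bug only surfacing once more complicated expressions were emitted.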

Differential Revision: D56025683

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123867
Approved by: https://github.com/hl475, https://github.com/chenyang78
2024-04-12 17:34:58 +00:00
9743e3a19c [Inductor Intel GPU backend Upstream] Add Inductor Intel GPU backend. (#121895)
Following the design in RFC https://github.com/pytorch/pytorch/issues/114856, this PR implements the Intel GPU Inductor backend by:
- Reusing WrapperCodegen and TritonScheduling for Python wrapper and kernel code generation, and implementing device-specific code generation in XPUDeviceOpOverrides.
- Reusing fx_pass, lowering, codecache, triton kernel auto-tuning, and compilation.

For testing, this PR provides test/inductor/test_xpu_basic.py to cover basic Inductor backend functionality.
We'll reuse all the existing Inductor test cases in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121895
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
2024-04-05 09:05:11 +00:00
4cd503c1f3 Enable FX graph cache for a batch of inductor tests (#121696)
Summary: Get more FX graph cache coverage by enabling it for these unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121696
Approved by: https://github.com/eellison
2024-03-14 03:39:59 +00:00
e868a7fedd [AOTI] Rename config.aot_inductor.abi_compatible (#119065)
Summary: Rename config.aot_inductor.abi_compatible to config.abi_compatible, since the cpp_wrapper mode in JIT Inductor will share the same flag.

Differential Revision: [D53478752](https://our.internmc.facebook.com/intern/diff/D53478752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119065
Approved by: https://github.com/khabinov
2024-02-07 00:14:33 +00:00
0d5f53a2f9 fix forward test_memory_planning.py (#119109)
Summary: fixes a broken test, also makes it run in fbcode correctly

Test Plan: test

Reviewed By: angelayi

Differential Revision: D53373709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119109
Approved by: https://github.com/angelayi
2024-02-04 21:45:07 +00:00
e332653eb3 [inductor] Use at::detail::empty_strided_* in cpp_wrapper mode (#118490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118490
Approved by: https://github.com/desertfire
2024-01-30 21:03:19 +00:00
2de24c11f6 [inductor] Slightly faster memory allocation on CUDA (#118255)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118255
Approved by: https://github.com/peterbell10
ghstack dependencies: #118065, #118070, #118171
2024-01-25 20:49:14 +00:00
b14d57ceda Replace constraints with dynamic_shapes in scripts/sijiac/prototypes and test/inductor (#117915)
Summary: `constraints` argument for `torch.export` has been deprecated in favor of the `dynamic_shapes` argument. This PR updates the use of the deprecated API in `scripts/sijiac/prototypes` and `test/inductor`.

Test Plan: buck test mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor

Differential Revision: D52931743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117915
Approved by: https://github.com/angelayi
2024-01-22 21:24:03 +00:00
70f3a530d7 [AOTI] Add pybind for AOTIModelContainerRunnerCpu and AOTIModelContainerRunnerCuda (#116269)
Summary: Now we can allocate an AOTIModelContainerRunner object instead of relying on torch.utils.cpp_extension.load_inline. Also renamed AOTInductorModelRunner to AOTIRunnerUtil in this PR.

Test Plan: CI

Reviewed By: khabinov

Differential Revision: D52339116

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116269
Approved by: https://github.com/khabinov
2024-01-04 18:58:24 +00:00
bd10fea79a [BE]: Enable F821 and fix bugs (#116579)
Fixes #112371

I tried to fix as many of the bugs as I could; for a few I could not figure out the proper fix, so I left them with noqas.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579
Approved by: https://github.com/ezyang
2024-01-01 08:40:46 +00:00
6bffde99b0 Revert "[inductor] Move things into torch/testing/_internal/inductor_utils.py (#113275)"
This reverts commit 66d09f82170c528698b5ec606ba7838268ae1f8a.

Reverted https://github.com/pytorch/pytorch/pull/113275 on behalf of https://github.com/huydhn due to Sorry for reverting your stack, but it is failing to list test internally with buck2 ([comment](https://github.com/pytorch/pytorch/pull/113275#issuecomment-1811666004))
2023-11-15 01:44:26 +00:00
1e60174891 Revert "[dynamo] Add run_inductor_tests entrypoint (#113278)"
This reverts commit b00311ce9e430cf1b98d2103e21ed2179450a424.

Reverted https://github.com/pytorch/pytorch/pull/113278 on behalf of https://github.com/huydhn due to Sorry for reverting your stack, but it is failing to list test internally with buck2 ([comment](https://github.com/pytorch/pytorch/pull/113278#issuecomment-1811646325))
2023-11-15 01:19:48 +00:00
b00311ce9e [dynamo] Add run_inductor_tests entrypoint (#113278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113278
Approved by: https://github.com/yanboliang
2023-11-11 08:54:43 +00:00