3255e7872b
Enable all flake8-logging-format rules ( #164655 )
...
These rules are enabled by removing existing suppressions.
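For context, a minimal sketch (mine, not from the PR) of the kind of pattern the flake8-logging-format G rules flag: string interpolation should be deferred to the logging call instead of being done eagerly in the message.
```python
import logging

logger = logging.getLogger(__name__)
tensor_name = "weight"

# Flagged by G002/G004-style checks: the message is formatted eagerly,
# even when this log level is disabled.
logger.info("missing tensor: %s" % tensor_name)
logger.info(f"missing tensor: {tensor_name}")

# Preferred form: the logging framework interpolates lazily.
logger.info("missing tensor: %s", tensor_name)
```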
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164655
Approved by: https://github.com/janeyx99 , https://github.com/mlazos
2025-10-19 00:59:28 +00:00
f02e3947f6
Expand type checking to mypy strict files ( #165697 )
...
Expands Pyrefly type checking to check the files outlined in the mypy-strict.ini configuration file:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165697
Approved by: https://github.com/ezyang
2025-10-18 04:34:45 +00:00
2928c5c572
Revert "Pyrefly suppressions 2 ( #165692 )"
...
This reverts commit 43d78423ac224cce432bf34ed9627035169d5433.
Reverted https://github.com/pytorch/pytorch/pull/165692 on behalf of https://github.com/seemethere due to This is causing merge conflicts when attempting to land internally, see D84890919 for more details ([comment](https://github.com/pytorch/pytorch/pull/165692#issuecomment-3416397240 ))
2025-10-17 17:13:04 +00:00
43d78423ac
Pyrefly suppressions 2 ( #165692 )
...
This is the last directory to opt in for the regular mypy.ini file. Will put up a diff to remove unused ignores before making sure we're also type checking all the files in the mypy strict configurations
Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check
step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199
after:
INFO 0 errors (6,884 ignored)
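As a rough illustration of step 3 above, the sketch below shows what an inline pyrefly suppression can look like; the error-code name is illustrative only and not taken from this PR.
```python
# Hypothetical example of a suppression added during step 3; the error-code
# name is illustrative, not an actual diagnostic from this PR.
def _load_config(raw: object) -> dict[str, int]:
    # pyrefly: ignore  # bad-return-type
    return raw
```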
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165692
Approved by: https://github.com/oulgen
2025-10-17 04:15:25 +00:00
5641de7b6b
Add suppressions for _inductor/codegen ( #165659 )
...
Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283
Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check
step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199
after:
INFO 0 errors (6,884 ignored)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165659
Approved by: https://github.com/oulgen
2025-10-16 21:37:37 +00:00
fbe0d20a17
[2/N] More ruff SIM fixes ( #165031 )
...
This is a follow-up of #164695 to apply ruff SIM rules to more files. Most changes are about simplifying dict.get because None is already the default value.
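For illustration, the dict.get simplification described above looks like this (example is mine, in the spirit of ruff's SIM910-style check):
```python
config: dict[str, int] = {"max_autotune": 1}

# Before: None is passed explicitly even though it is already the default.
value = config.get("max_autotune", None)

# After: equivalent behavior, which is what the SIM fix rewrites it to.
value = config.get("max_autotune")
```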
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165031
Approved by: https://github.com/mlazos
2025-10-14 14:22:54 +00:00
ac529df244
Native matmul ( #157743 )
...
### Implementation of #151705
This PR introduces the initial implementation of native `tl.dot` support in Inductor, with the goal of generating Triton matmul kernels directly—without relying on predefined templates.
To avoid complexity and ease the review process, I plan to split this work into two phases as outlined in #151705 :
1. **Basic support** (this PR)
2. **Lazy broadcasting** for optimal performance (future PR)
### Summary of This PR
This PR implements the basic functionality. It does **not** include lazy broadcasting, so the generated kernels may involve explicit `tl.reshape` and `tl.trans` operations before calling `tl.dot`, which introduces some overhead.
### Notable Changes
1. Adds a new config flag: `config.triton.enable_native_matmul`
2. Introduces a new `ops.dot` IR node in Inductor and lowers `aten.mm` and `aten.bmm` to it when native matmul is enabled
3. Enforces tiling suitable for matmul when the native matmul flag is enabled
4. Implements code generation for `ops.dot`
5. Adds Triton autotuning heuristics: for now, I’ve copied the configuration from the existing matmul templates. However, this may not be optimal—it currently takes a long time to tune, and I think there must be a better way to tackle this.
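A rough usage sketch based on the flag named in item 1; treat the exact attribute path and the lowering behavior as assumptions drawn from this description rather than a verified recipe.
```python
import torch
import torch._inductor.config as inductor_config

# Assumed from the PR description: opt in to native tl.dot codegen
# instead of the predefined matmul templates.
inductor_config.triton.enable_native_matmul = True

@torch.compile
def mm(a, b):
    # aten.mm / aten.bmm are lowered to the new ops.dot IR node
    # when the flag is enabled.
    return a @ b

a = torch.randn(128, 64, device="cuda", dtype=torch.float16)
b = torch.randn(64, 256, device="cuda", dtype=torch.float16)
out = mm(a, b)
```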
@eellison @jansel @PaulZhang12 @shunting314
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157743
Approved by: https://github.com/jansel
2025-10-14 04:22:30 +00:00
b8be796a57
Revert "[2/N] More ruff SIM fixes ( #165031 )"
...
This reverts commit 38095fbd1323ee4a9541fbcbb9b28bd20f2cd956.
Reverted https://github.com/pytorch/pytorch/pull/165031 on behalf of https://github.com/albanD due to One of the changed line started to fail on trunk ([comment](https://github.com/pytorch/pytorch/pull/165031#issuecomment-3390190870 ))
2025-10-10 13:42:14 +00:00
38095fbd13
[2/N] More ruff SIM fixes ( #165031 )
...
This is a follow-up of #164695 to apply ruff SIM rules to more files. Most changes are about simplifying dict.get because None is already the default value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165031
Approved by: https://github.com/mlazos
2025-10-10 05:37:46 +00:00
600267ea56
Add num_store to inductor_meta and use it to scale persistent reduction x block ( #162446 )
...
Scale up XBLOCK for contiguous persistent reductions based on rnumel and number of loads + stores
<img width="928" height="656" alt="Screenshot 2025-09-18 at 5 02 57 PM" src="https://github.com/user-attachments/assets/ec3c561f-2a3f-4459-9e14-653715898da3 " />
Differential Revision: [](https://our.internmc.facebook.com/intern/diff/ )
Differential Revision: [](https://our.internmc.facebook.com/intern/diff/ )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162446
Approved by: https://github.com/v0i0 , https://github.com/eellison , https://github.com/shunting314
ghstack dependencies: #162296
2025-10-06 14:29:07 +00:00
5d7360bb03
Revert "Enable all SIM rules except disabled ones ( #164645 )"
...
This reverts commit 321e6026925f6b6e8a36e3a8b7c0295cd7541911.
Reverted https://github.com/pytorch/pytorch/pull/164645 on behalf of https://github.com/izaitsevfb due to causes lint failures ([comment](https://github.com/pytorch/pytorch/pull/164645#issuecomment-3369274351 ))
2025-10-05 19:32:21 +00:00
321e602692
Enable all SIM rules except disabled ones ( #164645 )
...
`SIM` rules are useful for simplifying boolean expressions and enhancing code readability.
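For readers unfamiliar with the rule family, a small sketch (mine) of the kind of boolean simplification SIM checks suggest, in the spirit of SIM103:
```python
# Before: pattern flagged by SIM103-style checks.
def is_positive(x: int) -> bool:
    if x > 0:
        return True
    else:
        return False

# After: return the condition directly.
def is_positive_simplified(x: int) -> bool:
    return x > 0
```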
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164645
Approved by: https://github.com/ezyang
2025-10-05 07:38:25 +00:00
8c590cab9d
[inductor] add a runtime assert for triton shapes ( #164242 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164242
Approved by: https://github.com/eellison , https://github.com/mlazos
ghstack dependencies: #164241
2025-10-01 18:55:33 +00:00
20edc5b26a
Revert "Add num_store to inductor_meta and use it to scale persistent reduction x block ( #162446 )"
...
This reverts commit 22c5e8c17c7551c9dd2855589ae774c1e147343a.
Reverted https://github.com/pytorch/pytorch/pull/162446 on behalf of https://github.com/PaulZhang12 due to perf regression in https://github.com/pytorch/pytorch/issues/164301#issuecomment-3354028620 ([comment](https://github.com/pytorch/pytorch/pull/162446#issuecomment-3357164274 ))
2025-10-01 16:23:03 +00:00
8c98aee436
[Inductor] Update DeviceAssert op to behave like store ( #163696 )
...
Updated the DeviceAssert operation to match the behavior of Store; this fixes the issue mentioned in [this PR](https://github.com/pytorch/pytorch/pull/163023 ) and updates the test cases as Elias [suggested](https://github.com/pytorch/pytorch/pull/160677#discussion_r2353834646 ).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163696
Approved by: https://github.com/mlazos
2025-09-24 23:35:56 +00:00
3e1b1a30f2
Revert "[inductor] Fix issue with scalar arg handling" ( #163737 )
...
This reverts commit a8cd437183142e17ba6fc8d7b5e9dcee462d7904.
See https://github.com/pytorch/pytorch/pull/163481#issuecomment-3326310774
This PR might also cause issues with cudagraphs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163737
Approved by: https://github.com/ezyang
ghstack dependencies: #163386 , #163398 , #163387 , #163414 , #163415 , #163419 , #163434 , #163393 , #163412 , #163422 , #163481 , #163520 , #163482
2025-09-24 07:33:12 +00:00
ca512af3e7
[inductor] Fix issue with scalar arg handling ( #163481 )
...
Fixes #163420
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163481
Approved by: https://github.com/eellison
ghstack dependencies: #163386 , #163398 , #163387 , #163414 , #163415 , #163419 , #163434 , #163393 , #163412 , #163422
2025-09-24 02:52:36 +00:00
22c5e8c17c
Add num_store to inductor_meta and use it to scale persistent reduction x block ( #162446 )
...
Scale up XBLOCK for contiguous persistent reductions based on rnumel and number of loads + stores
<img width="928" height="656" alt="Screenshot 2025-09-18 at 5 02 57 PM" src="https://github.com/user-attachments/assets/ec3c561f-2a3f-4459-9e14-653715898da3 " />
Differential Revision: [](https://our.internmc.facebook.com/intern/diff/ )
Differential Revision: [](https://our.internmc.facebook.com/intern/diff/ )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162446
Approved by: https://github.com/v0i0 , https://github.com/eellison , https://github.com/shunting314
ghstack dependencies: #162296
2025-09-23 20:36:39 +00:00
25f1a5d8d1
[inductor][ez] add src_hash property for Templates ( #161468 )
...
# why
enable caching/overriding/filtering based on src hash later
# what
- KernelTemplate has a src_hash that is None by default
- TritonTemplate's src_hash is a sha256 of the template src code
- ExternKernelChoice's src_hash stays None to keep the same API
# testing
n/a (not in use in this change)
Differential Revision: [D81821149](https://our.internmc.facebook.com/intern/diff/D81821149 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161468
Approved by: https://github.com/eellison
ghstack dependencies: #161351 , #161350 , #162293
2025-09-12 21:10:45 +00:00
9aedb3cd87
[AOTI-FX] Support registering custom FX backends ( #162317 )
...
# Feature
Currently, `torch._inductor.compile_aot` always uses the `WrapperFxCodegen` class. In contrast, Python and C++ codegen allow users to register custom backends. This PR brings that feature to FX codegen.
# Test plan
Added a CI test registering a custom FX backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162317
Approved by: https://github.com/jansel
2025-09-06 07:32:03 +00:00
771f369448
[Inductor] Improve RoPE ( #161420 )
...
This PR fuses ROPE from 2 kernels into 1 kernel.
Shape:
```
q: [B, Hq, S, D]
k: [B, Hkv, S, D]
```
`Hq=32, Hkv=8, D=128` following Llama3 setting.
<img width="980" height="624" alt="image" src="https://github.com/user-attachments/assets/652a8227-6f1d-465c-97fd-2b0af41f8ed9 " />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161420
Approved by: https://github.com/shunting314
2025-09-05 20:55:20 +00:00
d63ad53a99
[inductor][ez] return choicecallers directly ( #161345 )
...
# why
- remove repeated patterns
- we have everything to make the choicecallers:
  - templates
  - input_nodes
  - layouts
  - all the kwargs
# what
- yield a choicecaller directly from V.choices.get_mm_configs
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520577](https://our.internmc.facebook.com/intern/diff/D81520577 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161345
Approved by: https://github.com/jansel
ghstack dependencies: #162075 , #161340 , #161341 , #161342 , #161343 , #161344
2025-09-05 18:02:38 +00:00
4902c76c65
[inductor][ez] add template/externchoice uid ( #161341 )
...
# why
- to have a central registry of templates/externkernelchoice; to match them to heuristics etc., they need unique names
- mm is both the triton template name and the aten_mm name
# what
- add a uid() to KernelTemplate/ExternKernelChoice that returns name
- override in ExternKernel to prepend "aten::"
- override in TritonTemplate to prepend "triton::"
This id is just used to find template heuristics, so it has no other impact.
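A standalone sketch of the naming scheme described under "what"; these are toy classes written for this note, not the actual Inductor types.
```python
class KernelTemplateSketch:
    """Toy stand-in for a template/extern choice that exposes a uid."""

    def __init__(self, name: str) -> None:
        self.name = name

    def uid(self) -> str:
        return self.name


class TritonTemplateSketch(KernelTemplateSketch):
    def uid(self) -> str:
        return f"triton::{self.name}"


class ExternKernelChoiceSketch(KernelTemplateSketch):
    def uid(self) -> str:
        return f"aten::{self.name}"


# "mm" is both a triton template name and the aten op name; the prefixes
# keep the registry keys unique.
assert TritonTemplateSketch("mm").uid() == "triton::mm"
assert ExternKernelChoiceSketch("mm").uid() == "aten::mm"
```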
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520579](https://our.internmc.facebook.com/intern/diff/D81520579 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161341
Approved by: https://github.com/jansel , https://github.com/eellison
ghstack dependencies: #162075 , #161340
2025-09-05 18:01:58 +00:00
f305019377
[inductor] propagate shapes in CSEVariable ( #152198 )
...
Fixes #149905
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152198
Approved by: https://github.com/eellison
2025-08-19 16:46:38 +00:00
bab79824cb
[aoti-fx] Initial AOTInductor FX ( #160765 )
...
Using the existing WrapperFxCodegen backend, this PR prototypes an AOT version of it which will directly return a graph module.
How to use:
```python
exported_gm = torch.export.export(model, inp, dynamic_shapes=dynamic_shapes).module()
compiled_gm = torch._inductor.aot_compile(
exported_gm, inp, options={"fx_wrapper": True, "compile_threads": 1}
)
assert torch.allclose(model(*inp), compiled_gm(*inp))
```
The motivation behind this is that backends like ExecuTorch/MTIA would like to use inductor's optimization technologies, but might have their own graph lowering pipelines, so they might not want to use AOTI (which generates a .so).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160765
Approved by: https://github.com/jansel
2025-08-18 18:14:08 +00:00
62bac07981
[inductor][triton] support profile_scratch launcher arg ( #159772 )
...
This adds support for Triton after https://github.com/triton-lang/triton/pull/7258 landed. https://github.com/triton-lang/triton/pull/7258 adds a new argument to all the Triton kernels - a profile_scratch argument, similar to global_scratch. This PR updates the static cuda launcher and the AOTI kernel callers to pass in these arguments when calling the Triton kernel.
Tests: https://github.com/pytorch/pytorch/pull/159158 . I also verified these tests locally with triton 3.2, 3.3, and 3.4.
Fixes:
* static_cuda_launcher (test/repro: `python tools/dynamo/verify_dynamo.py`)
* AOTI calling logic (test/repro: `TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_linalg_vander_cuda_float32`)
Differential Revision: [D79825121](https://our.internmc.facebook.com/intern/diff/D79825121 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159772
Approved by: https://github.com/NikhilAPatel , https://github.com/eellison
2025-08-08 14:27:38 +00:00
e167c7d0f3
[inductor] allocate non-blocking copy destinations in pinned memory ( #155121 ) ( #158758 )
...
Fixes #155121
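For background on why the allocation site matters, a small sketch (mine, not from the PR): non_blocking device-to-host copies only run asynchronously when the CPU destination is pinned; otherwise they silently fall back to blocking copies.
```python
import torch

src = torch.randn(1 << 20, device="cuda")

# Pageable destination: copy_(..., non_blocking=True) degrades to a synchronous copy.
pageable_dst = torch.empty(src.shape, dtype=src.dtype)

# Pinned destination: the copy can overlap with other work on the stream.
pinned_dst = torch.empty(src.shape, dtype=src.dtype, pin_memory=True)
pinned_dst.copy_(src, non_blocking=True)

torch.cuda.synchronize()  # wait for the async copy before reading pinned_dst
```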
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158758
Approved by: https://github.com/EikanWang , https://github.com/eellison
2025-08-07 17:07:26 +00:00
83ba3f1101
Revert "[inductor] allocate non-blocking copy destinations in pinned memory ( #155121 ) ( #158758 )"
...
This reverts commit 6085bf7565fec0d2ed26e8590001f09c05adbbe4.
Reverted https://github.com/pytorch/pytorch/pull/158758 on behalf of https://github.com/davidberard98 due to I need to revert #158462 (it causes device-side asserts), and this PR causes a merge conflict in the test file. Sorry about that! ([comment](https://github.com/pytorch/pytorch/pull/158758#issuecomment-3152490371 ))
2025-08-04 21:47:11 +00:00
6085bf7565
[inductor] allocate non-blocking copy destinations in pinned memory ( #155121 ) ( #158758 )
...
Fixes #155121
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158758
Approved by: https://github.com/EikanWang , https://github.com/eellison
2025-08-04 21:22:11 +00:00
c55e72bea1
[Re-land][Inductor] Support native Inductor as backend for MTIA ( #159211 )
...
The previous [diff/PR](https://github.com/pytorch/pytorch/pull/158526 ) was reverted due to this docstring lint error (screenshot omitted).
I didn't add the docstring because I thought I wasn't supposed to add a docstring for an EXISTING function.
So this diff/PR is an exact copy of the previous one, except for adding the docstring.
-------------
This diff/PR includes the changes to support native Inductor integration for MTIA. The goal is to support `torch.compile(backend="inductor")` for MTIA. Inductor should generate code (triton kernel + python wrapper code) similar to CUDA, and the triton kernels can be launched eagerly.
The changes include:
- Add MTIA device interfaces used by Dynamo and Inductor, including APIs on device, stream, event, etc.
- Add required torch.mtia APIs, like is_bf16_supported, memory_allocated, set_stream_by_id, etc.
- MTIA specific codegen logic, for example, loading MTIA dynamic_library.
- Other necessary changes to integrate with Inductor codegen, following other devices like CUDA, XPU.
- Integrate with the [empty_strided_mtia](https://www.internalfb.com/code/fbsource/[0d017d3a4a1bdff7253f9c66a9f38e77bd62166b]/fbcode/caffe2/aten/src/ATen/native/mtia/EmptyTensor.cpp?lines=49%2C63%2C71%2C74%2C78 ) API that we’ve added for the new MTIA ATen backend.
- A change in Inductor runtime to avoid re-initializing MTIADriver.
- BUCK changes to include ATen-mtia in Inductor, and to use -USE_MTIA preprocessor flag.
- Update `test_mnist_e2e.py` to cover native Inductor as backend, using the `--use_native_inductor` flag.
- Add a personal script (`scripts/anwang/run_native_inductor_script.py`) for testing purposes.
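A minimal usage sketch of the stated goal; whether the "mtia" device string is available depends on the build, so treat it as an assumption here.
```python
import torch

device = "mtia"  # assumes a build with MTIA support

model = torch.nn.Linear(16, 16).to(device)
compiled = torch.compile(model, backend="inductor")
out = compiled(torch.randn(4, 16, device=device))
```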
Note:
- This approach (option 3) aims to provide a PyTorch-native approach to Inductor integration for MTIA, minimizing the onboarding overhead. The downside is that it doesn't leverage MTIA-specific graph optimizations and is limited by eager launch overhead.
- MTIA will support another approach (option 2) to provide the best performance, based on WrapperFxCodegen. We should be able to reuse the fundamental changes of this diff for option 2, like the device interfaces, stream/event APIs, etc., especially as WrapperFxCodegen inherits PythonWrapperCodegen.
Internal:
References:
- [post for context](https://fb.workplace.com/groups/mtiasw/permalink/1718377262384606/ )
- [Inductor integration discussion(option 1/2/3)](https://docs.google.com/document/d/1p6363OXtVIRv1hPoaKlRSK3j-iir3QIbDd5bjyqCNig/edit?tab=t.0#heading=h.7s4ns6wcnhmb )
- [Project design doc(option 3)](https://docs.google.com/document/d/1jXUmhgoV9WvkMf-bcY3Od_kK9K_RDOdgHdt1LoQ5Tc4/edit?tab=t.0#heading=h.y43gwdqlv46w )
- [early prototying diff](https://www.internalfb.com/diff/D75110196 )
- [MPS integration PR](https://github.com/pytorch/pytorch/pull/153959 )
- [empty_strided_xpu PR](https://github.com/pytorch/pytorch/pull/126678 )
Differential Revision: [D79040806](https://our.internmc.facebook.com/intern/diff/D79040806/ )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159211
Approved by: https://github.com/eellison , https://github.com/blaine-rister , https://github.com/jansel
2025-07-29 17:03:24 +00:00
fe0ff12dab
Revert "[Inductor] Support native Inductor as backend for MTIA ( #158526 )"
...
This reverts commit cd68559d0451185f8521912c23e77b83d76b87cf.
Reverted https://github.com/pytorch/pytorch/pull/158526 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158526#issuecomment-3122186057 ))
2025-07-26 17:58:00 +00:00
cd68559d04
[Inductor] Support native Inductor as backend for MTIA ( #158526 )
...
This diff/PR includes the changes to support native Inductor integration for MTIA. The goal is to support `torch.compile(backend="inductor")` for MTIA. Inductor should generate code (triton kernel + python wrapper code) similar to CUDA, and the triton kernels can be launched eagerly.
The changes include:
- Add MTIA device interfaces used by Dynamo and Inductor, including APIs on device, stream, event, etc.
- Add required torch.mtia APIs, like is_bf16_supported, memory_allocated, set_stream_by_id, etc.
- MTIA specific codegen logic, for example, loading MTIA dynamic_library.
- Other necessary changes to integrate with Inductor codegen, following other devices like CUDA, XPU.
- Integrate with the [empty_strided_mtia](https://www.internalfb.com/code/fbsource/[0d017d3a4a1bdff7253f9c66a9f38e77bd62166b]/fbcode/caffe2/aten/src/ATen/native/mtia/EmptyTensor.cpp?lines=49%2C63%2C71%2C74%2C78 ) API that we’ve added for the new MTIA ATen backend.
- A change in Inductor runtime to avoid re-initializing MTIADriver.
- BUCK changes to include ATen-mtia in Inductor, and to use -USE_MTIA preprocessor flag.
- Update `test_mnist_e2e.py` to cover native Inductor as backend, using the `--use_native_inductor` flag.
- Add a personal script (`scripts/anwang/run_native_inductor_script.py`) for testing purposes.
Note:
- This approach (option 3) aims to provide a PyTorch-native approach to Inductor integration for MTIA, minimizing the onboarding overhead. The downside is that it doesn't leverage MTIA-specific graph optimizations and is limited by eager launch overhead.
- MTIA will support another approach (option 2) to provide the best performance, based on WrapperFxCodegen. We should be able to reuse the fundamental changes of this diff for option 2, like the device interfaces, stream/event APIs, etc., especially as WrapperFxCodegen inherits PythonWrapperCodegen.
Internal:
References:
- [post for context](https://fb.workplace.com/groups/mtiasw/permalink/1718377262384606/ )
- [Inductor integration discussion(option 1/2/3)](https://docs.google.com/document/d/1p6363OXtVIRv1hPoaKlRSK3j-iir3QIbDd5bjyqCNig/edit?tab=t.0#heading=h.7s4ns6wcnhmb )
- [Project design doc(option 3)](https://docs.google.com/document/d/1jXUmhgoV9WvkMf-bcY3Od_kK9K_RDOdgHdt1LoQ5Tc4/edit?tab=t.0#heading=h.y43gwdqlv46w )
- [early prototying diff](https://www.internalfb.com/diff/D75110196 )
- [MPS integration PR](https://github.com/pytorch/pytorch/pull/153959 )
- [empty_strided_xpu PR](https://github.com/pytorch/pytorch/pull/126678 )
Differential Revision: [D78458745](https://our.internmc.facebook.com/intern/diff/D78458745/ )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158526
Approved by: https://github.com/blaine-rister , https://github.com/jansel , https://github.com/eellison
2025-07-26 08:16:34 +00:00
d3d9bc1c31
[inductor] Allow backends to register their own custom config object ( #158254 )
...
An out of tree backend can have its own configuration options that the user can enable to control inductor compilation. These config options need to be taken into account when calculating the key that is used to determine cache miss / hits. This PR allows out of tree backends to specify a custom config module that has the same type as `torch._inductor.config` that can be used to control codegen (in addition to the default config), and will be used when creating the cache key.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158254
Approved by: https://github.com/eellison
2025-07-23 15:56:06 +00:00
b6c00dfe24
[user triton] AOT inductor support for device-side TMA ( #155896 )
...
Tests: `python test/inductor/test_aot_inductor.py -vvv -k device_tma`
Device-side TMA in Triton allows the kernel author to construct the TMA descriptor on the device (which composes with things like autotuning much better). However, it also requires a scratch space to be provided into which the TMA descriptor will be constructed. In the new TMA API (tl.make_tensor_descriptor), this is implemented using a "global scratch space" - a tensor which is allocated beforehand and then passed in as an argument for the kernel.
To support this in AOTI, this PR:
* records the global scratch space needed (triton_heuristics.py), so that it can be used during AOTI codegen
* allocates global scratch, if needed (cuda/device_op_overrides.py)
* plumbs `device_idx_` into the triton caller function, so that global scratch can be allocated on the right device
* updates tests to verify this works for dynamically shaped inputs
This PR should support both inductor-generated device-side TMA (e.g. persistent TMA mm) and user-defined triton kernels that contain device-side TMA (which is the test I ran to verify this works)
Note: this overrides any user-provided allocator function (typically with eager triton code, the user must provide their own custom allocator function that is used to allocate scratch space).
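To make the eager-mode requirement in that note concrete, a hedged sketch of the allocator hook plus a device-built descriptor; this assumes a Triton build with `tl.make_tensor_descriptor` and `triton.set_allocator` (around 3.4), and exact signatures may differ across versions.
```python
import torch
import triton
import triton.language as tl

def _tma_scratch_alloc(size: int, alignment: int, stream):
    # Scratch buffer into which device-built TMA descriptors are written.
    # With this PR, AOTI allocates this scratch itself instead of relying
    # on a user-provided allocator.
    return torch.empty(size, dtype=torch.int8, device="cuda")

triton.set_allocator(_tma_scratch_alloc)

@triton.jit
def copy_rows(x_ptr, y_ptr, M, N, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    pid_m = tl.program_id(0)
    desc_x = tl.make_tensor_descriptor(x_ptr, shape=[M, N], strides=[N, 1],
                                       block_shape=[BLOCK_M, BLOCK_N])
    desc_y = tl.make_tensor_descriptor(y_ptr, shape=[M, N], strides=[N, 1],
                                       block_shape=[BLOCK_M, BLOCK_N])
    tile = desc_x.load([pid_m * BLOCK_M, 0])
    desc_y.store([pid_m * BLOCK_M, 0], tile)
```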
For Meta reviewers, here is a tlparse from running `python test/inductor/test_aot_inductor.py -vvv -k test_triton_kernel_on_device_tma_dynamic_True_tma_version_new_cuda` https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpFg13g1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
Differential Revision: [D77352139](https://our.internmc.facebook.com/intern/diff/D77352139 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155896
Approved by: https://github.com/desertfire
2025-06-27 04:28:04 +00:00
6ff6630375
[BE][3/16] fix typos in torch/ (torch/_inductor/) ( #156313 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313
Approved by: https://github.com/jingsh
2025-06-23 02:57:12 +00:00
f1331f3f1b
Revert "[BE][3/16] fix typos in torch/ (torch/_inductor/) ( #156313 )"
...
This reverts commit 3627270bdf17b0fb6f528ca1cb87d6f2ec32680a.
Reverted https://github.com/pytorch/pytorch/pull/156313 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912 ) [HUD commit link](c95f7fa874 ) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213 ))
2025-06-22 12:31:57 +00:00
3627270bdf
[BE][3/16] fix typos in torch/ (torch/_inductor/) ( #156313 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313
Approved by: https://github.com/jingsh
2025-06-22 08:43:09 +00:00
bb1f3d1a55
[MPSInductor] Improve `_default` dtype inference ( #156121 )
...
By just adding 'mps' as one of the backend options and fixing the reduction op to actually return a tuple of CSEVariables rather than a tuple of strings
Test plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156121
Approved by: https://github.com/dcci
2025-06-16 23:11:53 +00:00
bc9b8ea230
[user triton] JIT inductor support for new host-side TMA api ( #155814 )
...
This PR adds JIT inductor support for user-defined triton kernels using the new host-side TMA api.
* handle TensorDescriptor.from_tensor in ir.py
* codegen TensorDescriptor.from_tensor in wrapper.py
* generate the right signature for functions that take TensorDescriptor arguments (i.e. in the @triton_heuristics.user_autotune decorator)
AOTI support is not implemented yet.
Tests: ran test_triton_kernels.py w/ both Triton 3.3 and 3.4 and there were no failures.
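A rough sketch of the user-kernel pattern this adds JIT support for; the `TensorDescriptor` import path and the block-shape constraints are assumptions about the new host-side TMA API (Triton ~3.4), not details taken from the PR.
```python
import torch
import triton
import triton.language as tl
from triton.tools.tensor_descriptor import TensorDescriptor  # assumed import path

@triton.jit
def scale_rows(desc_in, desc_out, scale, BLOCK_M: tl.constexpr):
    pid = tl.program_id(0)
    tile = desc_in.load([pid * BLOCK_M, 0])
    desc_out.store([pid * BLOCK_M, 0], tile * scale)

x = torch.randn(256, 128, device="cuda", dtype=torch.float16)
y = torch.empty_like(x)
BLOCK_M = 64

# Host-side TMA: descriptors are constructed on the CPU and passed as kernel args.
desc_in = TensorDescriptor.from_tensor(x, block_shape=[BLOCK_M, x.shape[1]])
desc_out = TensorDescriptor.from_tensor(y, block_shape=[BLOCK_M, x.shape[1]])
scale_rows[(x.shape[0] // BLOCK_M,)](desc_in, desc_out, 2.0, BLOCK_M=BLOCK_M)
```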
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155814
Approved by: https://github.com/aakhundov
ghstack dependencies: #155777
2025-06-15 20:24:19 +00:00
ce79056471
Custom FX pass for inductor's backend registration ( #154841 )
...
This PR is related to RFC #153532 . It is an extension to Inductor's backend registration interface that allows the backend to register custom FX passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154841
Approved by: https://github.com/jansel
Co-authored-by: Jason Ansel <jansel@jansel.net >
2025-06-14 17:29:54 +00:00
d1947a8707
Migrate from lru_cache to cache ( #155613 )
...
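For context, the mechanical shape of this migration; `functools.cache` is equivalent to `lru_cache(maxsize=None)` and has been available since Python 3.9.
```python
import functools

# Before
@functools.lru_cache(maxsize=None)
def _legacy_lookup(name: str) -> str:
    return name.upper()

# After: same unbounded memoization, less ceremony.
@functools.cache
def _lookup(name: str) -> str:
    return name.upper()
```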
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155613
Approved by: https://github.com/ezyang
ghstack dependencies: #155612
2025-06-11 19:44:18 +00:00
79bdafe5b6
Revert "Custom FX pass for inductor's backend registration ( #154841 )"
...
This reverts commit e694280d1215caf70f41575f2611bfa26c69ebdb.
Reverted https://github.com/pytorch/pytorch/pull/154841 on behalf of https://github.com/clee2000 due to failing some tests internally D76135706 ([comment](https://github.com/pytorch/pytorch/pull/154841#issuecomment-2956357711 ))
2025-06-09 16:56:45 +00:00
e694280d12
Custom FX pass for inductor's backend registration ( #154841 )
...
This PR is related to RFC #153532 . It is an extension to Inductor's backend registration interface that allows the backend to register custom FX passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154841
Approved by: https://github.com/jansel
Co-authored-by: Jason Ansel <jansel@jansel.net >
2025-06-06 06:49:44 +00:00
0827464002
Replace runtime type parameterization ( #155221 )
...
See:
```
>>> import timeit; print(f"OrderedSet[str](): {timeit.timeit('OrderedSet[str]()', setup='from torch.utils._ordered_set import OrderedSet', number=1000000):.6f}s, OrderedSet(): {timeit.timeit('OrderedSet()', setup='from torch.utils._ordered_set import OrderedSet', number=1000000):.6f}s")
```
> `OrderedSet[str]()`: 0.354622s, `OrderedSet()`: 0.095376s
Type parameterization should be in the type hint, not at runtime.
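A small sketch of what this change means at call sites, using the import path from the snippet above:
```python
from torch.utils._ordered_set import OrderedSet

# Before: subscripting the class at runtime builds a typing alias on every call.
names_slow = OrderedSet[str]()

# After: keep the parameterization in the annotation; the constructor stays cheap.
names: OrderedSet[str] = OrderedSet()
```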
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155221
Approved by: https://github.com/Skylion007 , https://github.com/jansel
2025-06-05 21:43:54 +00:00
26471fc203
[aoti] Initial Metal support ( #153959 )
...
An example generated file: P1816629015
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153959
Approved by: https://github.com/malfet , https://github.com/desertfire
ghstack dependencies: #153964
2025-05-23 05:45:35 +00:00
47a01f3efb
Revert "[aoti] Initial Metal support ( #153959 )"
...
This reverts commit 28bcd9eb30336b370298dbe9677b95019882f2a8.
Reverted https://github.com/pytorch/pytorch/pull/153959 on behalf of https://github.com/angelayi due to previous PR broke frl build ([comment](https://github.com/pytorch/pytorch/pull/153959#issuecomment-2901825315 ))
2025-05-22 16:17:07 +00:00
28bcd9eb30
[aoti] Initial Metal support ( #153959 )
...
An example generated file: P1816629015
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153959
Approved by: https://github.com/malfet , https://github.com/desertfire
ghstack dependencies: #153964
2025-05-21 21:55:59 +00:00
8568dbce1d
[inductor] Clean typing in codegen/common.py and codecache.py ( #150767 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150767
Approved by: https://github.com/aorenste
2025-05-17 13:56:50 +00:00
33a5179269
[AOTI][reland2] Remove typedef for half and bfloat16 ( #153467 )
...
Summary:
Reland https://github.com/pytorch/pytorch/pull/151109 after fixing cutlass AOTI build issues.
typedef is prone to name collision. Explicitly spell out the actual aten types, needed for the standalone AOTI codegen.
Differential Revision: D74398762
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153467
Approved by: https://github.com/jingsh , https://github.com/henrylhtsang , https://github.com/cyyever
2025-05-14 02:37:18 +00:00
9fa07340fd
[Cutlass] Implement memory planning for EVT ( #153177 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153177
Approved by: https://github.com/henrylhtsang
ghstack dependencies: #153196 , #150907
2025-05-09 05:39:05 +00:00