38 Commits

Author SHA1 Message Date
b11d5cd584 [Inductor UT][Windows][XPU] Fix Inductor UT on XPU Windows. (#146481)
This PR fixed all the inductor UT failures for XPU backend on Windows we found in local machine(Due to resource constraints, we have not yet set up a Windows CI pipeline online.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146481
Approved by: https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #147347
2025-02-22 02:53:16 +00:00
a82bab6419 Run only listed tests on s390x (#140265)
Skip tests that are failing

This was previously part of https://github.com/pytorch/pytorch/pull/125401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140265
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-20 22:53:09 +00:00
8cfe28e4e3 [Inductor] Pick ISA for inductor based on ATEN_CPU_CAPABILITY (#123514)
It is part of https://github.com/pytorch/pytorch/issues/123224. Pick ISA based on the environment ATEN_CPU_CAPABILITY to control CPU vec ISA level for Inductor like eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123514
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-10-17 09:06:57 +00:00
0b7ef196cd Use filelock to build extension_device backend one at a time (#137930)
Fixes https://github.com/pytorch/pytorch/issues/136125
Fixes https://github.com/pytorch/pytorch/issues/137026
Fixes https://github.com/pytorch/pytorch/issues/137027

The compilation fails during `setUpClass`, so disabling the test doesn't do nothing.  The theory I have for this flaky issue is that `test_open_device_registration` from both `TritonExtensionBackendTests` and `ExtensionBackendTests` are run in parallel and cleaned up while the other is still in fly, causing flaky failure.

Here is an example failure https://github.com/pytorch/pytorch/actions/runs/11331105492/job/31512603585#step:22:1710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137930
Approved by: https://github.com/malfet
2024-10-15 17:46:28 +00:00
c254901bdb Have Triton custom extension test use privateuseone device (#137611)
The original PR #122396 used the CPU device since at that point in time
there was no actual Triton CPU backend. After #133408, this is no longer
the case, so we now have multiple backends getting registered for the
CPU. The test still works in OSS but fails internally due to different
test runners initializing the backends in a different order.

This PR doesn't actually end up fixing the test internally because
cpp_extension -- needed to implement the privateuseone device -- isn't
supported there, so we simply skip it instead. However, it still makes the
OSS test independent of initialization order, which is good.

Differential Revision: [D63838169](https://our.internmc.facebook.com/intern/diff/D63838169/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137611
Approved by: https://github.com/henrylhtsang
2024-10-11 21:27:29 +00:00
b86269fab5 Unify cpp_extension build directory removal (#136059)
Keeps existing default directory clearing logic, even though it fails when TORCH_EXTENSIONS_DIR is set. To properly clear, we'd need to track all the folders we compiled the extensions to.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136059
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-10-03 06:22:11 +00:00
c13c7e11c5 Revert "[Inductor] Pick ISA for inductor based on ATEN_CPU_CAPABILITY (#123514)"
This reverts commit 6931c1644afdba53e63ce5671455e4e1b7265dd9.

Reverted https://github.com/pytorch/pytorch/pull/123514 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its test_cpu_repro test is failing in trunk 6931c1644a ([comment](https://github.com/pytorch/pytorch/pull/123514#issuecomment-2383563919))
2024-09-30 15:47:04 +00:00
6931c1644a [Inductor] Pick ISA for inductor based on ATEN_CPU_CAPABILITY (#123514)
It is part of https://github.com/pytorch/pytorch/issues/123224. Pick ISA based on the environment ATEN_CPU_CAPABILITY to control CPU vec ISA level for Inductor like eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123514
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-30 00:53:18 +00:00
8c356ce3da Fix lint errors in fbcode (#135614)
Summary: Fixed a bunch of fbcode imports that happened to work but confused autodeps.  After this autodeps still suggests "improvements" to TARGETS (which breaks our builds) but at least it can find all the imports.

Test Plan:
```
fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/TARGETS fbcode/caffe2/test/TARGETS
```
Before:
```
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/testing.py:229) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fbur$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export.py:87) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_serdes.py:9) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fb$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_serdes.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_retraceability.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https:$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_retraceability.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See ht$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_nonstrict.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See http$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_nonstrict.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:8) when processing rule "test_export". Please make sure it's listed in the srcs parameter of an$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Found "//python/typeshed_internal:typeshed_internal_library" owner for "cv2" but it is protected by visibility rules: [] (from caffe2/test/test_bundled_images.py:7) when processing rule "test_bundled_$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "caffe2.test.profiler_test_cpp_thread_lib" (from caffe2/test/profiler/test_cpp_thread.py:29) when processing rule "profiler_test_cpp_thread". Please make sure it's listed in t$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_custom_ops.py:23) when processing rule "custom_ops". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_public_bindings.py:13) when processing rule "public_bindings". Please make sure it's listed in the srcs paramete$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.symbolize_tracebacks" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.gather_traceback" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another rule$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for include <torch/csrc/autograd/profiler_kineto.h> (from caffe2/test/profiler/test_cpp_thread.cpp:2) when processing profiler_test_cpp_thread_lib.  Some things to try:
```

Differential Revision: D62049222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135614
Approved by: https://github.com/oulgen, https://github.com/laithsakka
2024-09-13 02:04:34 +00:00
134bc4fc34 [BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763
Approved by: https://github.com/jansel
2024-07-18 07:49:19 +00:00
b732b52f1e Revert "[BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)"
This reverts commit aecc746fccc4495313167e3a7f94210daf457e1d.

Reverted https://github.com/pytorch/pytorch/pull/129763 on behalf of https://github.com/XuehaiPan due to need reland after rerunning lintrunner on main ([comment](https://github.com/pytorch/pytorch/pull/129763#issuecomment-2235736732))
2024-07-18 06:39:58 +00:00
aecc746fcc [BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763
Approved by: https://github.com/jansel
2024-07-18 05:13:41 +00:00
76259ebfdd [inductor] split cpu vec isa to dedicate file (keep git history) (#129789)
This PR is the implemention of https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902 plan 1

Changes:
1. Duplicate `codecache.py` to `cpu_vec_isa.py` with its `git history`.
<img width="745" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/106533da-ce80-4825-8271-35ffb3141f92">

2. Make `cpu_vec_isa.py` as dedicate file for CPU vec isa. It also good to extend for more archtectures and vec isa.
3. Update code for above changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129789
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-02 05:29:05 +00:00
19e17216a2 Revert "[inductor] split cpu vec isa to dedicate file (keep git history) (#129789)"
This reverts commit 58f346c874a8a982679b4d4f3876602cc05d66d4.

Reverted https://github.com/pytorch/pytorch/pull/129789 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/129577 ([comment](https://github.com/pytorch/pytorch/pull/129789#issuecomment-2200545144))
2024-07-01 16:08:44 +00:00
58f346c874 [inductor] split cpu vec isa to dedicate file (keep git history) (#129789)
This PR is the implemention of https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902 plan 1

Changes:
1. Duplicate `codecache.py` to `cpu_vec_isa.py` with its `git history`.
<img width="745" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/106533da-ce80-4825-8271-35ffb3141f92">

2. Make `cpu_vec_isa.py` as dedicate file for CPU vec isa. It also good to extend for more archtectures and vec isa.
3. Update code for above changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129789
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-29 07:19:54 +00:00
4bbadeee8a Revert "Set simdlen based on ATEN_CPU_CAPABILITY (#123514)"
This reverts commit b66e3f0957b96b058c9b632ca60833d9717a9d8a.

Reverted https://github.com/pytorch/pytorch/pull/123514 on behalf of https://github.com/clee2000 due to broke test/inductor/test_torchinductor.py::CpuTests::test_new_cpp_build_logical_cpu on periodic test on the no gpu tests b66e3f0957 https://github.com/pytorch/pytorch/actions/runs/9453518547/job/26040077301 ([comment](https://github.com/pytorch/pytorch/pull/123514#issuecomment-2159433432))
2024-06-10 22:46:01 +00:00
b66e3f0957 Set simdlen based on ATEN_CPU_CAPABILITY (#123514)
It is part of https://github.com/pytorch/pytorch/issues/123224. Set simdlen based on the environment ATEN_CPU_CAPABILITY to control CPU vec ISA like eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123514
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-06-10 09:02:14 +00:00
68a1f787c8 [inductor][cpp] move some common cpp utils to cpp_utils.py (#125152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125152
Approved by: https://github.com/desertfire, https://github.com/jansel
2024-05-06 04:30:30 +00:00
9dded148d0 Fix test_extension_backend on non-AVX systems (#117272)
The test checks for a substring "loadu" in generated code. On AVX systems that line is:
> auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(i0))
however on non-AVX systems it is
> auto tmp0 = in_ptr0[static_cast<long>(i0)];

the difference depends on `codecache.valid_vec_isa_list()` being non-empty. See torch/_inductor/codegen/cpp.py:2639

Modify the test to account for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117272
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-04-24 02:55:12 +00:00
46903d978b fix maybe_initialize_device for custom device. (#121379)
1. fix maybe_initialize_device for custom device.
@wanchaol  @albanD

@albanD  I am very sorry that I have resubmitted a PR by new e-mail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121379
Approved by: https://github.com/albanD
2024-04-09 16:58:52 +00:00
f8eeae7aaa Enable CPP wrapper codegen registration (#121296)
Extend code gen registration for `CppWrapper`. W/ this PR, an new backend can register its specific `CppWrapper` at runtime.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121296
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-03-26 06:51:03 +00:00
50036ec781 [Inductor] Add a test for creating a cpu inductor-> triton backend (#122396)
Summary: Currently there is a test for adding a backend in test/inductor/test_extension_backend.py for a cpp backend with a new device. However there is no such test for the Triton backend; it should be possible for a user to create and register your own ExtensionWrapperCodegen and ExtensionSchedulingfor another non-CUDA device and be able to generate Triton code. For simplicity I have chosen to use a CPU device, as I think it's plausible someone might want to create a CPU Triton backend.

Unfortunately the generation and running of the code is quite tightly coupled so I've had to use a mocked function to extract the code before running. Suggestions are welcome for better ways to do this.

This is a stepping off point for some additional PRs to make the Triton code path less CUDA specific, as currently there would be no way to test this avenue.

Test plan:
```
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('intermediate_hooks', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 1 test in 0.394s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122396
Approved by: https://github.com/jansel
2024-03-23 01:14:57 +00:00
4cd503c1f3 Enable FX graph cache for a batch of inductor tests (#121696)
Summary: Get more FX graph cache coverage by enabling it for these unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121696
Approved by: https://github.com/eellison
2024-03-14 03:39:59 +00:00
6bffde99b0 Revert "[inductor] Move things into torch/testing/_internal/inductor_utils.py (#113275)"
This reverts commit 66d09f82170c528698b5ec606ba7838268ae1f8a.

Reverted https://github.com/pytorch/pytorch/pull/113275 on behalf of https://github.com/huydhn due to Sorry for reverting your stack, but it is failing to list test internally with buck2 ([comment](https://github.com/pytorch/pytorch/pull/113275#issuecomment-1811666004))
2023-11-15 01:44:26 +00:00
1e60174891 Revert "[dynamo] Add run_inductor_tests entrypoint (#113278)"
This reverts commit b00311ce9e430cf1b98d2103e21ed2179450a424.

Reverted https://github.com/pytorch/pytorch/pull/113278 on behalf of https://github.com/huydhn due to Sorry for reverting your stack, but it is failing to list test internally with buck2 ([comment](https://github.com/pytorch/pytorch/pull/113278#issuecomment-1811646325))
2023-11-15 01:19:48 +00:00
b00311ce9e [dynamo] Add run_inductor_tests entrypoint (#113278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113278
Approved by: https://github.com/yanboliang
2023-11-11 08:54:43 +00:00
66d09f8217 [inductor] Move things into torch/testing/_internal/inductor_utils.py (#113275)
This PR is just moving things around, so code shared by multiple tests files is in torch/testing/_internal/inductor_utils.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113275
Approved by: https://github.com/yanboliang
ghstack dependencies: #113242
2023-11-11 03:17:35 +00:00
68bf0f1e7d Revert "[inductor] Move things into torch/testing/_internal/inductor_utils.py (#113275)"
This reverts commit c967dc526a40f4b15003f9c99383acabe66367a6.

Reverted https://github.com/pytorch/pytorch/pull/113275 on behalf of https://github.com/PaliC due to the diff this is stacked on top of appears to be causing inductor failures internally ([comment](https://github.com/pytorch/pytorch/pull/113275#issuecomment-1805131017))
2023-11-10 05:40:55 +00:00
c967dc526a [inductor] Move things into torch/testing/_internal/inductor_utils.py (#113275)
This PR is just moving things around, so code shared by multiple tests files is in torch/testing/_internal/inductor_utils.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113275
Approved by: https://github.com/yanboliang
2023-11-10 00:11:09 +00:00
ae85ba820f [inductor] Memory planning (#112178)
This was originally @jansel's PR:
https://github.com/pytorch/pytorch/pull/102625, which I've built upon.

This diff implements static memory planning. It's disabled by default
while we examine its performance.

We use a greedy-by-size approach. For dynamic shapes, the sizes of the
example inputs are used as estimates when making planning decisions. We
generate expressions to calculate the actual memory offsets and sizes at
runtime when the values of the dynamic shapes are known. In order to
simplify these calculations, we have organized the allocations into a
tree that branches on space (address offsets) and time (live ranges).
Finally, we need to align these offsets, so we have added an `align`
sympy Expr to express these calculations.

Some limitations:

1. It is only enabled during inference for now. Enabling it for training
   increases peak memory usage as we allocate all the memory needed for
   training upfront, before freeing the memory allocated during
   inference. We can probably address this by doing planning for both
   the inference and training passes together.
2. It doesn't work with PyTorch Distributed, because kernels like
   AllGatherIntoTensor codegen strings which do memory operations. We
   can fix this down the line by having them emit MemoryPlanningLines
   instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112178
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-11-02 07:39:13 +00:00
74e6c877e9 Revert "[inductor] Memory planning (#112178)"
This reverts commit f64a97c6f88873363c5b3c4c33f231b5578085b2.

Reverted https://github.com/pytorch/pytorch/pull/112178 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems that ROCm will need to be fixed for the new test too f64a97c6f8 ([comment](https://github.com/pytorch/pytorch/pull/112178#issuecomment-1788195311))
2023-11-01 00:03:56 +00:00
f64a97c6f8 [inductor] Memory planning (#112178)
This was originally @jansel's PR:
https://github.com/pytorch/pytorch/pull/102625, which I've built upon.

This diff implements static memory planning. It's disabled by default
while we examine its performance.

We use a greedy-by-size approach. For dynamic shapes, the sizes of the
example inputs are used as estimates when making planning decisions. We
generate expressions to calculate the actual memory offsets and sizes at
runtime when the values of the dynamic shapes are known. In order to
simplify these calculations, we have organized the allocations into a
tree that branches on space (address offsets) and time (live ranges).
Finally, we need to align these offsets, so we have added an `align`
sympy Expr to express these calculations.

Some limitations:

1. It is only enabled during inference for now. Enabling it for training
   increases peak memory usage as we allocate all the memory needed for
   training upfront, before freeing the memory allocated during
   inference. We can probably address this by doing planning for both
   the inference and training passes together.
2. It doesn't work with PyTorch Distributed, because kernels like
   AllGatherIntoTensor codegen strings which do memory operations. We
   can fix this down the line by having them emit MemoryPlanningLines
   instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112178
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-10-31 20:02:30 +00:00
4ed4753ac3 [inductor][easy] skip test_extension_backend.py in fbcode (#111591)
Summary: It's currently failing. We should skip it in fbcode because cpp extensions don't work right now.

Differential Revision: D48852412

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111591
Approved by: https://github.com/desertfire
2023-10-23 22:37:13 +00:00
2bb59a9ac6 Fix vscode test discovery (#107404)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107404
Approved by: https://github.com/wconstab
2023-08-18 03:50:46 +00:00
9921b48558 Extend Inductor to support the third-party backend (#106874)
## Summary

This is re-land PR for https://github.com/pytorch/pytorch/pull/100706 to address the compilation latency performance regression.

## Root Cause

Regarding the C++/OpenMP backend,  `codecache.pick_vec_isa()` to check vectorization ISA is a time-consuming and one-shot operation. It leads to taking a longer time to import `codegen.cpp` package because the `LoopLevel` of the package is decorated by `@dataclasses.dataclass` while the decorator will invoke `codecache.pick_vec_isa()` to initialize the `simd_nelements` of the `LoopLevel`.
c14cf312c9/torch/_inductor/codegen/cpp.py (L2883C53-L2883C53)

In terms of the Triton backend, it does not need to touch it. But we'd prefer to uniform the code. Therefore, the new design simultaneously registers `CpuScheduling` for CPU and `TritonScheduling` for Triton regardless of whether the current backend is Triton. It will bring additional overhead to the Triton backend.

```python
def init_backend_registration(self):
    if get_scheduling_for_device("cpu") is None:
        from .codegen.cpp import CppScheduling

        register_backend_for_device("cpu", CppScheduling, WrapperCodeGen)

    if get_scheduling_for_device("cuda") is None:
        from .codegen.triton import TritonScheduling

        register_backend_for_device("cuda", TritonScheduling, WrapperCodeGen)
```

## Solution

To resolve the compilation latency regression for the Triton backend, we changed the `LoopLevel` a little bit([new code changes](https://github.com/pytorch/pytorch/pull/106874/files#diff-5ab7b0235e2076a5fc6629ba0b109208940f5b94f5c13babc3e0f87cf4fcec82R2893-R2904)) by moving the `simd_nelements` to `__post_init__` and the compilation performance would be back.

## Compilation Latency Performance Result
We ran a single model benchmark and reproduced the compilation regression:

- Run `python benchmarks/dynamo/torchbench.py -dcuda --training --performance --inductor --only hf_Bart`

- W/ PR #100706, the compilation latency is about **57~58**
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.556712,109.676554,57.055242,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.646658,109.621747,57.909817,0.936330,5.760698,6.152422,642,1,8,7
```

- W/O PR #100706, the compilation latency is about **46~47**
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.599065,108.702480,47.490346,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.588419,108.431411,46.983041,0.936330,5.760698,6.152422,642,1,8,7
```

This PR fixed the compilation performance regression.

- W/ this PR #106874, the compilation latency is about **47~48**
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.586261,108.149467,47.481058,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.758915,108.613899,47.925633,0.936330,5.760698,6.152422,642,1,8,7
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106874
Approved by: https://github.com/jansel
2023-08-16 04:11:36 +00:00
1819fe1324 Revert "Extend Inductor to support the third-party backend (#100706)" (#106652)
This reverts commit 05bd24bb3548105776cf73226927cbd0ed575c55.

It caused compilation time regression on torchbench, huggingface and dynamic models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106652
Approved by: https://github.com/davidberard98, https://github.com/voznesenskym
2023-08-05 06:41:08 +00:00
83e36fe127 Fix vscode test discovery (#106490)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106490
Approved by: https://github.com/eellison
2023-08-03 09:02:22 +00:00
05bd24bb35 Extend Inductor to support the third-party backend (#100706)
This PR intends to extend Inductor to support the third-party backend that only focuses on the code generation just like what C++/OpenMP and Triton backend have done.

Currently, the generated code by Inductor contains two major parts. One is the kernel, and the other is the Python wrapper to glue the kernel. Therefore, the third-party backend needs to customize the two parts to generate its specific code.

- Python wrapper code generation

  Inductor provides a `WrapperCodeGen` class to generate the Python wrapper code to glue the kernel. Therefore, it is straightforward for the third-party backend to generate the backend-specific Python wrapper code. It just needs to inherit the `WrapperCodeGen` class and purposely override the particular member functions.

- Kernel code generation

  It is driven by different `Scheduling`. Hence, the third-party backend needs to provide a custom `Scheduling` for its specific kernel code generation. Currently, `CppScheduling` and `TritonScheduling` are for C++/OpenMP and Triton backend, respectively. But there is no common `Scheduling` class. Based on the scheduling invocation, this PR abstracts a common `Scheduling` class containing the following member functions.

  -   [group_fn](71c4becda7/torch/_inductor/scheduler.py (LL649C64-L649C64))
  - [flush](71c4becda7/torch/_inductor/scheduler.py (L1150))
  - [can_fuse_vertical](71c4becda7/torch/_inductor/scheduler.py (L1006))
  - [can_fuse_horizontal](71c4becda7/torch/_inductor/scheduler.py (LL1008C45-L1008C64))
  - [codegen_template](71c4becda7/torch/_inductor/scheduler.py (L1234)) _This function is only available for triton. If the third-party backend behaves as a sub-class of `TritonScheduling`, it can override it or reuse it._
  - [codegen_nodes](71c4becda7/torch/_inductor/scheduler.py (L1234))
  - [codegen_sync](71c4becda7/torch/_inductor/scheduler.py (LL1251C1-L1251C1)). _This function is only available for triton debug purpose. But it might also be useful for other computation devices. Therefore, we'd prefer to keep this function._

  The third-party backend needs to inherit from the `Scheduling` class and implement these functions.

Regarding some other classes like `CppKernel` and `TritonKernel` for code generation, they are used by or part of the logic of either `Scheduling` or `WrapperCodeGen`. Hence, this PR does not define the interface and leaves the flexibility to the third-party backend. The third-party backend can decide to implement these classes from scratch or reuse them by inheriting and overriding them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100706
Approved by: https://github.com/jansel
2023-08-02 05:13:51 +00:00