Commit Graph

553 Commits

8fbd7d92a8 Inductor-CPU WoQ int8 GEMM micro-kernel with scale epilogue (#131887)
## Summary

As part of #125683, this PR modifies existing CPU GEMM cpp template & micro-kernel template to enable int8 WoQ GEMM auto-tuning with AVX2, AVX512 & AMX ISAs (the latter is only available on Xeon 4th generation & beyond).

WoQ GEMM takes FP16/BF16 activations, int8 weights, and a scale of the same dtype as the activations.
The operation is equivalent to `torch.nn.functional.linear(x, w.to(x.dtype)) * scale`, which is essentially what the ATen op `torch.ops.aten._weight_int8pack_mm` currently does (except that it does not cache the weights). Here the weights are treated as constant and cached, so this implementation is suitable for inference, not QAT. `scale` is supported as a `mul` epilogue.
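
For reference, a minimal sketch of these semantics in plain PyTorch (illustrative shapes; a per-output-channel scale is assumed):

```python
import torch

# Reference semantics of the WoQ GEMM described above: BF16 activations,
# int8 weights, per-output-channel scale applied as a `mul` epilogue.
def woq_int8_linear_ref(x: torch.Tensor, w_int8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.linear(x, w_int8.to(x.dtype)) * scale

M, K, N = 4, 64, 128                                   # illustrative shapes
x = torch.randn(M, K, dtype=torch.bfloat16)
w = torch.randint(-128, 128, (N, K), dtype=torch.int8)
scale = torch.rand(N, dtype=torch.bfloat16)
out = woq_int8_linear_ref(x, w, scale)                 # shape (M, N), bfloat16
```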

Only BF16 activations are supported in this PR because, for FP16 & FP32 activations, the weight is dequantized during the constant-folding pass of freezing; after auto-tuning, performance with a large `M` dimension may then be better than either `torch.ops.aten._weight_int8pack_mm` or the WoQ micro-kernel support introduced in this PR, which dequantizes `w` within the micro-kernel.
While even BF16 activations with a large `M` dimension may benefit from dequantizing `w` beforehand, for now they use the WoQ support in the GEMM templates for auto-tuning; a subsequent PR will add logic for deciding whether or not to dequantize weights beforehand.

### Performance
#### AMX
Op-level speedup due to the AMX micro-kernel (selected during auto-tuning) on 32 physical cores of Intel(R) Xeon(R) Platinum 8468H (Xeon 4th generation, codenamed Sapphire Rapids) vs. the ATen kernel `torch.ops.aten._weight_int8pack_mm`. Intel OpenMP & tcmalloc were preloaded.

In a few cases with an odd `K`, the implementation added in this PR may not perform as well as the ATen kernel. This is unrelated to this PR, though, since `test_linear_amx` exhibits similar datapoints. In those cases, the AMX micro-kernel might be slower than the AVX512 micro-kernel, so if such shapes are used for auto-tuning, either the AVX512 micro-kernel or the ATen kernel would be chosen instead.

Benchmarked with unit-tests.

Tabular data at https://gist.github.com/sanchitintel/294811a86c8ff6b867c668ae2107c405?permalink_comment_id=5142442#gistcomment-5142442

The AVX512 micro-kernel was disabled to collect data for AMX micro-kernel.

#### AVX2/AVX512 micro-kernels

Tabular data at https://gist.github.com/sanchitintel/52b5fa9c66f791be19e48e2aa6423dc4?permalink_comment_id=5142437#gistcomment-5142437

### Follow-up
1. An int4 WoQ GEMM micro-kernel will be added in a separate PR.
2. A subsequent PR will add logic for deciding whether or not to dequantize weights beforehand.

E2E perf measurement should be done with #131310.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131887
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-08-10 02:01:04 +00:00
065f7aa44b [inductor] tensor_is_align fallbacking False if unbacked expr not comptime evaled (#132423)
Currently, if `storage_offset` is an unbacked symbol and `is_align` cannot be computed at compile time, it hard-fails.

Doing the best we can: add `guard_size_oblivious` and fall back to False if the expression cannot be evaluated at compile time.
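
A sketch of the fallback pattern, assuming a simplified alignment check (the real `tensor_is_aligned` lives in Inductor and may differ in detail):

```python
import torch
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious

ALIGNMENT_BYTES = 16  # assumed alignment, for illustration only

def tensor_is_aligned_sketch(t: torch.Tensor) -> bool:
    try:
        # guard_size_oblivious lets size-oblivious reasoning resolve the check
        # for many unbacked expressions without guarding on a concrete value.
        return guard_size_oblivious(
            (t.storage_offset() * t.element_size()) % ALIGNMENT_BYTES == 0
        )
    except Exception:
        # Still not evaluable at compile time: fall back to False instead of
        # hard-failing.
        return False
```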

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132423
Approved by: https://github.com/ezyang
2024-08-09 15:07:42 +00:00
700a11fdd4 Make inductor kernel metadata comments more descriptive (#126698)
Summary:

A couple of improvements to the generated comments in inductor kernels:

1. Makes the nodes in the comment topologically sorted; having them
   alphabetically sorted is a gotcha. I was always confused about why the
   sorting in the comments did not match the code.
2. Adds a printout of the aten graph fragment corresponding to the
   current inductor kernel, to make it easier to map from aten
   code to inductor code.

Example float8-overhead-related inductor kernel comment after this PR:

```
# kernel path: /tmp/torchinductor_vasiliy/27/c27ts3rdw56ns7od5j6ovdnhxphished2lcu3adclzzixoo7khg5.py
# Source Nodes: [weight_fp8], Original ATen: [aten.mul, aten.clamp, aten._to_copy]
# Source node to ATen node mapping:
#   weight_fp8 => clamp_max_1, clamp_min_3, convert_element_type_10, convert_element_type_11, convert_element_type_9, mul_3
# Graph fragment:
#   %mul_3 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%primals_2, %convert_element_type_8), kwargs = {})
#   %convert_element_type_9 : [num_users=1] = call_function[target=torch.ops.prims.convert_element_type.default](args = (%mul_3, torch.float32), kwargs = {})
#   %clamp_min_3 : [num_users=1] = call_function[target=torch.ops.aten.clamp_min.default](args = (%convert_element_type_9, -448.0), kwargs = {})
#   %clamp_max_1 : [num_users=1] = call_function[target=torch.ops.aten.clamp_max.default](args = (%clamp_min_3, 448.0), kwargs = {})
#   %convert_element_type_10 : [num_users=1] = call_function[target=torch.ops.prims.convert_element_type.default](args = (%clamp_max_1, torch.bfloat16), kwargs = {})
#   %convert_element_type_11 : [num_users=1] = call_function[target=torch.ops.prims.convert_element_type.default](args = (%convert_element_type_10, torch.float8_e4m3fn), kwargs = {})
triton_poi_fused__to_copy_clamp_mul_5 = async_compile.triton('triton_', '''
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126698
Approved by: https://github.com/ezyang
ghstack dependencies: #126573
2024-08-07 21:25:09 +00:00
1c7dc335f7 [ROCm][CK][Inductor] Enable addmm for CK backend to gemm max autotune (#130576)
Add functional support for torch.addmm with CK backend. See also #125453

# Implementation details
1. It turns out we can use the same template for addmm and matmul; essentially, matmul is addmm with an empty bias (see the sketch after this list).
2. The Python generator in CK was updated to generate the shared cpp template. The pip package can be installed via `pip install git+https://github.com/rocm/composable_kernel@add-addmm` and will be merged into the `develop` branch after this PR lands, to avoid breaking the current matmul.
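
A quick sketch of point 1 in plain PyTorch (the CK template itself is C++; this only illustrates the addmm/matmul relationship):

```python
import torch

a, b, bias = torch.randn(8, 16), torch.randn(16, 32), torch.randn(32)

# addmm computes bias + a @ b ...
out_addmm = torch.addmm(bias, a, b)
assert torch.allclose(out_addmm, a @ b + bias, atol=1e-5)

# ... and matmul is the special case with a zero/absent bias, which is why
# one shared template can serve both lowerings.
out_mm = torch.addmm(torch.zeros(32), a, b)
assert torch.allclose(out_mm, a @ b, atol=1e-5)
```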

# Testing
`pytest test/inductor/test_ck_backend.py -k addmm`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130576
Approved by: https://github.com/chenyang78
2024-08-05 17:49:09 +00:00
72d2dba992 Add None return type to init (#132335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132335
Approved by: https://github.com/albanD
2024-08-01 15:26:45 +00:00
f9e4d05c15 Save and run post compilation steps within FXGraphCache (#130572)
This PR mostly refactors by moving code into utils files so that it can be shared between codecache.py and compile_fx.py. It then changes compile_fx so that:
- When saving to FXGraphCache, we save onto the CompiledFXGraph all the necessary metadata for running post compile steps (realigning inputs, cudagraphification).
- When loading from FXGraphCache, we use the saved information directly, instead of calculating them from scratch.

What this does is make it so that `FXGraphCache.load()` is a perfect cache on compile_fx_inner, in that it **returns exactly what compile_fx_inner returns**. This also makes it possible for AOTAutogradCache, given a key to the fx graph cache and example inputs, to get back the full return value of compile_fx_inner.

## What's a post compile step?
We define a **post-compile** to be the set of actions that need to run after FXGraphCache either loads from the cache or misses and runs compilation. These steps include:
- Setting the tracing context's output strides
- Running cudagraphs if enabled
- Maybe realign inputs if cudagraphs didn't run

To run these steps, we save all the necessary metadata in CompiledFxGraph, and use them on a cache hit to reconstruct the object.
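
A hedged sketch of the idea (illustrative names, not the actual CompiledFxGraph fields):

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class CachedFxGraphSketch:
    compiled_fn: Callable[..., Any]
    output_strides: List[Any] = field(default_factory=list)
    cudagraph_fail_reasons: List[str] = field(default_factory=list)
    fx_kwargs: Dict[str, Any] = field(default_factory=dict)  # part of the cache key

def post_compile_sketch(entry: CachedFxGraphSketch):
    # On a cache hit, every decision must come from saved metadata: we no
    # longer have the input graph module to re-run the cudagraph checks.
    if entry.fx_kwargs.get("cudagraphs") and not entry.cudagraph_fail_reasons:
        return wrap_with_cudagraphs(entry.compiled_fn)   # placeholder step
    return maybe_realign_inputs(entry.compiled_fn)       # placeholder step

def wrap_with_cudagraphs(fn):   # placeholder for the real cudagraphify step
    return fn

def maybe_realign_inputs(fn):   # placeholder for the input-realignment step
    return fn
```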

## Splitting cudagraphs work into pre/post compile
Cudagraphs does a lot of work on the input graph module to determine whether cudagraphs can be enabled. This is the code that involves cudagraph tests and stack traces. That works in a world where we have access to the input graph module, but with AOTAutograd warm start we won't have access to that information anymore. Therefore we split the cudagraphs work into two parts: on a cache miss (and therefore a full compile), we do the cudagraphs testing work and save cudagraph_fail_reasons into the cache. Then on a cache hit, we know whether or not we can run cudagraphs, and if we can't, we can emit the correct error messages.

Implementation notes:
- We save `fx_kwargs` directly onto the CompiledFXGraph. `fx_kwargs` is already, by definition, part of the cache key, so this is safe to do when it comes to cache correctness.
- ^ Why do we do the above even though FXGraphCache.load takes fx_kwargs as an argument? Because AOTAutogradCache **doesn't** have access to fx_kwargs: they're annoyingly encoded in the functools.partial() of the fw_compiler, so *only* inductor knows about these options. They're fully captured by the AOTAutogradCache key (since every key in fx_kwargs is either a global config or a field that's deterministic based on the input graph module), but their values are still needed to run cudagraphs/postprocessing. Therefore, it's easier/safer to store them on the cached result.
- I'm willing to hear other approaches here if we think saving these extra fields is not reasonable, though I can't think of another way to do this that's less complicated to explain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130572
Approved by: https://github.com/eellison
2024-07-31 18:32:40 +00:00
e7eeee473c [BE][Easy][14/19] enforce style for empty lines in import segments in torch/_[a-c]*/ and torch/_[e-h]*/ and torch/_[j-z]*/ (#129765)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by the linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129765
Approved by: https://github.com/ezyang
2024-07-31 10:42:50 +00:00
882d80fd92 Add lowering for updated _scaled_mm (fixing submodules) (#130422)
Add the Inductor lowering for `torch._scaled_mm`, whose API was last updated in https://github.com/pytorch/pytorch/pull/128683.

The lowering does:
- for tensor-wise scaling, auto-tune between the default ATen kernel (cuBLAS) and Triton kernel configurations.
- for row-wise scaling, auto-tune between the default ATen kernel (CUTLASS kernel added in https://github.com/pytorch/pytorch/pull/125204) and Triton kernel configurations.

The Triton kernel template is based on 3ad9031d02 (D56337896) by @choutim (without using SPLIT_K) and on the mm template in `torch/_inductor/kernel/mm.py`.
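
For reference, a rough usage sketch of the op being lowered (illustrative shapes and scales; assumes FP8-capable hardware such as H100):

```python
import torch

M, K, N = 256, 256, 256
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()  # mat2 must be column-major

# Tensor-wise scaling: a single scale per operand.
scale_a = torch.tensor(1.0, device="cuda")
scale_b = torch.tensor(1.0, device="cuda")

@torch.compile(mode="max-autotune")
def scaled_mm(a, b, sa, sb):
    # With this lowering, max-autotune tunes between the ATen kernel and the
    # Triton template for both tensor-wise and row-wise scaling.
    return torch._scaled_mm(a, b, scale_a=sa, scale_b=sb, out_dtype=torch.bfloat16)

out = scaled_mm(a, b, scale_a, scale_b)
# Row-wise scaling would instead pass per-row scales, e.g. scale_a of shape
# (M, 1) and scale_b of shape (1, N).
```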

## Testing:
- Logging shows max-autotune tuning (`AUTOTUNE scaled_mm`) for both tensor-wise and row-wise scaling.
- Row-wise scaling allows operator fusion between preceding pointwise/reduction op and amax/cast:
    - output code for `m=256, n=256, k=256, fusion_case='pointwise', scaling_mode='row'`
        - P1477224245 - 2 kernels
    - output code for `m=2048, n=256, k=2048, fusion_case='reduction', scaling_mode='row'`
        - P1477227340 - 2 kernels

- UT `python test/inductor/test_fp8.py -- TestFP8Lowering`

## Benchmarking

Eager/compiled tensor-wise/row-wise scaling for various shapes:
https://docs.google.com/spreadsheets/d/1VfWEVuyrwoWysfbS0_u2VHJ-PsdWkF1qIsiD60AzTes/edit?gid=2113587669#gid=2113587669
- Some of the “compiled” cases are slightly slower than “eager”. It’s because max-autotune selected the ATen kernel in the compiled case, and I think the discrepancy is variance.

Eager/compiled tensor-wise/row-wise scaling with pointwise/reduction preceding op for various shapes:
https://docs.google.com/spreadsheets/d/1Nv07NrdffQIoDeMjo9E0V-E-EYrEN0WysO_bn1bc6ns/edit?gid=1715488446#gid=1715488446

## Questions for reviewers:
- Should the accumulator type `ACC_TYPE` always be float32? If not, where is this type set (the output layout?)?

## Todo:
- Make the Triton template use the improved persistent kernel version (https://github.com/pytorch/FBGEMM/pull/2735 by @htyu)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130422
Approved by: https://github.com/ipiszy
2024-07-30 23:48:48 +00:00
9598c58618 Add config option to skip autotuning conv (#131839)
Requested internally because for some models the conv templates are not very helpful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131839
Approved by: https://github.com/oulgen
ghstack dependencies: #131400
2024-07-30 01:57:53 +00:00
5a2620302b [inductor] Replace self_cuda_time_total function calls with self_device_time_total for wrapper_bench (#131029)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131029
Approved by: https://github.com/shunting314
2024-07-30 01:57:39 +00:00
8b507a922a Mode to emulate amp numerics (#131595)
```
# Mode to emulate pytorch eager numerics for lower precision (fp16, bf16)
# Pytorch eager computes bf16/fp16 by upcasting inputs to fp32 and downcasting after
# For multiple, fused pointwise nodes, inductor will elide the intermediary upcasts and downcasts
# Typically this should be closer to fp64 ref numerics. However, it can be useful for debugging
# to emulate the eager numerics.
```

We add extra upcasts and downcasts for pointwise nodes that correspond to casts that existed in the original user program (excluding pointwise nodes that are emitted during decomposition). Since this is mostly for debugging, I added this information in the `meta` so that this mode does not have unintended side effects like changing pattern matching.
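
A small eager-only illustration of the numerics being emulated (not the Inductor pass itself; shapes and ops are arbitrary):

```python
import torch

def eager_style(x: torch.Tensor) -> torch.Tensor:
    # Eager bf16 pointwise ops upcast to fp32 and downcast after each op,
    # so the intermediate round-trips through bf16 between steps.
    y = x.float().sin().to(torch.bfloat16)
    return y.float().cos().to(torch.bfloat16)

def fused_style(x: torch.Tensor) -> torch.Tensor:
    # A fused kernel would normally elide the intermediate downcast/upcast
    # and keep the chain in fp32.
    return x.float().sin().cos().to(torch.bfloat16)

x = torch.randn(1024, dtype=torch.bfloat16)
print((eager_style(x).float() - fused_style(x).float()).abs().max())  # small, nonzero difference
```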

In theory there could also be some other casts with fused reduction -> reduction, although I haven't seen this in practice as much; it could be done as a follow-up. Note: this only works with the CUDA backend right now.

This mode was sufficient to eliminate compile differences from https://fb.workplace.com/groups/385893200869952/posts/464263173032954/?comment_id=465199259606012&reply_comment_id=465676792891592.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131595
Approved by: https://github.com/shunting314, https://github.com/bdhirsh, https://github.com/jansel
2024-07-29 22:42:23 +00:00
62b2e7a553 Revert "Add config option to skip autotuning conv (#131839)"
This reverts commit 3d4de8e96d0bb1fe19b25734a97a19dd85313692.

Reverted https://github.com/pytorch/pytorch/pull/131839 on behalf of https://github.com/eellison due to wrong config name ([comment](https://github.com/pytorch/pytorch/pull/131839#issuecomment-2257117221))
2024-07-29 22:31:51 +00:00
3d4de8e96d Add config option to skip autotuning conv (#131839)
Requested internally because for some models the conv templates are not very helpful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131839
Approved by: https://github.com/oulgen
ghstack dependencies: #131400
2024-07-29 16:43:58 +00:00
945bf78894 Revert "[BE] typing for decorators - fx/_compatibility (#131568)"
This reverts commit 193f62fde91ee20deb5ddcd9ff4593cd78d74c64.

Reverted https://github.com/pytorch/pytorch/pull/131568 on behalf of https://github.com/clee2000 due to same as https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359 but I clicked the wrong link by accident.  This is where it actually starts ([comment](https://github.com/pytorch/pytorch/pull/131568#issuecomment-2254330781))
2024-07-28 03:43:39 +00:00
aee6bcdba4 [Traceable FSDP2][Inductor] Apply compute/comm reordering passes to achieve overlap (#131614)
This PR enables the Inductor compute/comm reordering passes to Traceable FSDP2 to achieve overlap. Note that the overlap is not maximally optimized yet and the follow-up work will be done in subsequent PRs.

Test commands:
- `pytest -rA  test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131614
Approved by: https://github.com/yifuwang
ghstack dependencies: #131510
2024-07-27 08:39:58 +00:00
9e06572704 [Traceable FSDP2][Inductor] Create grouped nodes for FSDP2 all-gather code block and reduce-scatter code block (after Buffer/Operation split) (#131510)
This PR creates these `GroupedSchedulerNode`s:
- One for each all-gather code block (cast + copy-in + all-gather)
- One for each all-gather-wait code block (all-gather-wait + copy-out)
- One for each reduce-scatter code block (copy-in + reduce-scatter)
- One for each reduce-scatter-wait code block (reduce-scatter-wait)

This serves two goals:
- Prevent outside ops from being fused into these op groups, in order to have more predictable memory usage.
- Make it easier to specify the dependency e.g. from `i+1` all-gather group node to the `i` all-gather-wait group node, to enforce FSDP2 comm ordering (i.e. "serialization of comms").

The actual "reorder-for-FSDP-compute-comm-overlap" PR will come next.

Test commands:
- `pytest -rA  test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131510
Approved by: https://github.com/yifuwang
2024-07-27 08:39:58 +00:00
a90b8b967a [inductor] enable windows inductor UTs (#131767)
Changes:
1. Add a `skipIfWindows` function (a minimal sketch of such a helper follows this list).
2. Fix `fresh_inductor_cache` raising an error on Windows because loaded modules cannot be deleted.
3. Disable some UTs that do not pass on Windows.
4. Enable test_torchinductor in Windows CI.
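
A minimal sketch of what such a helper can look like (the actual signature in the PR may differ):

```python
import sys
import unittest

def skipIfWindows(msg="test does not pass on Windows"):
    # Minimal sketch: skip the decorated test when running on Windows.
    return unittest.skipIf(sys.platform == "win32", msg)

class ExampleTest(unittest.TestCase):
    @skipIfWindows("loaded modules cannot be deleted on Windows")
    def test_fresh_inductor_cache(self):
        ...
```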

I have tested this and it passes on my dev machine:
<img width="864" alt="image" src="https://github.com/user-attachments/assets/91d5a62f-7383-44b3-b614-99940f196fdb">

TODO: review and fix the skipped cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131767
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-27 02:46:03 +00:00
16d7cb5049 [CUDAGraph] Type annotation for cudagraph_trees.py (#131621)
As a Better Engineering effort, this PR adds type annotations to `cudagraph_trees.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131621
Approved by: https://github.com/eellison
2024-07-26 06:14:06 +00:00
03f49c9523 Revert "[CUDAGraph] Type annotation for cudagraph_trees.py (#131621)"
This reverts commit 16699c7d848fca669865d83ffff205bcbb8665be.

Reverted https://github.com/pytorch/pytorch/pull/131621 on behalf of https://github.com/atalman due to lint is failing, please rebase fix lint and reland ([comment](https://github.com/pytorch/pytorch/pull/131621#issuecomment-2251831163))
2024-07-26 02:08:45 +00:00
16699c7d84 [CUDAGraph] Type annotation for cudagraph_trees.py (#131621)
As a Better Engineering effort, this PR adds type annotations to `cudagraph_trees.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131621
Approved by: https://github.com/eellison
2024-07-26 01:40:23 +00:00
193f62fde9 [BE] typing for decorators - fx/_compatibility (#131568)
See #131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131568
Approved by: https://github.com/justinchuby, https://github.com/oulgen, https://github.com/zou3519
2024-07-25 22:24:19 +00:00
b6d477fd56 [BE][Easy][16/19] enforce style for empty lines in import segments in torch/_i*/ (#129768)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by the linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129768
Approved by: https://github.com/jansel
2024-07-20 16:20:58 +00:00
0b44e1a74c [inductor][cpp][gemm] optimize arbitrary N in packed gemm template (#130690)
Currently we require `n % register_block_n == 0`, which typically brings good perf when `n` is a multiple of 8, 16, 32, etc., and otherwise falls back to the reference micro-gemm (where `register_block_n == 1`). This PR optimizes this by padding `n` to a multiple of `register_block_n` (8, 16, 32, etc.) for the packed weight, so the micro-gemm can work as-is on the padded `n`. When the weight is padded, we use a local accumulation buffer to get the result from the micro-gemm and then unpad (slice) it before storing back to the output buffer.
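
An illustration of the pad-then-slice idea in plain PyTorch (the actual change is in the generated C++ GEMM template):

```python
import torch

def padded_gemm(x: torch.Tensor, w: torch.Tensor, register_block_n: int = 16) -> torch.Tensor:
    n = w.shape[1]
    n_padded = ((n + register_block_n - 1) // register_block_n) * register_block_n
    w_padded = torch.nn.functional.pad(w, (0, n_padded - n))  # pad N of the packed weight
    acc = x @ w_padded          # micro-gemm runs as-is on the block-aligned padded N
    return acc[:, :n]           # unpad (slice) the local accumulator before storing

x = torch.randn(512, 768)
w = torch.randn(768, 3073)      # N = 3073 is not a multiple of the register block
out = padded_gemm(x, w)
assert torch.allclose(out, x @ w, rtol=1e-4, atol=1e-4)
```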

Performance numbers measured on "Intel (R) Xeon (R) CPU Max 9480", single core, bf16.

Before
AUTOTUNE linear_unary(512x768, 3073x768, 3073)
  _linear_pointwise 2.3563 ms 100.0%
  cpp_packed_gemm_0 710.5902 ms 0.3%

After
AUTOTUNE linear_unary(512x768, 3073x768, 3073)
  cpp_packed_gemm_0 1.8909 ms 100.0%
  _linear_pointwise 2.1016 ms 90.0%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130690
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #130675
2024-07-20 06:30:15 +00:00
16aaff7783 Fix mm pad regresion - more conservative estimation of plannable inputs (#128909)
- More conservative estimation of plannable inputs
- Consider constant_pad_nd as pointwise node in concat lowering
- Use aten.cat instead of constant_pad_nd when padding just a single dimension because it can be memory-planned away

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128909
Approved by: https://github.com/Chillee
2024-07-18 16:42:30 +00:00
fc238db62a Separate AOTI Eager utils as a single file (#125819)
The key change is code movement: we moved the AOTI eager-related code from `torch._inductor.utils` to `torch._inductor.aoti_eager`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125819
Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/desertfire
2024-07-17 02:27:11 +00:00
c1e7e40f24 Revert "[Traceable FSDP2][Inductor] Re-inplace all_gather_into_tensor (#129773)"
This reverts commit f2f31027ce8dc3985663bf6eaa66f3c5559b724a.

Reverted https://github.com/pytorch/pytorch/pull/129773 on behalf of https://github.com/clee2000 due to failed inductor/test_torchinductor_dynamic_shapes.py on mac https://github.com/pytorch/pytorch/actions/runs/9963396991/job/27530249256 f2f31027ce.  The build failed on PR so test jobs didn't run ([comment](https://github.com/pytorch/pytorch/pull/129773#issuecomment-2231808437))
2024-07-16 20:54:14 +00:00
f2f31027ce [Traceable FSDP2][Inductor] Re-inplace all_gather_into_tensor (#129773)
FSDP2 eager pre-allocates the output buffer for AllGather, and the AllGather just writes into that buffer. However, under compile we use an out-of-place AllGather by default, which means that in the Traceable FSDP2 case we would unnecessarily use more memory than eager. We want to re-inplace that AllGather instead.

This PR adds a post_grad pass to re-inplace all_gather_into_tensor (i.e., changing it from the out-of-place op `all_gather_into_tensor.default` to the out-variant op `all_gather_into_tensor_out.default`).

One thing to note is that since with this pass we are introducing a mutable op into the post_grad FX graph, we must do this pass after `reinplace_inplaceable_ops` (at which point we are okay again with having mutable ops in the graph). To facilitate this, this PR adds a `post_grad_custom_post_reinplace_pass` extension point to allow user-defined post-reinplace FX passes.

---

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_fullgraph_backend_inductor`

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129773
Approved by: https://github.com/eellison
2024-07-16 20:07:41 +00:00
200d3d0a89 Remove static param counting if inlining NN modules (#130503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130503
Approved by: https://github.com/bdhirsh
ghstack dependencies: #130391, #130392
2024-07-16 00:25:34 +00:00
f52b2ee90f Modularize aten parameter parser and checker (#125308)
In this PR, we abstracted the different types of aten operation parameters as `ParameterMetadata`. This structure is intended to represent and store the metadata of each aten operation parameter. Currently, it only supports `Tensor`, `TensorList`, and `Scalar`.

```C++
using ParameterMetadataValue = std::variant<TensorMetadata, std::vector<TensorMetadata>, c10::Scalar>;
```

With this PR, we can extend support to other parameter types in a more modular way, like `string`, `int`, `double`.

Differential Revision: [D59399546](https://our.internmc.facebook.com/intern/diff/D59399546)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125308
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/atalman
2024-07-11 13:17:25 +00:00
10c831567b Make sympify'ing SymInt/etc produce their sympy expression (#130166)
There is one huge problem this fixes: today, sympify(symint)
produces a float(!!) because SymPy attempts to coerce the SymInt
to float inside sympify, and of course this works on SymInt.
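
A toy illustration of the mechanism (whether the actual fix uses exactly SymPy's `_sympy_` conversion hook is an assumption on my part):

```python
import sympy

class FloatCoercible:
    # Mimics the old failure mode: sympify falls back to float() coercion.
    def __float__(self):
        return 3.0

print(sympy.sympify(FloatCoercible()))   # 3.00000000000000 (a sympy Float)

class SymIntLike(FloatCoercible):
    # Exposing the underlying expression via SymPy's _sympy_ protocol makes
    # sympify return the expression instead of coercing through float.
    def _sympy_(self):
        return sympy.Symbol("s0")

print(sympy.sympify(SymIntLike()))       # s0
```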

However, this also has another nontrivial effect: anywhere in Inductor
where sympy expressions are passed around, it is also valid to pass
around a SymInt now.  I'm ambivalent about this: it's currently a
mistake to be passing around a SymInt when a sympy expression is
expected.  But maybe this is fine?

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130166
Approved by: https://github.com/yf225
2024-07-06 03:56:45 +00:00
b6379591a9 [Inductor][CPP] Pass weight dtype explicitly for cpp gemm template (#129221)
**Summary**
This PR mainly refactors two things:

1. Pass the weight's data type explicitly to `create_micro_gemm` as `input2.dtype`. When registering `CppMicroGemmConfig`, we will reuse `input.dtype` if `input2.dtype` is not explicitly registered.
2. Add a utility function to get the output data type and compute data type from the input data type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129221
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825, #129048, #129049, #129103, #129220
2024-07-02 13:06:32 +00:00
1a689ea38c [Inductor][CPP] Enable Quantized Linear GEMM Template with INT8 output and Unary Post Op (#129048)
**Summary**
Based on the previous PR, add the config to support int8 output and unary post-op fusion with `ReLU` and `GeLU`.

- Activation dtype: uint8
- Weight dtype: int8
- Output dtype: float32/bfloat16/uint8
- Post Op Fusion: with unary post operator fusion

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise
```

**Next Step**
- [✓] Unary post op fusion
- [✓] Int8 output
- [ ] Binary Fusion
- [ ] AMX int8 MicroGEMM Kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129048
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825
2024-06-30 09:53:55 +00:00
35a197defa [Inductor][CPP] Enable Quantized Linear GEMM Template with FP32 output (#128825)
**Summary**
Support the int8 GEMM template with the reference MicroInt8GEMM kernel for the following case:

- Activation dtype: uint8
- Weight dtype: int8
- Output dtype: float32/bfloat16
- Post Op Fusion: without unary post operator fusion

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise
```

**Next Step**
- [ ] Unary post op fusion
- [ ] Int8 output
- [ ] Binary Fusion
- [ ] AMX int8 MicroGEMM Kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128825
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-30 09:45:43 +00:00
b93bf55b6a [halide-backend] Add GPU support (#127506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127506
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417, #129025, #129026
2024-06-29 14:06:21 +00:00
87d14ad419 [inductor] Fix TORCHINDUCTOR_FORCE_DISABLE_CACHES (#129257)
Summary: See https://github.com/pytorch/pytorch/issues/129159; this option wasn't doing its job for a few reasons. In this PR:
* Fix the with_fresh_cache_if_config() decorator
* Reset the "TORCHINDUCTOR_CACHE_DIR" & "TRITON_CACHE_DIR" env vars in the sub-process so they pick up changes made in the parent process (sketched below)
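
A hedged sketch of the second point (illustrative helper, not the actual decorator):

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def fresh_cache_dirs_sketch():
    # Point both caches at a temporary directory and export the paths via the
    # environment so compile sub-processes observe the change too.
    with tempfile.TemporaryDirectory() as tmp:
        saved = {k: os.environ.get(k) for k in ("TORCHINDUCTOR_CACHE_DIR", "TRITON_CACHE_DIR")}
        os.environ["TORCHINDUCTOR_CACHE_DIR"] = os.path.join(tmp, "inductor")
        os.environ["TRITON_CACHE_DIR"] = os.path.join(tmp, "triton")
        try:
            yield tmp
        finally:
            for k, v in saved.items():
                if v is None:
                    os.environ.pop(k, None)
                else:
                    os.environ[k] = v
```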

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129257
Approved by: https://github.com/oulgen
2024-06-26 18:34:48 +00:00
79959d707c [Inductor][ROCm] Composable Kernel backend for Inductor (#125453)
This PR adds an alternative backend for Inductor, adding Composable Kernel Universal GEMM instances to the autotune instance selection.

The implementation is heavily influenced by the series of PRs which adds the CUTLASS backend (https://github.com/pytorch/pytorch/issues/106991). The main differences are
 (1) customizing the compiler for the ROCm platform, and
 (2) customizing template code generation for Composable Kernel Universal GEMM instances.

We provide config tuning knobs to balance the compilation time of instance sources against finding the best instance.

### Testing
Install the ck library
```
pip install git+https://github.com/rocm/composable_kernel@develop
```
Run the test
```
TORCH_LOGS=+torch._inductor \
pytest --capture=tee-sys test/inductor/test_ck_backend.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125453
Approved by: https://github.com/eellison, https://github.com/jansel
2024-06-25 20:54:14 +00:00
ad76da6c16 Revert "[inductor] Fix TORCHINDUCTOR_FORCE_DISABLE_CACHES (#129257)"
This reverts commit 7b57ddd38c6d502ba313c0e6b0c92b6787d69986.

Reverted https://github.com/pytorch/pytorch/pull/129257 on behalf of https://github.com/clee2000 due to one of the PRs in the stack seems to have broken test/distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_bucketing_concat_op on distributed https://github.com/pytorch/pytorch/actions/runs/9653941844/job/26627760340 4c1e4c5f30, not tested on this PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/129257#issuecomment-2189444171))
2024-06-25 16:48:32 +00:00
e6bfa2958b Add aten._unsafe_masked_index (#116491)
Adds `aten._unsafe_masked_index` to generate masked indexing operations that produce masked loads in Triton code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116491
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-06-25 02:45:02 +00:00
7b57ddd38c [inductor] Fix TORCHINDUCTOR_FORCE_DISABLE_CACHES (#129257)
Summary: See https://github.com/pytorch/pytorch/issues/129159; this option wasn't doing its job for a few reasons. In this PR:
* Fix the with_fresh_cache_if_config() decorator
* Reset the "TORCHINDUCTOR_CACHE_DIR" & "TRITON_CACHE_DIR" env vars in the sub-process so they pick up changes made in the parent process

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129257
Approved by: https://github.com/oulgen
2024-06-24 23:39:43 +00:00
fbca70718f Fix scatter lowering when src is a Number (#129096)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129096
Approved by: https://github.com/peterbell10
2024-06-24 22:19:39 +00:00
e84cf805d2 Revert "Modularize aten parameter parser and checker (#125308)"
This reverts commit 60bbdc0b40656cf70b2b098c7d715e19f031fb0d.

Reverted https://github.com/pytorch/pytorch/pull/125308 on behalf of https://github.com/fbgheith due to test failures when run by meta ([comment](https://github.com/pytorch/pytorch/pull/125308#issuecomment-2181327211))
2024-06-20 18:52:05 +00:00
254487f288 Revert "Separate AOTI Eager utils as a single file (#125819)"
This reverts commit 18634048a1f939a961b7c96b0acfe78b474c821e.

Reverted https://github.com/pytorch/pytorch/pull/125819 on behalf of https://github.com/fbgheith due to test failures when run by meta ([comment](https://github.com/pytorch/pytorch/pull/125819#issuecomment-2181317332))
2024-06-20 18:49:08 +00:00
571a0db132 [inductor] Fix logging for run_and_get_cpp_code (#128794)
Summary: Found during testing with remote caching: Use the same output logger object between graph.py and codecache.py since it's patched in `run_and_get_cpp_code`. That allows us to capture any logging produced from the codecache path when using `run_and_get_cpp_code`. I'm also fixing a few tests that were passing mistakenly because logging was missing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128794
Approved by: https://github.com/oulgen, https://github.com/leslie-fang-intel
2024-06-19 21:32:34 +00:00
18634048a1 Separate AOTI Eager utils as a single file (#125819)
The key change is code movement: we moved the AOTI eager-related code from `torch._inductor.utils` to `torch._inductor.aoti_eager`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125819
Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/desertfire
ghstack dependencies: #125308
2024-06-15 13:42:49 +00:00
60bbdc0b40 Modularize aten parameter parser and checker (#125308)
In this PR, we abstracted the different types of aten operation parameters as `ParameterMetadata`. This structure is intended to represent and store the metadata of each aten operation parameter. Currently, it only supports `Tensor`, `TensorList`, and `Scalar`.

```C++
using ParameterMetadataValue = std::variant<TensorMetadata, std::vector<TensorMetadata>, c10::Scalar>;
```

With this PR, we can extend support to other parameter types in a more modular way, like `string`, `int`, `double`, and the other types summarized in the following list. The list is collected from all aten operations and ordered by frequency of use.

- `Tensor`
- `bool`
- `int64_t`
- `TensorList`
- `Scalar`
- `c10::SymIntArrayRef`
- `::std::optional<Tensor>`
- `IntArrayRef`
- `double`
- `c10::SymInt`
- `::std::optional<ScalarType>`
- `::std::optional<double>`
- `::std::optional<bool>`
- `::std::optional<Layout>`
- `::std::optional<Device>`
- `::std::optional<int64_t>`
- `Dimname`
- `::std::optional<Generator>`
- `c10::string_view`
- `::std::optional<c10::string_view>`
- `OptionalIntArrayRef`
- `::std::optional<Scalar>`
- `OptionalSymIntArrayRef`
- `::std::optional<MemoryFormat>`
- `::std::optional<c10::SymInt>`
- `ScalarType`
- `ArrayRef<Scalar>`
- `DimnameList`
- `::std::optional<ArrayRef<double>>`
- `::std::array<bool,3>`
- `::std::optional<DimnameList>`
- `c10::List<::std::optional<Tensor>>`
- `::std::array<bool,2>`
- `Storage`
- `::std::array<bool,4>`
- `Device`
- `DeviceIndex`
- `ITensorListRef`
- `Stream`
- `Layout`
- `MemoryFormat`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125308
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-15 09:18:44 +00:00
c187593418 Prevent expansion of cat indexing to avoid int64 intermediate (#127815)
Fix for https://github.com/pytorch/pytorch/issues/127652

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127815
Approved by: https://github.com/shunting314, https://github.com/peterbell10
2024-06-14 15:42:08 +00:00
17b45e905a Fix get output code when caching is enabled (#128445)
Summary: Improve output code retrieval mechanism so that it works in the presence of cache hits.

Test Plan: ci

Differential Revision: D58429602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128445
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/masnesral
2024-06-13 16:00:30 +00:00
c897651392 [inductor] Add BackendFeature gating (#128266)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128266
Approved by: https://github.com/shunting314
2024-06-13 07:31:51 +00:00
cc231a8e2b First version of AOTAutogradCache (#126791)
This PR implements "V0" of AOTAutogradCache. Given an input to AOTAutograd, we calculate a cache key, then save an AOTAutogradCacheEntry.
Each AOTAutogradCacheEntry has:
- A CompiledForward and optionally a CompiledBackward
- A bunch of metadata.

CompiledForward and CompiledBackward each save the *key* to the FXGraphCache associated with the compiled object. FXGraphCache populates this key field as long as it's able to return a compiled graph given a set of inputs. We then load the same object from the FXGraphCache on an AOTAutogradCache hit.
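
A hedged sketch of the shape of a cache entry (illustrative names, not the actual classes):

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class CompiledGraphRef:
    # Only the FXGraphCache *key* is stored; the compiled graph itself is
    # re-loaded from FXGraphCache on an AOTAutogradCache hit.
    fx_graph_cache_key: str

@dataclass
class AOTAutogradCacheEntrySketch:
    compiled_forward: CompiledGraphRef
    compiled_backward: Optional[CompiledGraphRef]
    # Metadata needed to re-wrap the loaded callables with the post-compile
    # wrappers on a cache hit.
    runtime_metadata: Dict[str, Any]
```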

On cache miss:
- Run AOTAutograd, up to AOTAutogradDispatch.post_compile.
- Save an AOTAutogradCacheEntry to the cache after compiling the necessary portions and receiving a cache key from FXGraphCache. In this PR we *always* compile the backward ahead of time. The PR above this one implements lazy backward caching, so that we only save to the cache after compiling the backward in a lazy-backward scenario.
- Return the resulting object

On cache hit:
- Run AOTAutogradCacheEntry.post_compile() on the cache key.
- This attempts to load the forward and backward graphs from FXGraphCache
- As long as we successfully load from FXGraphCache, it's a hit. We then rewrap the callable with post compile wrappers using our saved metadata.

For now, we ignore the fakified out and debug wrappers. We only save to the cache if Fakified out is turned off.

V0 Guards behavior:
FXGraphCache serializes guards that are needed in the shape_env based on the symint inputs to the graph. The invariant AOTAutograd relies on here is that the sources of the symints given to it by dynamo are exactly the same as the ones it passes to inductor, for both the forward and backward passes. (This does *not* mean that the tensor values passed in are the same: only that their symints are.) That is, AOTAutograd and Inductor never create new guards based on symints with *different sources* than those passed in by dynamo.

We don't currently store any AOTAutograd specific guards: my hypothesis is that FXGraphCache already stores these, as any guards generated by AOTAutograd should already be in the shape_env before calling into inductor, and we don't generate new guards post inductor. If this is needed, I'll add it in another diff.

Testing:
We'll start with some basic unit tests, but I'll be adding more and more complicated testing as the next step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126791
Approved by: https://github.com/bdhirsh
2024-06-12 20:04:44 +00:00
f2dcbe89d6 Revert "Prevent expansion of cat indexing to avoid int64 intermediate (#127815)"
This reverts commit 793df7b7cb1473004837f5867f4c1c4b2b0f751d.

Reverted https://github.com/pytorch/pytorch/pull/127815 on behalf of https://github.com/clee2000 due to the newly added test is failing internally D58444153.  Test exists in opensource and passed in OSS CI, maybe env difference? ([comment](https://github.com/pytorch/pytorch/pull/127815#issuecomment-2163421968))
2024-06-12 16:09:22 +00:00