Introduces a variant of size-hint multi-kernel: for novel runtime shapes, instead of performing full benchmarking to determine the optimal kernel, it selects one of many kernels pre-generated from multi-kernel hints, based on the similarity between the hint and the runtime input/output shapes (L1 distance in log2 space); a sketch of the selection heuristic follows the caveats below.
Some caveats/changes:
- Size-hint multi-kernel now only kicks in if the kernel has dynamic shapes
- Pre-generation still only does a 1-d search over the specified hints, e.g. `matmul([s0, s1], [s1, s2])` with size-hints `[64, 256]` only generates 2 kernels, based on tuning shapes ([64, 64], [64, 64]) and ([256, 256], [256, 256]). Extending this to a reasonable n-d search (via a user API?) is left as future work
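A minimal sketch of the selection heuristic described above (hypothetical helper names, not the actual Inductor code): flatten the hint and runtime input/output shapes, then pick the pre-generated kernel whose hint shape has the smallest L1 distance in log2 space.

```python
import math

def shape_distance(hint_shape, runtime_shape):
    # L1 distance between shapes in log2 space, so 64 vs 128 is as "far"
    # as 2048 vs 4096 (one doubling apart in each case).
    return sum(
        abs(math.log2(max(h, 1)) - math.log2(max(r, 1)))
        for h, r in zip(hint_shape, runtime_shape)
    )

def select_kernel(pregenerated, runtime_shape):
    # pregenerated: list of (hint_shape, kernel) pairs built at compile time
    # from the size hints; pick the kernel whose hint shape is nearest.
    hint, kernel = min(pregenerated, key=lambda hk: shape_distance(hk[0], runtime_shape))
    return kernel
```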
Benchmarking results, compared to multi-kernel with full benchmarking (hints 64, 4096) and to compiling with the ground-truth hint:
![Benchmark results](https://github.com/user-attachments/assets/056cca48-c16a-4451-9b4a-fa13a7a058a9)
Full benchmarking doing worse is extremely weird, but we did see similar spikes in #156628
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163090
Approved by: https://github.com/bobrenjc93
Summary: With autotune_at_compile_time enabled, AOTI can now perform CUDA codegen in one pass. CUDA kernel-related code is generated in a deferred way, after autotuning is done. This one-pass implementation eliminates the issues caused by disparities between passes in the previous two-pass implementation (which led to multiple bug reports in the past). It also avoids cloning mutated inputs, which the two-pass implementation required, reducing GPU memory consumption.
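As a rough illustration of the deferred, one-pass approach (hypothetical names, not the actual AOTI implementation): kernel launch lines are recorded as deferred objects and only rendered once compile-time autotuning has produced the final kernel configs.

```python
class DeferredKernelLine:
    """A wrapper-code line whose final text depends on compile-time autotuning results."""

    def __init__(self, kernel_name, get_tuned_config):
        self.kernel_name = kernel_name
        self.get_tuned_config = get_tuned_config  # resolved only after autotuning

    def render(self):
        cfg = self.get_tuned_config(self.kernel_name)
        return f'launchKernel("{self.kernel_name}", grid={cfg["grid"]}, cubin="{cfg["cubin"]}");'

def emit_wrapper(lines):
    # One codegen pass: ordinary strings are emitted as-is; deferred lines are
    # rendered at the end, after autotuning has fixed the kernel configs.
    return "\n".join(
        line.render() if isinstance(line, DeferredKernelLine) else line
        for line in lines
    )
```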
Differential Revision: [D66739414](https://our.internmc.facebook.com/intern/diff/D66739414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141980
Approved by: https://github.com/chenyang78
Persistent kernels are sometimes able to remove intermediate buffers that would
otherwise be needed for the non-persistent reduction kernel. This makes
multi-kernel's codegen more complicated, as it needs to drop these extra
arguments at runtime after selecting the correct kernel to run.
Instead, this PR updates the persistent kernel's `must_keep_buffers` so these
buffers aren't dropped during codegen and both kernels have the same signature.
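A toy illustration of why matching signatures helps (hypothetical stand-ins, not generated Triton code): when both variants accept the same argument list, the multi-kernel call site is a plain dispatch with no per-kernel argument filtering.

```python
import torch

# Toy stand-ins for the two generated reduction variants (hypothetical).
def persistent_reduction_kernel(x, out):
    out.copy_(x.sum(dim=-1))

def reduction_kernel(x, out):
    out.copy_(x.sum(dim=-1))

def multi_kernel_call(use_persistent, x, out):
    # Because must_keep_buffers keeps the argument lists identical, the
    # multi-kernel wrapper can forward the same arguments to either variant
    # without dropping any buffers at runtime.
    kernel = persistent_reduction_kernel if use_persistent else reduction_kernel
    kernel(x, out)
```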
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127724
Approved by: https://github.com/shunting314
ghstack dependencies: #131044
This fixes a couple of errors that come up when multi-kernel is used with
split-scan.
1. The split-scan was being marked as a persistent kernel, which allowed
   a multi-kernel to be created, but this isn't supported. The fix is to
   never mark split-scan as persistent.
2. Benchmark codegen was not handling WorkspaceArg, and would raise a
KeyError during codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131044
Approved by: https://github.com/shunting314
Summary: Modify fresh_inductor_cache() to clear cached state before mocking the toplevel cache_dir directory. Any lru_cache (or other in-memory cache) can use the @clear_on_fresh_inductor_cache decorator to register itself for clearing. Also change the base Inductor TestCase class to use fresh_inductor_cache(); previously that TestCase only mocked the subdirectory within the toplevel cache dir designated for the FX graph cache artifacts.
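A usage sketch based on this description (exact import locations may differ across versions):

```python
import functools
import torch
from torch._inductor.utils import clear_on_fresh_inductor_cache, fresh_inductor_cache

# Register an in-memory cache so fresh_inductor_cache() clears it along with
# mocking a fresh toplevel cache_dir.
@clear_on_fresh_inductor_cache
@functools.lru_cache(None)
def expensive_lookup(key):
    return key * 2

with fresh_inductor_cache():
    # Compilation here uses a temporary Inductor cache directory and starts
    # from cleared, registered in-memory caches.
    fn = torch.compile(lambda x: (x * 2).sum())
    fn(torch.randn(8))
```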
Test Plan:
- New unit test
- All existing inductor tests will exercise fresh_inductor_cache()
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122661
Approved by: https://github.com/oulgen
Make multi-kernel work with cpp-wrapper. Multi-kernel generates two equivalent variants for a reduction, and at runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so initially the two features did not work together.
Thanks to Jason for suggesting a neat way to integrate the two. cpp-wrapper currently does two codegen passes. In the first pass, we still generate multi-kernel code and run it; in the second pass, we load the cubin file for the faster kernel directly. The multi-kernel Python code is not generated for the second pass since it is no longer needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813
Approved by: https://github.com/jansel
For a persistent reduction, we generate two 'equivalent' flavors of kernel at the same time:
- persistent reduction
- regular reduction
A MultiKernel wraps these 2 kernels and picks the one with better performance at runtime.
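As a rough sketch of the runtime selection (illustrative only; the actual MultiKernel is generated by Inductor and handled in the wrapper code):

```python
import torch

class MultiKernel:
    """Wraps two equivalent kernels; benchmarks them once, then always calls the winner."""

    def __init__(self, persistent_kernel, regular_kernel):
        self.kernels = [persistent_kernel, regular_kernel]
        self.picked = None

    def __call__(self, *args):
        if self.picked is None:
            timings = []
            for kernel in self.kernels:
                # Rough CUDA-event timing; the real implementation uses
                # Inductor's benchmarking utilities and caches the choice.
                start = torch.cuda.Event(enable_timing=True)
                end = torch.cuda.Event(enable_timing=True)
                start.record()
                kernel(*args)
                end.record()
                torch.cuda.synchronize()
                timings.append(start.elapsed_time(end))
            self.picked = self.kernels[timings.index(min(timings))]
        return self.picked(*args)
```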
Some more implementation details:
- Inductor maintains state for generating kernels, e.g. the wrapper code. After we generate code for one kernel, we need to restore the Inductor state before we can generate the counterpart.
***There is one thing I'd like comments from others on***:
There is one tricky thing about kernel arguments. In general, Inductor removes a buffer from the argument list if it's only used inside the kernel. But a buffer removed by the persistent reduction kernel may still be kept by the regular (non-persistent) reduction kernel because of a CSE invalidation rule. My current implementation avoids removing buffers when multi_kernel is enabled, which makes sure both reduction flavors have a consistent argument list. Another idea is to generate the multi-kernel definition with the union of the arguments from both sub-kernels and let each sub-kernel pick the subset it wants, but that would make the codegen for multi-kernel much more complex.
I'm not sure if there is some easy and clean way to resolve this.
Testing command:
```
TORCHINDUCTOR_MULTI_KERNEL=1 TORCH_LOGS=+torch._inductor.graph TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only BertForMaskedLM --training
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103469
Approved by: https://github.com/jansel