With this PR, we can turn on the inductor UTs on Windows CPU.
Changes:
1. Turn on inductor UTs on Windows CPU.
2. Add a shard to balance the added UTs; otherwise the job would time out.
3. Fixed `test_invalid_artifact_flag_error_msg`.
4. Skipped `test_distributed_rank_logging` and `test_disable_recursive_false`.
5. Skipped the whole UT file `test_cpu_select_algorithm.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160161
Approved by: https://github.com/jansel
**Summary**
Add a configuration option to enable a smaller dequantization buffer for WOQ INT4 CPP GEMM template. This can improve the performance of the WOQ INT4 GEMM template in cases where M is small. In such scenarios, matrix B cannot be effectively reused across matrix A, and we found that reducing the Kc block size can lead to better performance.
**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_with_small_buffer_config
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156395
Approved by: https://github.com/jansel
ghstack dependencies: #156407, #156387
### Summary
int8 WoQ GEMM concat-linear optimization for the case where the same activation is applied to 3 sets of weights of the same shape.
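A minimal sketch of the pattern this optimization targets (shapes and module are illustrative; in GPT-J these are the Q/K/V projections that share one activation):
```
import torch

class QKV(torch.nn.Module):
    def __init__(self, hidden: int = 4096):
        super().__init__()
        # 3 linear layers with same-shape (int8-quantized) weights, all fed the same x
        self.q = torch.nn.Linear(hidden, hidden, bias=False)
        self.k = torch.nn.Linear(hidden, hidden, bias=False)
        self.v = torch.nn.Linear(hidden, hidden, bias=False)

    def forward(self, x):
        # concat linear computes these 3 GEMMs over the same activation as one GEMM
        return self.q(x), self.k(x), self.v(x)
```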
### Perf data
GPT-J 128 input tokens, 128 output tokens.
32 physical cores of one socket of Intel(R) Xeon(R) 6972P (Xeon Gen 5). tcmalloc & Intel OpenMP were preloaded.
| First token latency with May 8 nightly | First token latency with this implementation | Rest token latency with May 8 nightly | Rest token latency with this implementation combined with #149373 |
|---|---|---|---|
|202 ms | 190 ms | 33 ms | 30 ms|
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153004
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/jansel
Co-authored-by: Anthony Shoumikhin <anthony@shoumikh.in>
**Summary**
There is an accuracy issue when `Nc_block` is greater than 1 in WOQ int4 GEMM. Previously, we used the slice `{%- set tile_W = kernel.slice_nd(W, [("n_start", "n_start + n_size"), ("k_start * Nr / 2", "k_end * Nr / 2")]) %}`, which means that each `ni` in `Nc_block` takes the exact same N slice from `n_start` to `n_start + n_size`, leading to the accuracy problem. The issue was exposed by [PR #156174](https://github.com/pytorch/pytorch/pull/156174), which changes `block_N` from 64 to 32 and thereby increases the likelihood of `Nc_block` being greater than 1, making the issue more likely to trigger. This PR fixes the accuracy issue.
**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx_Nc_larger_than_one
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156407
Approved by: https://github.com/CaoE
**Summary**
GEMM templates for INT4 weights are used for lowering `aten._weight_int4pack_mm_for_cpu` with Inductor when max-autotune is on. Currently, AMX-based microkernels are used only when M >= 16 if the input tensor has shape [M, K]. However, we find that the AMX kernel brings a performance benefit even when 4 < M < 16. For example, on a 6th-gen Intel(R) Xeon(R) platform, E2E latency can be improved by more than 20% when running Llama-3.1-8B on 32 cores with M = 8. So, this PR changes the threshold so that AMX is used when M > 4.
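A sketch of the dispatch rule this PR changes (illustrative only; the real selection happens inside Inductor's GEMM template choices):
```
def pick_int4_microgemm(M: int, amx_available: bool) -> str:
    # previously the AMX micro-kernel was only used when M >= 16
    if amx_available and M > 4:
        return "amx"
    return "avx512"
```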
**Test plan**
```
pytest test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155444
Approved by: https://github.com/sanchitintel, https://github.com/leslie-fang-intel
### Summary
Fixes #148494
Explicitly prefetch the cache lines of the next `B` block to accelerate int8 WoQ (BF16 activation, int8 statically quantized weights) GEMM for small `M` dimension.
Some of this code (the outer loops of the GEMM) is being ported over from Intel Extension for PyTorch. The macro-kernel* and the micro-kernel* are essentially the same, except that they optionally prefetch a block of B. Templatization is used so that the no-prefetch path does not pay for branching or unnecessary prefetching.
\* - in [BLIS](https://dl.acm.org/doi/10.1145/2764454) parlance
### Performance data with BS 1
Machine: 32 cores of one socket of an Intel Xeon SP Gen 5 machine
| Model | input tokens | output tokens | next-token latency before this PR | Next-token latency after this change | Speedup |
|-----------|-------------|-----------------|--------------------------------------|------------------------------------------|-----------|
|GPT-J | 128 | 128 | 42 ms | 38 ms | 9.52 % |
| GPT-J | 1024 | 1024 | 48 ms | 45 ms | 6.25 % |
|LLaMA 3.1 8B Instruct | 128 | 128 | 52 ms | 47 ms| 9.61% |
|LLaMA 3.1 8B Instruct | 1024 | 1024 | 57 ms | 53 ms| 7.01% |
While the input shapes of the GEMMs corresponding to the linear layers for next-token computation remain the same across different numbers of input & output tokens, the difference in next-token latency between these cases comes from attention.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149373
Approved by: https://github.com/leslie-fang-intel, https://github.com/Xia-Weiwen
Co-authored-by: Xia Weiwen <xia.weiwen@hotmail.com>
Without MKL there is only 1 epilogue, not 2, because `addmm` is used instead of `packed_linear/_mkl_linear`.
This fails first at `TestSelectAlgorithmCPU.test_linear_with_in_out_buffer_batch_size_8_in_features_3_in_features2_192_image_size_224_out_features_64_bias_True_cpu_float32`
Instead of skipping the whole test, just adjust the count for the single check.
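A sketch of the adjusted check; the counter name below is an assumption about what the test asserts on, and only `TEST_MKL` is a known helper:
```
from torch.testing._internal.common_utils import TEST_MKL

# With MKL, packed_linear/_mkl_linear fuses 2 epilogues; without MKL, addmm is
# used and only 1 epilogue is fused.
expected_epilogues = 2 if TEST_MKL else 1
# e.g. assertEqual(counters["inductor"]["cpp_epilogue_fusion_counter"], expected_epilogues)
```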
Final numbers of `test/inductor/test_cpu_select_algorithm.py` without MKL:
```
Ran 1337 tests
OK (skipped=1211)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151548
Approved by: https://github.com/jansel
**Summary**
It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU.
This PR adds AMX-based GEMM templates for `torch.ops.aten._weight_int4pack_mm_for_cpu`. It brings performance benefits on platforms where AMX is available.
**Validation results**
We have run GPT-J-6B and Llama-3-8B-Instruct on a 6th-gen Xeon with 96 cores. Results show that the AMX-based microkernel outperforms the AVX512-based one by >5x for the prefill stage with 1024 input length.
**Test plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150603
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
### Summary
When the block-size for `N` dimension is `48` for the AMX GEMM micro-kernel for int8 WoQ (BF16 activation, int8 statically quantized weights), the logic for handling the tail is incorrect - we can't always dequantize 32 elements of weights at a time because we may need to dequantize `32` followed by `16` when `block_n` is `48` (for each `K`).
This PR fixes that logic, which was initially exposed with `M=17, N=1024, K=1024`.
This PR also fixes the case of `block_n` being 16.
I had introduced [this bug](ca9813ea14) after misreading the GEMM blockings as `["block_m", "block_k", "block_n"]` instead of `["block_m", "block_n", "block_k"]` (so I had wrongly assumed that `block_n` was always 32).
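An illustrative helper for the tail handling described above (not the kernel code itself): with a 32-element dequantization width, `block_n = 48` must be processed as 32 + 16 columns, and `block_n = 16` as a single 16-wide chunk.
```
def dequant_col_chunks(block_n: int, vec_width: int = 32):
    # split block_n columns into dequantization chunks no wider than vec_width
    chunks, remaining = [], block_n
    while remaining > 0:
        step = min(vec_width, remaining)
        chunks.append(step)
        remaining -= step
    return chunks

assert dequant_col_chunks(48) == [32, 16]
assert dequant_col_chunks(32) == [32]
assert dequant_col_chunks(16) == [16]
```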
### Future work
While this PR simply fixes a bug, it's possible to optimize the code pertaining to dequantizing & caching the B buffer - for `block_n` being `16` or `48`, `K` would always be a multiple of 2, so `K * block_n` will always be a multiple of 32. Since `dequantized_B_buf` stores rows contiguously, when `block_n` would be `16` or `48`, we could store 32 BF16 elements at a time instead of storing `16` at a time (when `block_n` is 16), or `32` followed by `16` at a time (when `block_n` is 48). Such an optimization would lower `register -> memory` data movements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149359
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
For int8 dynamically quantized activation & int8 quantized weights, add a workaround for an indexing issue in the epilogue creator that expected an empty index (i.e., a 0D tensor) when the activation scale was sized [1, 1], by converting that scale into a 0D tensor.
The issue was discovered while running LLaMA2 quantized with torchao's `int8_dynamic_activation_int8_weight` quantization on CPU with max-autotune enabled (although this error would've occurred regardless).
The final hidden-states tensor, which is the activation to the LM head, has shape `[batch_size, sequence_length, hidden_dim]` during decoding. When decoding one token at a time with batch size 1, sequence length is 1. The activation scale is shaped `[1, 1]` (reshaped from `[1, 1, 1]`). However, the Inductor epilogue creator expects a 0D tensor in this case (my guess is that the corresponding logic in Inductor expects a 0D tensor whenever a tensor has only one element, even if it's 1D).
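A sketch of the workaround (variable names are illustrative): a `[1, 1]` activation scale is squeezed to a 0D tensor so the epilogue creator sees the empty index it expects.
```
import torch

act_scale = torch.tensor([[0.02]])        # shape [1, 1], as produced during decoding
if act_scale.numel() == 1 and act_scale.dim() > 0:
    act_scale = act_scale.reshape(())     # now a 0D scalar tensor
```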
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147033
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel
**Summary**
It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU.
This PR adds GEMM templates for `torch.ops.aten._weight_int4pack_mm_for_cpu`. The micro-kernel used for the templates is based on AVX512 and is a copy of the ATen implementation of `torch.ops.aten._weight_int4pack_mm_for_cpu` with minor changes.
Thanks to better blocking and loop scheduling, the GEMM-template-based implementation outperforms the ATen implementation in all cases we tested.
**Test plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_avx512
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146756
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
#146843 broke int8 WoQ GEMM's (for BF16 activation) AMX ISA implementation in the main branch.
UT: `python test/inductor/test_cpu_select_algorithm.py -v -k woq`
The issue remained undetected because, in case of a templated-kernel compilation failure, the auto-tuning infra marks its runtime as `inf` and falls back to the op it was being benchmarked against, so UTs didn't fail even on machines that support the AMX ISA.
`test/inductor/test_cpu_select_algorithm.py` UTs checked the value of the `select_algorithm_autotune` counter, which only counts how many ops were selected for autotuning against their templated codegened counterparts.
@leslie-fang-intel advised using a new counter. I added `counters["inductor"]["cpp_templated_kernel_counter"]`, which is incremented after a codegened kernel's compilation, so it'd help catch breakage scenarios in which a templated kernel could not be codegened due to a compilation failure.
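A rough sketch of the kind of check the new counter enables; the toy model and shapes are illustrative, and the counter names are the ones mentioned above:
```
import torch
import torch._inductor.config as inductor_config
from torch._dynamo.utils import counters

m = torch.nn.Linear(64, 64).eval()
x = torch.randn(16, 64)
counters.clear()
with torch.no_grad(), inductor_config.patch(max_autotune=True):
    torch.compile(m)(x)
# Old signal: how many ops were selected for autotuning against a template.
autotuned = counters["inductor"]["select_algorithm_autotune"]
# New signal: how many templated kernels actually compiled, so a compilation
# failure (runtime marked as inf during autotuning) no longer goes unnoticed.
compiled_templates = counters["inductor"]["cpp_templated_kernel_counter"]
```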
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147895
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
Currently, the bfloat16 microkernel that uses AMX vectorization requires the weights to be in an interleaved VNNI format. For GEMM code this hasn't been an issue, because GEMM currently only supports constant weights, so the VNNI weight packing is done at compile time and saved as a constant tensor in the graph. But for BMM ops, where weights are not required to be constant, the current code does an expensive reshape/VNNI packing for all BMM weights.
This PR removes the need for the reshape/packing for non-constant inputs by moving VNNI packing inside the AMX microkernel. A new `K * block_n` buffer is used to store the temporary packed weights. Weight packing involves interleaving 2 rows of weights.
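An illustrative PyTorch reference (not the micro-kernel itself) for the VNNI-2 interleaving described above: pairs of K rows are interleaved so that, for each column, two consecutive K elements end up adjacent in the packed buffer.
```
import torch

def vnni2_pack(w: torch.Tensor) -> torch.Tensor:
    # w: [K, N] bf16 weights, K must be even
    K, N = w.shape
    assert K % 2 == 0
    return w.reshape(K // 2, 2, N).transpose(1, 2).reshape(K // 2, 2 * N)
```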
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146843
Approved by: https://github.com/jgong5, https://github.com/sanchitintel, https://github.com/leslie-fang-intel, https://github.com/jansel
**Summary**
This is a regression issue caused by a change in the FX node name. In commit 71010bf0972834e35a155e6a187e5c6649a5a36b, both the node name and target for the `get_attr` node in `V.graph.graph.nodes` were `_frozen_param2`. However, in the latest main, the node name has changed to `_reorder_linear_weight`. This PR fixes the regression by using the node's target instead of its name.
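A sketch of the fix on a plain FX graph (the attribute name is the one from the example above): look the `get_attr` node up by its stable `target` rather than by its display `name`.
```
import torch

def find_get_attr_by_target(graph: torch.fx.Graph, target: str):
    for node in graph.nodes:
        # node.name may be rewritten (e.g. to _reorder_linear_weight),
        # but node.target still points at the frozen parameter attribute
        if node.op == "get_attr" and node.target == target:
            return node
    return None

# e.g. find_get_attr_by_target(V.graph.graph, "_frozen_param2")
```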
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_cpp_weight_prune
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147056
Approved by: https://github.com/jgong5
**Summary**
Issue found when fixing https://github.com/pytorch/ao/issues/1662. A FP32 GEMM with an epilogue node `to_fp16` resulted in [generated code](https://gist.github.com/leslie-fang-intel/464fb112abdb105818ae09b057350e84), which failed to compile. The root cause is that we used the slice of global buffer `Y` as the output of micro GEMM instead of a `local buffer`. However, due to the `to_fp16` epilogue node, the global buffer `Y` has a float16 data type, leading to the failure. This fix will ensure the use of a local buffer in such cases.
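A minimal sketch of the failing pattern (module and shapes are illustrative; the actual repro is the generated code linked above and the UT below): an FP32 linear whose output is cast to fp16, so the `to_fp16` node becomes the GEMM template's epilogue.
```
import torch
import torch._inductor.config as inductor_config

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(128, 64)

    def forward(self, x):
        # fp16 cast epilogue on an fp32 GEMM output
        return self.linear(x).to(torch.float16)

with torch.no_grad(), inductor_config.patch(max_autotune=True):
    m = M().eval()
    x = torch.randn(4, 128)
    torch.compile(m)(x)
```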
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_linear_to_lowp_fp
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146958
Approved by: https://github.com/jgong5
Summary:
This commit fixes a crash in the gemm template lowering caused by hitting an [assert](fd515e4f59/torch/_inductor/codegen/common.py (L1181)) that a buffer was previously removed.
The assert triggers because in the first gemm lowering we use a local accumulation buffer, which causes the original buffer name to be added to the `removed_buffers` set. Then in the next gemm lowering we use the global buffer for accumulation, but that buffer name is already in the `removed_buffers` set.
The fix is to add a unique suffix to the buffer name to avoid triggering the assert from different gemm lowerings.
Differential Revision: D68814625
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146353
Approved by: https://github.com/leslie-fang-intel, https://github.com/frost-intel, https://github.com/hl475
Summary:
The bmm template generates code like this
```
template<bool accum>
void cpp_fused_bmm_66_micro_gemm(...) {
...
}
void single_thread_mm() {
...
cpp_fused_bmm_66_micro_gemm(...)
...
}
void threaded_mm() {
...
cpp_fused_bmm_66_micro_gemm(...)
...
}
void cpp_fused_bmm_66(...)
{
...
single_thread_mm(...);
...
threaded_mm(...);
...
}
```
The generated `fused_bmm` and `fused_bmm_micro_gemm` functions both have unique identifiers added to their names, but `single_thread_mm` and `threaded_mm` do not.
This diff adds unique identifiers to those generated functions as well. The identifier is based on the kernel name. So for the example above we would generate a bmm template name like `cpp_fused_bmm_66_single_thread_mm()`.
Differential Revision: D68364772
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145303
Approved by: https://github.com/leslie-fang-intel, https://github.com/frost-intel, https://github.com/hl475
## Summary
Templated `int8xint8->int32` GEMM that uses the AMX ISA (present on Intel Xeon Gen 4 & above). Any epilogues such as weight scale, activation scale, and bias are applied per output block in a fused manner.
Performs well for large values of `M` dimension (assuming canonical dimensions [`M, K`] and [`K, N`] for the activation & weight matrices'/tensors' sizes) when the activation is quantized per-token.
Also supports SmoothQuant GEMM pattern when activation is quantized per-tensor (scalar scale) or per-token (vector scale is applied as an epilogue in this case).
Also increased coverage of the GEMM template for uint8-activation, int8-weight GEMM UTs for the case where the activation zero point is a 1D tensor (the existing implementation only accepted 0D tensors). However, some such UTs have to be explicitly enabled with the `max-autotune` Inductor config.
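A plain-PyTorch reference of the fused computation described above (not the template itself; names and shapes are illustrative):
```
import torch

def qgemm_reference(a_int8, w_int8, a_scale, w_scale, bias=None):
    # int8 x int8 -> int32 accumulation
    acc = torch.matmul(a_int8.to(torch.int32), w_int8.to(torch.int32).t())
    out = acc.to(torch.float32) * w_scale   # per-channel weight scale, shape [N]
    out = out * a_scale                     # scalar (per-tensor) or [M, 1] (per-token)
    if bias is not None:
        out = out + bias
    return out
```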
## Performance data
The templated codegened fused GEMM with M=32, K=4096, N=14336 used in LLaMA3 exhibits more than 2x perf-gain compared to oneDNN qlinear + mul (for activation's scale) with 48 cores of one socket of Xeon SP 4th gen Platinum 8468 when per-token quantization is used.
For M=1, K=4096, N=14336, regardless of whether per-tensor quantization was used for activation or per-token, the perf gain was more than 3x.
Intel OpenMP & libtcmalloc had been preloaded. All cores used by the workload corresponded to distinct physical cores.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143187
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel, https://github.com/jgong5
Co-authored-by: Leslie Fang <leslie.fang@intel.com>
**Summary**
In this PR, we enable epilogue fusion and code generation for Grouped GEMM. Here is a high-level description of how we implement it.
**Fusion**
- The Grouped GEMM Template produces a `Template Buffer` with a `MultiOutputLayout` and a set of `MultiOutput Buffers`, where each buffer corresponds to a specific GEMM.
- During the initial round of fusion, the `Template Buffer` and all associated `MultiOutput Buffers` are fused into a `FusedSchedulerNode` by extending the existing fusion design.
- In subsequent fusion rounds, this `FusedSchedulerNode` can further fuse with its epilogues, following the original fusion design principles.
**Code Gen**
We maintain a list of epilogues and codegen them one by one (a toy sketch follows the list below).
- If any of the GEMMs has a bias, we create an extra `bias_add` epilogue and prepend it to the front of the epilogue list.
- If any of the GEMMs has no epilogue, we create a `to_bf16` copy epilogue and append it to the end of the epilogue list.
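All names in this sketch are illustrative, and epilogues are represented as plain tuples rather than Inductor IR:
```
def assemble_epilogues(per_gemm_epilogues, biases):
    result = []
    for epilogues, bias in zip(per_gemm_epilogues, biases):
        epilogues = list(epilogues)
        if bias is not None:
            epilogues.insert(0, ("bias_add", bias))   # prepended to the front
        if not epilogues:
            epilogues.append(("to_bf16_copy", None))  # fallback copy epilogue
        result.append(epilogues)
    return result
```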
**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear_epilogue
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143897
Approved by: https://github.com/jansel, https://github.com/jgong5
ghstack dependencies: #143796
**Summary**
Enable the CPP Grouped GEMM Fusion, lowering and Grouped GEMM Template following the RFC: https://github.com/pytorch/pytorch/issues/144012
- Support a flexible number of GEMMs
- Share the activation across GEMMs
  - The Grouped GEMM Template supports independent activations
  - However, the pattern matcher requires an anchor node, which serves as the shared activation across GEMMs
- Each GEMM can have a unique weight, but all weights must have the same sizes
- Each GEMM can have a unique bias or None
  - The current PR does not yet support biases; this will be addressed in a follow-up epilogue fusion PR
- Each GEMM can have its own epilogues
  - Epilogue fusion is not yet supported in this PR and will be enabled in an upcoming follow-up epilogue fusion PR
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear_invalid
python -u -m pytest -s -v test/inductor/test_cpu_cpp_wrapper.py -k test_grouped_linear
```
**Example**
Here is the example and generated code
```
import torch

batch_size = 4
in_features = 512
out_features = 1024
dtype = torch.bfloat16

class M(torch.nn.Module):
    def __init__(self, bias):
        super().__init__()
        self.linear0 = torch.nn.Linear(in_features, out_features, bias=False)
        self.linear1 = torch.nn.Linear(in_features, out_features, bias=False)

    def forward(self, x):
        return self.linear0(x), self.linear1(x)

if __name__ == "__main__":
    with torch.no_grad():
        input = torch.randn(batch_size, in_features, dtype=dtype)
        m = M(bias=False).to(dtype=dtype).eval()
        cm = torch.compile(m)
        act_res = cm(input)
```
Generated Code: https://gist.github.com/leslie-fang-intel/ed2e8d23aeb3586eb504feeace692e16#file-grouped-gemm-generated-code-py
**Next Step**
- Support Epilogue fusion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143796
Approved by: https://github.com/jgong5, https://github.com/jansel
Fixes #143102
Addresses 2 problems related to dynamic batch size in the BMM autotuner:
1. With a dynamic batch size, the input size can be a sympy `Mul` expression such as `s0*8`, which occurs in many dynamo benchmark models. We address this by using `size_hints` to resolve such expressions (a toy sketch follows below). This is safe since this section of the code is only called to generate inputs for benchmarking.
2. Some epilogue nodes may use the dynamic batch size as part of codegen, for example when an input to the epilogue node is transposed and has the dynamic batch size in the strides of other dimensions. When such epilogue nodes exist and the sizevar is not already present in `kernel.args`, a new sizevar with a new name is created. Subsequent calls to `def_kernel` could overwrite this variable name, so to avoid this we pass all the sizevars as `extra_sizevars` to the `def_kernel` calls for the GEMM functions, so that no variable renaming happens later in the BMM definition.
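A toy illustration of item 1 (names are illustrative; the real code goes through Inductor's size hints for the free symbols):
```
import sympy

s0 = sympy.Symbol("s0")
dynamic_batch = s0 * 8                             # symbolic batch size from dynamo
concrete_batch = int(dynamic_batch.subs({s0: 4}))  # -> 32, using the size hint for s0
```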
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143141
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel, https://github.com/jgong5
# Summary
The AMX ISA based GEMM micro-kernel template for int8 weight-only quantization (BF16 activation, int8 weights) should cache dequantized weights (int8 -> int32 -> fp32 -> bf16) so that they would not have to be dequantized again in subsequent calls to the _inner-kernel_ that uses the same weights.
This change leverages the fact that even for BF16 x BF16 GEMM template, cache-blocking ensures that `Nr * Kc` weight elements are cached in L1D cache (more info [here](https://static.sched.com/hosted_files/pytorch2024/59/TorchInductor%20CPU%20Backend%20Advancements%20-%20New%20Features%20and%20Performance%20Improvements_20240915.pdf)). Here, `Nr` is the register blocking size for `N` dimension (at the granularity of the GEMM micro-kernel, it's currently also the cache blocking size for `N` dimension, although that may change in the future), and `Kc` is the cache blocking size for `K` dimension.
The figure below is from the document linked above -
<img width="476" alt="image" src="https://github.com/user-attachments/assets/e23e5476-d910-46d1-a9b3-cbf77de76d94">
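A reference sketch of the dequantization chain that gets cached per `Nr * Kc` block (names are illustrative; `scale` stands in for the per-channel weight scale):
```
import torch

def dequant_int8_block(w_int8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # int8 -> int32 -> fp32 -> bf16; the result is reused by subsequent
    # inner-kernel calls over the same weight block instead of being recomputed
    w_fp32 = w_int8.to(torch.int32).to(torch.float32) * scale
    return w_fp32.to(torch.bfloat16)
```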
## Performance data
Collected on 48 physical cores of one socket of Intel Xeon Platinum 8468H (Xeon SP 4th gen). Intel OpenMP & tcmalloc were preloaded.
|M | N | K | Latency with ATen _weight_int8pack_mm | Latency with codegened templated GEMM (current main branch) | Latency with codegened templated GEMM (this PR) |
|-----|-----|-----|------|----------|----|
|4096|4096|4096| 45.844 ms | 9.322 ms| 5.2181 ms |
|4096|11008|4096| 127.618 ms |24.6258 ms | 13.6046 ms|
|4096|4096|11008| 121.953 ms | 25.4692 ms | 10.2669 ms |
|4096|32000|4096| 478.450 ms| 75.3942 ms | 48.21 ms |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136688
Approved by: https://github.com/jgong5
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/135106. For AOTI, the Inductor IR of the weight is
```
ReinterpretView(
StorageBox(
ConstantBuffer(name='L__self___mlp_0_weight', layout=FixedLayout('cpu', torch.float32, size=[64, 128], stride=[128, 1]))
),
FixedLayout('cpu', torch.float32, size=[128, 64], stride=[1, 128]),
origins=OrderedSet([addmm])
)
```
In the post-processing step of the GEMM template, the weight that was used was the one from before the permutation, leading to correctness issues. In this PR, we address this by reshaping the weight to the expected size and stride before the weight prepacking.
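A toy illustration of the layout mismatch in the IR above (values are random; only shapes and strides matter): the stored constant is `[64, 128]` contiguous, while the GEMM consumes its transposed view of size `[128, 64]` with stride `[1, 128]`, so the weight is materialized in the expected layout before prepacking.
```
import torch

w_const = torch.randn(64, 128)       # FixedLayout size [64, 128], stride [128, 1]
w_view = w_const.t()                 # ReinterpretView size [128, 64], stride [1, 128]
w_for_prepack = w_view.contiguous()  # same values, now in the layout prepacking expects
```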
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_aot_inductor.py -k test_misc_1_max_autotune_True_non_abi_compatible_cpu
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_aoti_linear
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_aoti_linear_multi_view_operations
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136421
Approved by: https://github.com/jgong5, https://github.com/desertfire
Fixes the max-autotune failure of `soft_actor_critic` of Torchbench in FP32 single thread dynamic shape case:
```log
File "/home/user/inductor/pytorch/torch/_inductor/codegen/cpp_micro_gemm.py", line 136, in codegen_call
C_ptr = f"&({kernel.index(C, [0, 0])})"
File "/home/user/inductor/pytorch/torch/_inductor/codegen/cpp_template_kernel.py", line 135, in index
else self.args.input(node.get_name())
File "/home/user/inductor/pytorch/torch/_inductor/codegen/common.py", line 1251, in input
assert name not in V.graph.removed_buffers, name
AssertionError: buf_GemmOut
```
The 1st and 2nd linears do not need to use a local buffer, while the 3rd linear does.
The 3rd linear, which uses a local buffer, adds its global buffer (named `buf_GemmOut`) to `V.graph.removed_buffers`.
When scheduling the nodes, the 1st linear (which won't use a local buffer) gets its output buffer (also named `buf_GemmOut`) from the input, finds that it is in `V.graph.removed_buffers`, and raises the AssertionError. The issue is that the output buffers of all these linears are all named `buf_GemmOut`, which causes the conflict.
Rename these buffers by adding the name of the `template_buffer` as a prefix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136419
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #136418, #136518
Fixes the compilation error of max-autotune for `maml_omniglot` (AMP and FP32) and `soft_actor_critic` (AMP) in Torchbench for single-thread dynamic shapes case:
```
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp: In function ‘void kernel(const bfloat16*, const bfloat16*, const bfloat16*, bfloat16*, int64_t)’:
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp:279:41: error: the value of ‘Mr_blocks’ is not usable in a constant expression
279 | constexpr int64_t m_block_end = Mr_blocks;
| ^~~~~~~~~
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp:237:19: note: ‘Mr_blocks’ was not initialized with a constant expression
237 | const int64_t Mr_blocks = (M + Mr - 1) / Mr;
| ^~~~~~~~~
```
The PR also updates the UT to add a test with `BS`=512 in single thread.
The previous case had `BS`=1024, equal to the `K` and `N` values; the generated code did not have symbolic shapes and thus failed to capture the above issue.
With a case of `BS`=512, the generated code has a symbolic shape for the M dim and is able to reproduce the issue that this PR is addressing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136418
Approved by: https://github.com/jgong5