Description:
1. Quantize Linear Layer Weights to 4 Bits:
Quantize the weights of the Linear layer to 4 bits using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32. (An end-to-end sketch follows step 4 below.)
2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.
3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
The in_features and out_features arguments must match the Linear layer's corresponding attributes.
4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features)
```
Required inputs are the activation tensor, packed_weights, groupsize, in_features, and out_features.
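Below is a minimal end-to-end sketch of the flow above, assuming a group-wise scheme. The quantization helper and its exact int4 mapping and nibble order are illustrative assumptions; only the two aten ops shown in steps 3 and 4 are the APIs added here, with the signatures given above.
```python
import torch

# Hypothetical helper: group-wise symmetric 4-bit quantization of a Linear weight.
# The scale dtype and packed nibble layout expected by the aten op may differ.
def quantize_4bit_groupwise(weight: torch.Tensor, groupsize: int):
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // groupsize, groupsize)
    scales = w.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 7.0  # symmetric: absmax -> [-7, 7]
    q = torch.clamp(torch.round(w / scales), -8, 7).reshape(out_features, in_features)
    q = (q + 8).to(torch.uint8)                     # shift to unsigned nibbles
    packed = (q[:, 1::2] << 4) | q[:, ::2]          # two 4-bit weights per uint8
    return packed, scales.reshape(out_features, -1)

in_features, out_features, groupsize = 128, 64, 32
linear = torch.nn.Linear(in_features, out_features, bias=True)

qweight, scales = quantize_4bit_groupwise(linear.weight.detach(), groupsize)
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(
    qweight, scales, linear.bias.detach(), groupsize, in_features, out_features
)

x = torch.randn(4, in_features)
out = torch.ops.aten._dyn_quant_matmul_4bit(
    x, packed_weights, groupsize, in_features, out_features
)
```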
API Usage: https://github.com/pytorch/pytorch/issues/143289
Model Perf:
7B Transformer model:
Prefill: 340 t/s
Decode: 40 t/s
2B Transformer model:
Prefill: 747 t/s
Decode: 80 t/s
Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s
OK
python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s
OK
python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s
Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243.
# Feature
This PR changes the `RINDEX` / `"r"` symbol type to `(R0_INDEX, R1_INDEX)` and `("r0_", "r1_")`, respectively. This allows the relevant code to support 2D (often ND) reductions. Unlike the parent PR, this one does not change the tiling algorithm, so `"r1_"` is never used. However, it prepares other parts of the system to handle `"r1_"` once we start using it. This should significantly reduce the chances of hitting merge conflicts, making the parent PR much easier to land.
The only change to the generated Triton code is to rename `"rindex"` -> `"r0_index"`, `"RBLOCK"` -> `"R0_BLOCK"`, etc. To maintain compatibility with existing codegen, this also generates aliases to the old reduction variables like `rindex = r0_index`. If we generated 2D reductions (which this PR will not do), the aliases would be more complicated and would collapse 2D multi-indices to linear indices. See some example kernels in the parent PR.
These aliases can be eliminated by the Triton compiler, and should not impact the final machine code running on the GPU. See the perf testing in the parent PR which confirms the aliases do not impact perf.
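For reference, a minimal hand-written sketch of the naming scheme and alias described above, assuming a simple 1D sum reduction; this is illustrative and not actual Inductor output:
```python
import triton
import triton.language as tl

@triton.jit
def sum_kernel(in_ptr, out_ptr, r0_numel, R0_BLOCK: tl.constexpr):
    # New-style reduction symbols: r0_index / R0_BLOCK instead of rindex / RBLOCK.
    r0_index = tl.arange(0, R0_BLOCK)
    rindex = r0_index  # alias to the legacy name, removable by the Triton compiler
    r0_mask = r0_index < r0_numel
    x = tl.load(in_ptr + rindex, mask=r0_mask, other=0.0)
    tl.store(out_ptr, tl.sum(x, axis=0))
```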
# Test plan
The existing CI provides good coverage. This PR modifies the expected code in a few places, renaming reduction variables from `r.*` to `r0_.*`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142020
Approved by: https://github.com/jansel
Co-authored-by: Jason Ansel <jansel@meta.com>
Fix: #141251
This PR adds a few static guard checks when decomposing and lowering the `slice` operation, so that we avoid adding unnecessary guards, specifically when clamping the end values.
In summary, the changes are:
- `slice` dynamo decomposition: checks `end >= sizes[dim]` statically. If that cannot be determined statically, only then is a guard added to establish whether clamping is needed.
- `evaluate_min` inductor `sizevar` function: checks whether the comparison can be resolved statically before actually creating a new guard.
The latter had to be changed because `evaluate_min` (called by the `ir.SliceView` constructor) would always try to create a guard based on the operands' hints. However, if both the `left` and `right` hints were true, it would default to the `left <= right` guard. By performing the static checks first, we can avoid creating that guard.
```python
import torch

N = 16

@torch.compile(backend="inductor", dynamic=False, fullgraph=True)
def fn(x):
    splits = torch.ops.aten.split.Tensor(x, N)
    first = splits[0]
    return torch.ops.aten.slice.Tensor(first, 0, 0, N)

x = torch.arange(N)
torch._dynamo.mark_dynamic(x, 0)
fn(x)
```
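For the `evaluate_min` change described above, a minimal sketch of the "check statically before guarding" idea. `statically_known_true` mirrors the helper in `torch.fx.experimental.symbolic_shapes`; the guard fallback is simplified and hypothetical, not the actual sizevar code:
```python
from torch.fx.experimental.symbolic_shapes import statically_known_true

def evaluate_min_sketch(left, right, guard_fn):
    # Resolve the comparison statically when possible, so no new guard is added.
    if statically_known_true(left <= right):
        return left
    if statically_known_true(right <= left):
        return right
    # Only when neither direction is statically known do we fall back to the
    # hint-based comparison, which installs a guard (guard_fn is hypothetical).
    return left if guard_fn(left <= right) else right
```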
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142372
Approved by: https://github.com/ezyang
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243.
# Feature
Follow up to https://github.com/pytorch/pytorch/pull/141751. Since we now represent `numels` as a dict, it's natural to extend this to `size_hints`. The latter are basically just the former rounded up to the nearest power of 2. This simplifies various heuristics such as the coordinate descent tuner. Where we previously needed to determine which index in `size_hints` corresponds to each dimension, now we can just query by prefix. This will be especially important when we enable 2D reductions, as it becomes harder to keep track of these things when we have multiple reduction dimensions. (See the previous PR for some examples.)
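An illustrative before/after with hypothetical values, just to show the shape of the change (prefix keys follow the `"x"` / `"r0_"` naming from the previous PR):
```python
# Before: positional tuples; each heuristic must remember which slot is which dimension.
numels = (1000, 50)
size_hints = (1024, 64)

# After: dicts keyed by symbol prefix, so heuristics can query dimensions by name.
# Each size hint is the corresponding numel rounded up to the next power of 2.
numels = {"x": 1000, "r0_": 50}
size_hints = {"x": 1024, "r0_": 64}
```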
# Test plan
The existing CI provides good coverage. This PR modifies a few tests which explicitly constructed size hints.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142249
Approved by: https://github.com/jansel
Summary: With autotune_at_compile_time enabled, AOTI can now perform CUDA codegen in one pass. CUDA-kernel-related code is generated in a deferred way, after autotuning is done. This one-pass implementation eliminates any issue caused by disparity between passes in the previous two-pass implementation (which caused multiple bug reports in the past). The one-pass implementation also avoids cloning mutated inputs, which the two-pass implementation required, reducing GPU memory consumption.
Differential Revision: [D66739414](https://our.internmc.facebook.com/intern/diff/D66739414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141980
Approved by: https://github.com/chenyang78
# Feature
Previously, only the codegen for `torch.sqrt` was dtype-aware. This PR updates most of the `libdevice`/`tl.math` ops to support dtype-aware codegen as well. This is often necessary to get correct code when `config.triton.codegen_upcast_to_fp32=False`, as most Triton math ops do not support float16/bfloat16.
This PR enables dtype-aware codegen via the `maybe_upcast_float32` decorator. This wraps `TritonOverrides` macros to upcast arguments to float32 and downcast the result back to the original dtype. The exception is ops that return booleans, in which case we set `convert_output=False` and skip the output cast.
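A hedged sketch of the upcast-compute-downcast pattern the decorator implements. The real `maybe_upcast_float32` wraps codegen methods on `TritonOverrides` and rewrites emitted Triton expressions; the tensor-level version below only illustrates the dtype logic:
```python
import functools
import torch

def upcast_to_fp32(convert_output: bool = True):
    # Illustrative stand-in for maybe_upcast_float32, operating on tensors.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(x: torch.Tensor, *args, **kwargs):
            orig_dtype = x.dtype
            low_precision = orig_dtype in (torch.float16, torch.bfloat16)
            if low_precision:
                x = x.to(torch.float32)          # upcast the argument to fp32
            out = fn(x, *args, **kwargs)
            if low_precision and convert_output:
                out = out.to(orig_dtype)         # downcast the result back
            return out                           # boolean-returning ops skip this
        return wrapper
    return decorator

@upcast_to_fp32()
def rsqrt_fp32(x):
    return torch.rsqrt(x)

@upcast_to_fp32(convert_output=False)
def isnan_fp32(x):
    return torch.isnan(x)                        # bool output: no downcast
```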
# Test Plan
Added CI tests for all the new ops. The list of ops to test is automatically generated based on uses of the `maybe_upcast_float32` decorator, and stored in the new `OpDtypeSupport` class. In each new test, we search the generated code for upcasts/downcasts using a regex.
Also added a unit test for `OpDtypeSupport` which checks that we have correct dtype info for ops that require upcasts.
This PR also moves some existing tests around, to collect all the dtype aware codegen tests in one file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140864
Approved by: https://github.com/eellison, https://github.com/arui-meta
Co-authored-by: eellison <elias.ellison@gmail.com>
If (ynumel / YBLOCK) > get_max_ygrids(), the z dimension will be used if znumel is None. However, if (ynumel / YBLOCK) % get_max_ygrids() != 0, there will be program launches with inputs that require masking, and so this needs to be considered when determining if the y dimension has a constant mask.
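Illustrative arithmetic for that condition, with hypothetical numbers (a sketch, not data from the PR):
```python
# Hypothetical numbers, purely to illustrate the masking condition above.
YBLOCK = 8
ynumel = 70000 * YBLOCK                      # 70000 y programs requested
max_ygrids = 65535                           # assumed hardware cap on the y grid

y_programs = ynumel // YBLOCK                # 70000 > max_ygrids, so spill into z
z_programs = -(-y_programs // max_ygrids)    # ceil division -> 2 slices along z

# 70000 % 65535 == 4465 != 0: the last z slice is only partially filled, so the
# y dimension cannot be assumed to have a constant (all-true) mask.
print(y_programs % max_ygrids)               # 4465
```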
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139751
Approved by: https://github.com/eellison
Co-authored-by: George White <georgew@graphcore.ai>
Certain `cpp_wrapper`-enabled tests were OOM-ing in the CI pipeline, with error messages suggesting that sufficient memory was accessible. This ultimately resulted from an internal memory limitation that was not queryable in the API. This PR adds querying for that limit.
Additionally, the failing tests had incorrect memory availability checks, and are updated with measured memory requirements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140620
Approved by: https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #141367
- Add upcast_compute_type on creation of new tensors (loads, constants).
- Fixes index_expr. Right now we are somewhat inconsistent in dtype and don't always respect the dtype specified; it would be nice to fix, but not doing that in this PR.
- Bug fix in view dtype: we were always upcasting back to fp32 when the input was bf16/fp16; we should only do that if the output is also bf16/fp16.
- For masked, avoid calling dtype propagation and just use the output dtype.
Turns on runtime dtype verification for opinfo tests. The separate test file is still useful because we can use it to test with codegen_upcast_to_fp32 turned off.
Follow-ups:
- We could consider requiring fewer explicit upcast_compute_type calls and doing it automatically. That would potentially make things easier but be less flexible in the future. Maybe I should have done it in this PR.
- Be more consistent about our index expr dtype printing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141495
Approved by: https://github.com/blaine-rister, https://github.com/arui-meta, https://github.com/ezyang
ghstack dependencies: #139945, #140057
Fix: #139936
This PR modifies the lowering of the `split` operation so that it won't generate guards specializing on the sizes parameter. Instead, it specializes on the number of output tensors being generated (i.e., a function of the size of the base tensor and the sizes parameter).
As a result, operations such as `chunk` (whose number of output tensors usually is
constant given a static chunk number) won't trigger recompiles when varying the size of
the base tensor.
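A hedged illustration of the intended effect (this repro is not from the PR; whether a recompile is actually avoided depends on the rest of the graph):
```python
import torch

# chunk with a fixed chunk count produces a fixed number of outputs, so the
# lowering no longer needs to specialize on the concrete split sizes and the
# second call should be served by the same compiled graph.
@torch.compile(backend="inductor", fullgraph=True)
def fn(x):
    return torch.chunk(x, 4, dim=0)

x = torch.arange(32.0)
torch._dynamo.mark_dynamic(x, 0)
fn(x)
fn(torch.arange(64.0))  # larger base tensor, same number of chunks
```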
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141077
Approved by: https://github.com/ezyang
Fix https://github.com/pytorch/pytorch/issues/140462.
Horace found that when we implicitly fall back to eager, some eager kernels may not work correctly if Inductor provides non-contiguous inputs (due to padding, etc.). The original issue was found in the backward op of weight_norm. The fix in this PR is a general one: we force the inputs to all implicit fallback kernels to be contiguous.
I had to refactor the code a bit to make this work. Previously we applied layout constraints in `GraphLowering.run_node`, but we look for implicit fallbacks in `call_function`. The problem is that when we set up an implicit fallback in `call_function` with a layout constraint, we never get a chance to apply that constraint. The refactor moves the code that applies layout constraints into `call_function`.
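A minimal sketch of the general fix, with hypothetical names; the real change lives in Inductor's lowering of fallback kernels, not in a user-level wrapper like this:
```python
import torch

def call_fallback_contiguous(op, *args, **kwargs):
    # Force every tensor argument to be contiguous before invoking the eager
    # kernel, so fallbacks that assume contiguous memory (e.g. weight_norm's
    # backward) produce correct results even when Inductor padded the inputs.
    args = tuple(a.contiguous() if isinstance(a, torch.Tensor) else a for a in args)
    kwargs = {k: v.contiguous() if isinstance(v, torch.Tensor) else v
              for k, v in kwargs.items()}
    return op(*args, **kwargs)

# Example: run an op on a non-contiguous view through the wrapper.
x = torch.randn(8, 10)[:, :6]    # non-contiguous slice
y = call_fallback_contiguous(torch.ops.aten.relu.default, x)
```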
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140996
Approved by: https://github.com/jansel