pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Jason Ansel	d189f92eb1	[inductor] Remove SIMDKernel.last_usage (#139364 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139364 Approved by: https://github.com/eellison, https://github.com/shunting314	2024-11-04 04:28:18 +00:00
PyTorch MergeBot	f3238106fd	Revert "Allow inplacing buffer when other users are inconsequential (#138383 )" This reverts commit 030f70b40bca62993bd65d03c58ded45601abe35. Reverted https://github.com/pytorch/pytorch/pull/138383 on behalf of https://github.com/huydhn due to Sorry for reverting this again, but I think it has a test failing internally and also on ROCm ([comment](https://github.com/pytorch/pytorch/pull/138383#issuecomment-2452898229))	2024-11-02 06:53:48 +00:00
PyTorch MergeBot	0863d6a08e	Revert "[inductor] Remove SIMDKernel.last_usage (#139364 )" This reverts commit 286d3ce266ce01ca905afb1cc9ea5d81abf79ff7. Reverted https://github.com/pytorch/pytorch/pull/139364 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))	2024-11-02 06:49:11 +00:00
PyTorch MergeBot	9331640e26	Revert "[inductor] Remove Node.last_usage mutation (#139365 )" This reverts commit 1e934b473cabe6bc003f66d9811082e97c958a31. Reverted https://github.com/pytorch/pytorch/pull/139365 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))	2024-11-02 06:49:10 +00:00
PyTorch MergeBot	dc4b459737	Revert "[inductor] Move remove_kernel_local_buffers to Kernel (#139370 )" This reverts commit b57b4b7f9b168389def15ea06a4dcf9e5f6f4f04. Reverted https://github.com/pytorch/pytorch/pull/139370 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))	2024-11-02 06:49:10 +00:00
PyTorch MergeBot	66a401c9e1	Revert "[inductor] Simplify remove_kernel_local_buffers (#139452 )" This reverts commit 73c0762a34ef152450287dbc365cb8db930031b7. Reverted https://github.com/pytorch/pytorch/pull/139452 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))	2024-11-02 06:49:10 +00:00
PyTorch MergeBot	98e11b0021	Revert "[inductor] sympy.Integer([01]) -> sympy.S.(Zero\|One) (#139523 )" This reverts commit c53beab3775671b5b7ec6106737c0d8939b8455a. Reverted https://github.com/pytorch/pytorch/pull/139523 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))	2024-11-02 06:49:10 +00:00
Jason Ansel	c53beab377	[inductor] sympy.Integer([01]) -> sympy.S.(Zero\|One) (#139523 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139523 Approved by: https://github.com/ezyang ghstack dependencies: #139364, #139365, #139370, #139452	2024-11-02 03:04:22 +00:00
Jason Ansel	73c0762a34	[inductor] Simplify remove_kernel_local_buffers (#139452 ) I plan to reuse `can_buffer_be_removed_through_fusion` in some heuristics. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139452 Approved by: https://github.com/shunting314 ghstack dependencies: #139364, #139365, #139370	2024-11-01 20:36:39 +00:00
Jason Ansel	b57b4b7f9b	[inductor] Move remove_kernel_local_buffers to Kernel (#139370 ) This method mutates the kernel, so it fits better in that class. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139370 Approved by: https://github.com/shunting314 ghstack dependencies: #139364, #139365	2024-11-01 16:28:15 +00:00
Jason Ansel	1e934b473c	[inductor] Remove Node.last_usage mutation (#139365 ) I can't figure out why this is needed. Let's see if tests fail. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139365 Approved by: https://github.com/shunting314 ghstack dependencies: #139364	2024-11-01 16:28:15 +00:00
Jason Ansel	286d3ce266	[inductor] Remove SIMDKernel.last_usage (#139364 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139364 Approved by: https://github.com/eellison, https://github.com/shunting314	2024-11-01 16:28:15 +00:00
Shunting Zhang	5e4c8b671c	[inductor] loaf-fix (#139376 ) Fix https://github.com/pytorch/pytorch/issues/128063 . Now for this snippet ``` def f(x): y = torch.sum(torch.sum(x, dim=-1)) z = x / 10.0 z_t = z.t().contiguous().t() return y, z, z_t ``` Inductor can generate a single kernel for the first reduction and the two pointwise kernels (if loop ordering after fusion is enabled), and the generated kernel reads `x` only ONCE. (Without proper handling, the two pointwise ops may each access `x` once even if they are fused.) The PR fixes 2 subtle bugs regarding LOAF (loop ordering after fusion). 1. When we reorder loops for a FusedSchedulerNode, we check whether each sub-node's sizes match. But some nodes have sizes of `list` type (if their loops are not reordered) while others have sizes of `tuple` type (if their loops are reordered). I could change the upstream code to uniformly use either `list` or `tuple`, but without strong enforcement, future code could break this. So I just convert the sizes to a uniform type before comparison. 2. We have a cache for tiling decisions of a BaseSchedulerNode. If we reorder loops for the node, we should invalidate the cache. Otherwise, a stale tiling decision can result in a (very) bad kernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139376 Approved by: https://github.com/jansel, https://github.com/eellison	2024-11-01 07:54:32 +00:00
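A minimal sketch of the first fix described in #139376, assuming sizes are nested sequences of integers. `sizes_match` is a hypothetical helper name used only for illustration; it is not the function used in Inductor's scheduler.

```python
from typing import Sequence


def sizes_match(a: Sequence[Sequence[int]], b: Sequence[Sequence[int]]) -> bool:
    """Compare two nested size specs regardless of list/tuple container type."""

    def normalize(sizes: Sequence[Sequence[int]]) -> tuple:
        # Convert every level to tuple so only the values are compared.
        return tuple(tuple(group) for group in sizes)

    return normalize(a) == normalize(b)


# Hypothetical sizes: same values, different container types.
reordered = ((4, 8), (16,))   # tuple-typed, as if the loops were reordered
untouched = ([4, 8], [16])    # list-typed, as if the loops were left alone

print(reordered == untouched)             # False: a tuple never equals a list
print(sizes_match(reordered, untouched))  # True once both sides are normalized
```

Normalizing at the comparison site, rather than at every producer, matches the reasoning in the commit message: it keeps working even if future code reintroduces a mix of `list` and `tuple`.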
Gabriel Ferns	030f70b40b	Allow inplacing buffer when other users are inconsequential (#138383 ) Summary: I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer. Implements: https://github.com/pytorch/pytorch/issues/132826 Test Plan: New unit test of matmul followed by LayerNorm, make sure there's an inplaced buffer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383 Approved by: https://github.com/eellison	2024-11-01 01:24:40 +00:00
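The rule described in #138383 could be sketched roughly as below. The function and parameter names (`can_inplace`, `removed`, `completed`, `ancestors`) are hypothetical and stand in for the scheduler's real bookkeeping, which this log does not show.

```python
from typing import Iterable, Set


def can_inplace(
    users: Iterable[str],
    removed: Set[str],
    completed: Set[str],
    ancestors: Set[str],
) -> bool:
    """Return True if every user of the buffer is 'inconsequential'.

    A user is treated as inconsequential when it has been removed, has
    already completed, or is an ancestor of the node that wants to reuse
    the buffer in place.
    """
    return all(u in removed or u in completed or u in ancestors for u in users)


# The buffer is read by op0 (already completed) and op1 (an ancestor).
print(can_inplace(["op0", "op1"], set(), {"op0"}, {"op1"}))  # True
# A pending, unrelated reader (op2) blocks in-place reuse.
print(can_inplace(["op0", "op2"], set(), {"op0"}, {"op1"}))  # False
```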
PyTorch MergeBot	289e03a429	Revert "Allow inplacing buffer when other users are inconsequential (#138383 )" This reverts commit 8840889c3f6565b7975150adebcbe062f19035ee. Reverted https://github.com/pytorch/pytorch/pull/138383 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to break trunk after landing ([comment](https://github.com/pytorch/pytorch/pull/138383#issuecomment-2448824206))	2024-10-31 01:32:15 +00:00
Gabriel Ferns	8840889c3f	Allow inplacing buffer when other users are inconsequential (#138383 ) Summary: I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer. Implements: https://github.com/pytorch/pytorch/issues/132826 Test Plan: New unit test of matmul followed by LayerNorm, make sure there's an inplaced buffer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383 Approved by: https://github.com/eellison	2024-10-30 21:35:50 +00:00
Yifu Wang	7765d1ef70	Preliminary registered-buffer collective support via Inductor (#138029 ) ``` NOTE [lowering-time collective optimization] In collective communication libraries such as NCCL, every rank maintains communication buffers that are remotely accessible by some peers. Depending on the underlying transport, remote accessibility may be established via mechanisms such as ib_reg_mr, CUDA P2P, or CUDA multicast. Typically, these buffers are private to the communication library by default, and communication ops copy user data in and out of these buffers. To prevent these copies, an optimization commonly known as "user buffer registration" can be employed. This allows direct establishment of remote accessibility on user buffers, eliminating the need for copying. However, this optimization introduces stringent usage requirements, which are typically hard to satisfy without being intrusive to the user code: - Establishing remote accessibility is expensive and often done ahead of time. In such implementations, all ranks must agree on the set of allocations used for every collective op. Failing to meet this requirement can lead to runtime errors or even silent correctness issues. - Even if the collective communication library supports gracefully falling back to "unregistered" implementations, the fallback mechanism would nullify the optimization. - Some communication mechanisms impose stricter requirements than others. For example, CUDA's multicast + multi-mem instructions require all ranks to agree not only on the allocations used for every collective but also on the offsets within these allocations. To support all different mechanisms with optimal results, we aim to satisfy the strictest requirement for this family of optimizations: we ensure that every collective op invocation operates on the same allocation, at the same offset, in every iteration. 
For eligible collective ops, we identify communication buffers at lowering time and optionally choose to lower the op to a different kernel (communication libraries like NCCL handle both registered and non-registered buffers transparently within the same op, though some may require different ops for different cases). Later, the codegen will perform "persistent allocation" to satisfy the aforementioned constraints, and optionally, perform buffer planning to optimize overall memory usage. ``` ### Changes - Created `comm_lowering.py` for the lowerings of `_c10d_functional` ops. This is to prevent cluttering `lowering.py` as we add more lowering-time collective optimizations. This PR moved the lowerings for `all_reduce` and `all_reduce_` to the file. - Added `comm_buffer_type: Dict[str, str]` to `GraphLowering` to track whether a buffer is a comm buffer and the type of the comm buffer. - Added codegen allocation support for comm buffers of type "symm_mem". - Added support for auto-lowering `_c10d_functional.all_reduce_` to `symm_mem.one_shot_all_reduce`. - Added an Inductor config for collective optimizations in general (`config._collective`). ### Limitation Currently, each persistently allocated comm buffer is dedicated to a single callsite. This is not viable in terms of memory usage. However, this is a necessary intermediate state before we tackle memory planning for comm buffers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138029 Approved by: https://github.com/Chillee ghstack dependencies: #138028	2024-10-30 18:11:09 +00:00
Jason Ansel	3217ae2082	[inductor] Only apply score_fusion_memory_threshold to horizontal fusions (#138970 ) PR #136782 made `x.sum()+1` become two kernels, which hurts compile times as @ezyang noticed and breaks a lot of the tests in this stack. This reworks that heuristic to not apply as often. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138970 Approved by: https://github.com/shunting314	2024-10-27 16:31:38 +00:00
Shunting Zhang	0a38c0ec89	[inductor] add a threshold for membw saving during fusion (#136782 ) Fix https://github.com/pytorch/pytorch/issues/133242 . In that issue, inductor fuses 2 nodes because they access the same scalar tensor. This saving is very small (4 bytes), and if we ignore it, by default we cannot fuse. But if loop ordering after fusion kicks in, we can reorder loops and fuse those 2 nodes, which gives 33% memory bandwidth savings. I think adding a threshold for membw saving in general is not bad. I'll run a perf test. ( https://github.com/pytorch/pytorch/actions/runs/11375421752 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136782 Approved by: https://github.com/jansel	2024-10-22 00:50:00 +00:00
PyTorch MergeBot	ac7f52b301	Revert "[inductor] add a threshold for membw saving during fusion (#136782 )" This reverts commit 6647320de2077c10309f5025a007d51c7fb542d8. Reverted https://github.com/pytorch/pytorch/pull/136782 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_memory starts to fail after this lands in trunk ([comment](https://github.com/pytorch/pytorch/pull/136782#issuecomment-2423549196))	2024-10-19 03:43:42 +00:00
Shunting Zhang	6647320de2	[inductor] add a threshold for membw saving during fusion (#136782 ) Fix https://github.com/pytorch/pytorch/issues/133242 . In that issue, inductor fuses 2 nodes because they access the same scalar tensor. This saving is very small (4 bytes), and if we ignore it, by default we cannot fuse. But if loop ordering after fusion kicks in, we can reorder loops and fuse those 2 nodes, which gives 33% memory bandwidth savings. I think adding a threshold for membw saving in general is not bad. I'll run a perf test. ( https://github.com/pytorch/pytorch/actions/runs/11375421752 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136782 Approved by: https://github.com/jansel	2024-10-19 00:22:43 +00:00
Jason Ansel	4632594546	[inductor] Move V.graph.scheduler.current_device to V.graph.current_device (#138252 ) There are some places where it would be nice to use this, but the scheduler hasn't yet been created. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138252 Approved by: https://github.com/eellison ghstack dependencies: #138170	2024-10-18 23:05:54 +00:00
Artemiy Bulavin	74e871355b	Add hooks to Scheduler nodes for generating device-specific debug strings (#135015 ) Previously, instances of `SchedulerNode` and `FusedSchedulerNode` would explicitly check whether the compilation target is Triton when codegen'ing debug strings. Generating debug Triton code is instead implemented as a callback set on scheduler nodes by `TritonScheduling`. This makes the codegen more device-agnostic and allows schedulers to customise the codegen output, as opposed to it being closely coupled to the debug string codegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135015 Approved by: https://github.com/jansel	2024-10-11 20:30:49 +00:00
eellison	4aed81c0db	Add support for cat memory planning mms with max autotune (#132554 ) When we are autotuning matmuls, the aten.mm and the Triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. So long as the output shape and stride of the matmul match the slice of the cat we're planning, we can realize the mm directly into the cat. Discussion for reviewers: It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](`bcac71517c/torch/_inductor/kernel/mm.py (L156)`). While this is correct, it might lead to passing non-performant output strides to cuBLAS. I guess this is better than a copy? Not sure. We could also introduce a Layout that denotes a fixed shape and stride for which we control allocation ``` class AllocatedFixedLayout(FixedLayout) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554 Approved by: https://github.com/jansel	2024-10-08 22:36:46 +00:00
PyTorch MergeBot	493d0eeef3	Revert "Add support for cat memory planning mms with max autotune (#132554 )" This reverts commit d558ec07300defee24dd4a83ab4b387a39ea2176. Reverted https://github.com/pytorch/pytorch/pull/132554 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/132554#issuecomment-2398946854))	2024-10-08 06:21:06 +00:00
eellison	d558ec0730	Add support for cat memory planning mms with max autotune (#132554 ) When we are autotuning matmuls, the aten.mm and the Triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. So long as the output shape and stride of the matmul match the slice of the cat we're planning, we can realize the mm directly into the cat. Discussion for reviewers: It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](`bcac71517c/torch/_inductor/kernel/mm.py (L156)`). While this is correct, it might lead to passing non-performant output strides to cuBLAS. I guess this is better than a copy? Not sure. We could also introduce a Layout that denotes a fixed shape and stride for which we control allocation ``` class AllocatedFixedLayout(FixedLayout) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554 Approved by: https://github.com/jansel	2024-10-07 22:49:29 +00:00
Gabriel Ferns	36fb342ffd	Check for fused kernel before inplace update (#137042 ) Summary: Given an op, with a pair (output buffer, input buffer) from that op, we consider marking the output buffer as inline. However, if the parent of input buffer and the current op are going to be fused, then we don't want to mark the output buffer as inline. This change checks that criterion, and skips inlining if it is so. Test Plan: New unit test "layer_norm_should_not_inplace" runs LayerNorm and checks for no "in_out" pointers. Fixes #120217 Here's a diagram of the issue: ![Inline+Fusion](https://github.com/user-attachments/assets/c03308d8-fdbf-40a0-a46d-964ece5f9e6d) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137042 Approved by: https://github.com/eellison	2024-10-02 21:14:34 +00:00
Jez Ng	71aac59e93	Add Triton CPU as an Inductor backend (#133408 ) The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend. Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408 Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet	2024-09-30 20:24:52 +00:00
PyTorch MergeBot	36428f91e9	Revert "Add Triton CPU as an Inductor backend (#133408 )" This reverts commit 31c0467594c7c41c8e8ff1828bf01fa31fc4454f. Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/int3 due to internal tests failing ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2379692517))	2024-09-27 16:54:27 +00:00
eellison	aa56f80ec1	Dont pairwise check unfusable nodes in scheduler (#136682 ) Gives 8% wall time speedup on n=1000 benchmark in https://github.com/pytorch/pytorch/pull/136429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136682 Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/shunting314	2024-09-26 23:46:52 +00:00
Jez Ng	31c0467594	Add Triton CPU as an Inductor backend (#133408 ) The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend. Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408 Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet	2024-09-26 15:35:26 +00:00
Xuan Zhang	03957efa5d	[inductor][scheduler] reorder scheduler nodes after fusion to reduce peak memory (#134874 ) Motivations: A topological order of the scheduler nodes that optimizes the liveness of buffers can reduce the peak memory utilization. This has been observed and studied e.g., [here](https://arxiv.org/pdf/1910.02653) and [here](https://proceedings.mlr.press/v202/steiner23a/steiner23a.pdf). Solutions: 1. implement a peak memory estimator via liveness analysis 2. implement a few memory-aware topological sorting algorithms and pick the one with the lowest peak memory Results: On some models we can reduce the peak memory significantly: \| model \| batch size \| peak_memory baseline \| peak_memory new \| ratio \| \|:-----------------------------:\|:----------:\|:--------------------:\|:---------------:\|:-----:\| \| alexnet \| 128 \| 1.17 \| 0.99 \| 1.19 \| \| vgg16 \| 64 \| 4.10 \| 3.57 \| 1.15 \| \| DebertaV2ForQuestionAnswering \| 1 \| 11.60 \| 10.56 \| 1.10 \| In the presence of compiler based AC, peak memory can be further reduced: \| model \| batch size \| peak_memory baseline \| peak_memory new \| ratio \| \|:------------------------------:\|:----------:\|:--------------------:\|:---------------:\|:-----:\| \| AlbertForMaskedLM \| 4 \| 6.87 \| 6.43 \| 1.07 \| \| AlbertForQuestionAnswering \| 4 \| 8.69 \| 7.76 \| 1.12 \| \| MobileBertForQuestionAnswering \| 128 \| 4.67 \| 3.90 \| 1.20 \| [Here](https://fb.workplace.com/groups/1075192433118967/posts/1499920537312819/?comment_id=1499938843977655&reply_comment_id=1499951630643043) is an internal use case. Other info: * neutral model runtime, because the reordering happens after fusion. So memory saving is _for free_. * minimal compile time overhead as the algorithm is linear in the number of edges of the inductor graph. For all Hugging Face benchmark models, the additional compile time is less than 1 second. 
* no peak memory regression since we only adopt a new order if the peak memory is reduced based on the estimator. However, the model is unaware of operators' working memories, but for large models, the working memory should be negligible. We haven't observed any significant regressions on all of our tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134874 Approved by: https://github.com/yf225	2024-09-21 16:28:38 +00:00
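The two steps listed in #134874 (estimate peak memory via liveness, then keep a new order only if the estimate improves) can be illustrated with a toy model. This is a sketch under stated assumptions, not Inductor's implementation: every node writes exactly one buffer, a buffer is freed right after its last consumer runs, and the names `estimate_peak`, `is_topological` and `pick_order` are hypothetical.

```python
def estimate_peak(order, deps, sizes):
    """Peak live bytes when nodes execute in `order` (toy liveness model)."""
    position = {node: i for i, node in enumerate(order)}
    # A buffer with no consumers dies at the step that produced it.
    last_use = dict(position)
    for node, inputs in deps.items():
        for inp in inputs:
            last_use[inp] = max(last_use[inp], position[node])
    live = peak = 0
    for step, node in enumerate(order):
        live += sizes[node]  # allocate this node's output
        peak = max(peak, live)
        # Free every buffer whose last consumer just ran (quadratic, for clarity).
        live -= sum(sizes[n] for n in order if last_use[n] == step)
    return peak


def is_topological(order, deps):
    """Check that every node runs after all of its inputs."""
    position = {node: i for i, node in enumerate(order)}
    return all(position[i] < position[n] for n, ins in deps.items() for i in ins)


def pick_order(baseline, candidates, deps, sizes):
    """Keep the baseline unless a valid candidate has a strictly lower estimate."""
    best, best_peak = baseline, estimate_peak(baseline, deps, sizes)
    for cand in candidates:
        if is_topological(cand, deps):
            peak = estimate_peak(cand, deps, sizes)
            if peak < best_peak:
                best, best_peak = cand, peak
    return best, best_peak


# Hypothetical graph: two large producers (a, b), each reduced to a small value.
deps = {"a": [], "b": [], "c": ["a"], "d": ["b"], "e": ["c", "d"]}
sizes = {"a": 10, "b": 10, "c": 1, "d": 1, "e": 1}
baseline = ["a", "b", "c", "d", "e"]
candidates = [["a", "c", "b", "d", "e"]]

print(estimate_peak(baseline, deps, sizes))           # 21
print(pick_order(baseline, candidates, deps, sizes))  # (['a', 'c', 'b', 'd', 'e'], 12)
```

Because the baseline is only replaced when the estimate strictly improves, the estimated peak cannot regress, which mirrors the "no peak memory regression" point in the commit message.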
Aaron Gokaslan	31715be72a	[BE]: Update mypy to 1.11.2 (#133816 ) Updates mypy to 1.11.1 to improve type inference Pull Request resolved: https://github.com/pytorch/pytorch/pull/133816 Approved by: https://github.com/ezyang	2024-09-16 19:44:11 +00:00
PyTorch MergeBot	d0cebedb31	Revert "Add Triton CPU as an Inductor backend (#133408 )" This reverts commit e498b02b472e45cfd6b7a08db0d6c1babec655c5. Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))	2024-09-16 18:33:33 +00:00
PyTorch MergeBot	3117f2cf67	Revert "[BE]: Update mypy to 1.11.2 (#133816 )" This reverts commit 55299cfc223fa838aadd8d6d6fa3ed541fa5acd1. Reverted https://github.com/pytorch/pytorch/pull/133816 on behalf of https://github.com/jeanschmidt due to seems to have broken https://github.com/pytorch/pytorch/actions/runs/10865710499/job/30155699792 on main ([comment](https://github.com/pytorch/pytorch/pull/133816#issuecomment-2352377684))	2024-09-16 09:11:16 +00:00
Jez Ng	e498b02b47	Add Triton CPU as an Inductor backend (#133408 ) The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408 Approved by: https://github.com/jansel	2024-09-14 21:45:19 +00:00
Aaron Gokaslan	55299cfc22	[BE]: Update mypy to 1.11.2 (#133816 ) Updates mypy to 1.11.1 to improve type inference Pull Request resolved: https://github.com/pytorch/pytorch/pull/133816 Approved by: https://github.com/ezyang	2024-09-14 21:40:36 +00:00
Jason Ansel	d3aab9642b	[inductor] Optimize can_fuse_vertical() (#135788 ) An O(n^2) to O(n) improvement by not comparing all pairs of deps. Before: ![image](https://github.com/user-attachments/assets/797cd1bd-5d53-4374-8e76-ffce4232d7f9) After: ![image](https://github.com/user-attachments/assets/1e61bf29-adba-41a4-839e-f028130fa979) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135788 Approved by: https://github.com/oulgen ghstack dependencies: #135787	2024-09-13 00:18:41 +00:00
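The kind of change described in #135788, avoiding a comparison of all pairs of deps, can be sketched by bucketing deps by buffer name first. This is an illustration only: `Dep`, `unmatched_reads_pairwise` and `unmatched_reads_grouped` are hypothetical names, and plain equality stands in for whatever matching rule the scheduler actually applies.

```python
from collections import defaultdict
from typing import NamedTuple


class Dep(NamedTuple):
    name: str   # buffer name
    index: str  # symbolic index expression, kept as a string in this sketch


def unmatched_reads_pairwise(writes, reads):
    """Baseline: compare every read against every write, O(len(reads) * len(writes))."""
    return [r for r in reads if not any(w == r for w in writes)]


def unmatched_reads_grouped(writes, reads, matches=lambda w, r: w == r):
    """Bucket reads by buffer name so each write only visits its own bucket."""
    by_name = defaultdict(list)
    for r in reads:
        by_name[r.name].append(r)
    for w in writes:
        bucket = by_name.get(w.name)
        if bucket:
            by_name[w.name] = [r for r in bucket if not matches(w, r)]
    return [r for bucket in by_name.values() for r in bucket]


writes = [Dep("buf0", "x0"), Dep("buf1", "x0")]
reads = [Dep("buf0", "x0"), Dep("buf2", "x1"), Dep("buf1", "x1")]

# Both versions agree on which reads are left unmatched.
print(sorted(unmatched_reads_pairwise(writes, reads)))
print(sorted(unmatched_reads_grouped(writes, reads)))
```

When each buffer name appears in only a few deps, the grouped version does work roughly proportional to the number of deps rather than to the product of the two lists.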
Jason Ansel	67a929eea8	[inductor] Remove unused check (#135787 ) I think this is unreachable code because mode is always None on reads. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135787 Approved by: https://github.com/oulgen	2024-09-13 00:18:41 +00:00
Jason Ansel	6354271178	[inductor] Skip unused call to get_estimated_runtime() (#135776 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135776 Approved by: https://github.com/oulgen ghstack dependencies: #135445, #135446	2024-09-12 05:22:23 +00:00
Jason Ansel	12902f6ecf	[inductor] Cache get_operation_names/get_buffer_names (#135446 ) Before: ![image](https://github.com/user-attachments/assets/db5b6fce-d849-4512-a21d-7a09efc72311) After: ![image](https://github.com/user-attachments/assets/097e340c-03b2-491e-ad36-132350b37892) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135446 Approved by: https://github.com/oulgen ghstack dependencies: #135445	2024-09-12 05:22:23 +00:00
Jason Ansel	53290ca00b	[inductor] Refactor BaseSchedulerNode.__init__ (#135400 ) Might be a small compile time improvement since we remove a call to extract_read_writes(). Pull Request resolved: https://github.com/pytorch/pytorch/pull/135400 Approved by: https://github.com/oulgen ghstack dependencies: #135286, #135306, #135377	2024-09-08 18:02:36 +00:00
Jason Ansel	16f5155992	[inductor] Fast path for extract_read_writes without tracing (#135377 ) Before (bottom of stack): ![image](https://github.com/user-attachments/assets/13060ff9-b31d-42a9-8e8f-c50b2bf3dc2f) After (this PR): ![image](https://github.com/user-attachments/assets/7d190821-b614-46b7-9e9e-9087443df654) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135377 Approved by: https://github.com/oulgen ghstack dependencies: #135286, #135306	2024-09-08 18:02:32 +00:00
Jason Ansel	37144be03d	[inductor] Remove ReadWrites.op_counts (#135306 ) This was (almost) unused. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135306 Approved by: https://github.com/oulgen ghstack dependencies: #135286	2024-09-08 18:02:28 +00:00
Jason Ansel	eac5e12548	[inductor] Move LoopBody to its own file (#135257 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135257 Approved by: https://github.com/oulgen	2024-09-07 16:29:15 +00:00
Jason Ansel	ea231300d1	[inductor] Improve compile time regression from MemoryDep.normalize (#135070 ) Possible fix for #135056 Before ![image](https://github.com/user-attachments/assets/3962cb85-e808-4fd4-991f-471ff5ef7eae) After ![image](https://github.com/user-attachments/assets/2322d48d-6518-4518-baca-336027b5cda8) Measured based on: ``` python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --training --only hf_Bert_large --stats -n1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135070 Approved by: https://github.com/Chillee	2024-09-05 23:41:30 +00:00
Sun, Jiayi	13a4a0c60d	[Inductor] Apply loop split optimization in codegen_node (#132389 ) This PR applies loop split optimization in codegen_node to avoid non-contiguous load. When the vector is loaded in a non-contiguous manner due to a division in the index, we eliminate the division by splitting the loop to avoid non-contiguous load. Example: ``` import torch import torch.nn as nn class GNReLU(torch.nn.Module): def __init__(self, num_groups, num_channels): super(GNReLU, self).__init__() self.gn = nn.GroupNorm(num_groups, num_channels) def forward(self, x): return torch.nn.functional.relu(self.gn(x)) input = torch.randn(2, 960, 96, 96).to(memory_format=torch.channels_last) m = GNReLU(32, 960).eval() compiled_m = torch.compile(m) with torch.no_grad(): compiled_m(input) ``` Generated code: - Before: ``` cpp_fused_native_group_norm_relu_0 = async_compile.cpp_pybinding(['const float', 'const float', 'const float', 'float', 'float', 'float'], ''' #include "/tmp/torchinductor_jiayisun/vu/cvuckxaygqfovv2zu2byqhcmiejbke7mdhf2rpgpr5mlscdev2hg.h" extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2) { #pragma omp parallel num_threads(56) { int tid = omp_get_thread_num(); { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L)) { { Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<long>(17280L)); for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L)) { for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L)) { auto tmp0 = 
at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16); tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0); } for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 14); masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0); } } tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec)); tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2); } } } } { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L)) { for(long x2=static_cast<long>(0L); x2<static_cast<long>(960L); x2+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)), 16); auto tmp1 = [&] { __at_align__ std::array<float, 16> tmpbuf; #pragma GCC unroll 16 for (long x2_inner = 0; x2_inner < 16; x2_inner++) { tmpbuf[x2_inner] = out_ptr0[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))]; } return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16); } () ; auto tmp3 = [&] { __at_align__ std::array<float, 16> tmpbuf; #pragma GCC unroll 16 for (long x2_inner = 0; x2_inner < 16; x2_inner++) { tmpbuf[x2_inner] = out_ptr1[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))]; } return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16); } () ; auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x2), 16); auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x2), 
16); auto tmp2 = tmp0 - tmp1; auto tmp4 = static_cast<float>(276480.0); auto tmp5 = at::vec::Vectorized<float>(tmp4); auto tmp6 = tmp3 / tmp5; auto tmp7 = static_cast<float>(1e-05); auto tmp8 = at::vec::Vectorized<float>(tmp7); auto tmp9 = tmp6 + tmp8; auto tmp10 = tmp9.rsqrt(); auto tmp11 = tmp2 * tmp10; auto tmp13 = tmp11 * tmp12; auto tmp15 = tmp13 + tmp14; auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0)); tmp16.store(out_ptr2 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0))); } } } } } } ''') async_compile.wait(globals()) del async_compile def call(args): arg2_1, = args args.clear() assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960)) buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32) buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32) buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32) cpp_fused_native_group_norm_relu_0(arg2_1, _frozen_param3, _frozen_param2, buf0, buf1, buf3) del arg2_1 return (buf3, ) ``` - After: ``` cpp_fused_native_group_norm_relu_0 = async_compile.cpp_pybinding(['const float', 'const float', 'const float', 'float', 'float', 'float'], ''' #include "/tmp/torchinductor_jiayisun/vu/cvuckxaygqfovv2zu2byqhcmiejbke7mdhf2rpgpr5mlscdev2hg.h" extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2) { #pragma omp parallel num_threads(56) { int tid = omp_get_thread_num(); { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L)) { { Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); static WeightRecp<at::vec::Vectorized<float>> 
wrecps0(static_cast<long>(17280L)); for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L)) { for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16); tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0); } for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 14); masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0); } } tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec)); tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2); } } } } { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L)) { #pragma GCC ivdep for(long x2=static_cast<long>(0L); x2<static_cast<long>(32L); x2+=static_cast<long>(1L)) { for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 16); auto tmp1 = out_ptr0[static_cast<long>(x2 + (32L*x0))]; auto tmp4 = out_ptr1[static_cast<long>(x2 + (32L*x0))]; auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x3 + (30L*x2)), 16); auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x3 + (30L*x2)), 16); auto tmp2 = at::vec::Vectorized<float>(tmp1); auto tmp3 = tmp0 - tmp2; auto tmp5 = static_cast<float>(276480.0); auto tmp6 = tmp4 / tmp5; 
auto tmp7 = static_cast<float>(1e-05); auto tmp8 = decltype(tmp6)(tmp6 + tmp7); auto tmp9 = 1 / std::sqrt(tmp8); auto tmp10 = at::vec::Vectorized<float>(tmp9); auto tmp11 = tmp3 * tmp10; auto tmp13 = tmp11 * tmp12; auto tmp15 = tmp13 + tmp14; auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0)); tmp16.store(out_ptr2 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0))); } for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 14); auto tmp1 = out_ptr0[static_cast<long>(x2 + (32L*x0))]; auto tmp4 = out_ptr1[static_cast<long>(x2 + (32L*x0))]; auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x3 + (30L*x2)), 14); auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x3 + (30L*x2)), 14); auto tmp2 = at::vec::Vectorized<float>(tmp1); auto tmp3 = tmp0 - tmp2; auto tmp5 = static_cast<float>(276480.0); auto tmp6 = tmp4 / tmp5; auto tmp7 = static_cast<float>(1e-05); auto tmp8 = decltype(tmp6)(tmp6 + tmp7); auto tmp9 = 1 / std::sqrt(tmp8); auto tmp10 = at::vec::Vectorized<float>(tmp9); auto tmp11 = tmp3 * tmp10; auto tmp13 = tmp11 * tmp12; auto tmp15 = tmp13 + tmp14; auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0)); tmp16.store(out_ptr2 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 14); } } } } } } } ''') async_compile.wait(globals()) del async_compile def call(args): arg2_1, = args args.clear() assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960)) buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32) buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32) buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32) cpp_fused_native_group_norm_relu_0(arg2_1, _frozen_param3, _frozen_param2, buf0, buf1, buf3) del arg2_1 return (buf3, ) ``` Pull Request resolved: 
https://github.com/pytorch/pytorch/pull/132389 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel Co-authored-by: Jiong Gong <jiong.gong@intel.com>	2024-09-04 22:42:46 +00:00
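The before/after kernels in the message above differ mainly in how the normalization loop indexes the per-group statistics. The following is a minimal pure-Python sketch of that idea, not Inductor's implementation: the function names `normalize_unsplit` and `normalize_split` are hypothetical, the data is a single flat row of channels, and the vectorization and OpenMP details are left out. In the kernels above the group size is 30 channels and `count` is 276480 (30 channels times 9216 spatial positions).

```python
import math


def normalize_unsplit(x, mean, m2, weight, bias, channels_per_group, count, eps=1e-5):
    """'Before' shape: one flat channel loop.

    The group index is recomputed for every element with a floor division,
    which is what forces the per-lane gather in the vectorized kernel.
    """
    out = []
    for c, value in enumerate(x):
        g = c // channels_per_group
        rstd = 1.0 / math.sqrt(m2[g] / count + eps)
        y = (value - mean[g]) * rstd * weight[c] + bias[c]
        out.append(max(y, 0.0))  # fused ReLU
    return out


def normalize_split(x, mean, m2, weight, bias, channels_per_group, count, eps=1e-5):
    """'After' shape: the channel loop is split into (group, channel-in-group).

    The statistics are read and inverted once per group and then reused
    as scalars for every channel of that group.
    """
    num_groups = len(x) // channels_per_group
    out = []
    for g in range(num_groups):
        group_mean = mean[g]
        rstd = 1.0 / math.sqrt(m2[g] / count + eps)
        base = g * channels_per_group
        for i in range(channels_per_group):
            c = base + i
            y = (x[c] - group_mean) * rstd * weight[c] + bias[c]
            out.append(max(y, 0.0))  # fused ReLU
    return out
```

Both functions perform the same arithmetic in the same order, so they return identical results; only the loop structure changes.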
Shunting Zhang	873abfc18e	[inductor] fix compile time regression due to the (disabled) loop ordering after fusion (#135071 ) It's a bit surprising that the code added in Scheduler.fusable_read_and_write would increase compilation time. Here are some numbers I got from an H100 on BertForMaskedLM: - without the fix, cold start compilation time is around 82s - with the fix, cold start compilation time is around 76s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135071 Approved by: https://github.com/jansel	2024-09-04 18:36:59 +00:00
PyTorch MergeBot	f927bcb934	Revert "[Inductor] Apply loop split optimization in codegen_node (#132389 )" This reverts commit 3cb5d251224b3fb59b5a10c6fefbb4c84eb565a6. Reverted https://github.com/pytorch/pytorch/pull/132389 on behalf of https://github.com/ZainRizvi due to Hi, this seems to be breaking in trunk. See test_dataloader.py::TestDataLoader::test_segfault [GH job link](https://github.com/pytorch/pytorch/actions/runs/10660461216/job/29556282081) [HUD commit link](`de3a641476`) ([comment](https://github.com/pytorch/pytorch/pull/132389#issuecomment-2326843129))	2024-09-03 15:40:45 +00:00
Edward Z. Yang	ee03530fd9	Add a test to avoid decorator based regression for cprofile traces (#133086 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/133086 Approved by: https://github.com/aorenste	2024-09-02 12:53:34 +00:00

513 Commits