pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-21 05:34:18 +08:00

Author	SHA1	Message	Date
xinan.lin	2ffb510942	[Break XPU][Indutor UT] Fix failures introduced by community. (#159463 ) Fixes #159000, Fixes #159335, Fixes #159334, Fixes #159332, Fixes #159331, Fixes #159330 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159463 Approved by: https://github.com/jansel	2025-07-31 08:37:41 +00:00
Xuehai Pan	17687eb792	[BE][4/6] fix typos in test/ (test/inductor/) (#157638 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157638 Approved by: https://github.com/yewentao256, https://github.com/jansel	2025-07-06 06:34:25 +00:00
Oguz Ulgen	a2a75be0f8	Rename inductor cache (#156128 ) Requested by Simon on a different PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128 Approved by: https://github.com/xmfan	2025-06-17 03:57:18 +00:00
henrylhtsang	02cecd1018	[inductor][test] Skip triton tests for MPS as well, also change reason for skipping SM89 to not IS_BIG_GPU (#151506 ) Differential Revision: [D73162091](https://our.internmc.facebook.com/intern/diff/D73162091/) Combining / improving https://github.com/pytorch/pytorch/pull/150485 and https://github.com/pytorch/pytorch/pull/150343 Pull Request resolved: https://github.com/pytorch/pytorch/pull/151506 Approved by: https://github.com/ColinPeppler	2025-04-21 20:14:34 +00:00
PyTorch MergeBot	e434a9152e	Revert "[inductor][test] Skip triton tests for MPS as well, also change reason for skipping SM89 to not IS_BIG_GPU (#151506 )" This reverts commit 6246c7d62ca2f091838d5c707e3d932994c5e35a. Reverted https://github.com/pytorch/pytorch/pull/151506 on behalf of https://github.com/henrylhtsang due to seems to be breaking some rocm mi300 run ([comment](https://github.com/pytorch/pytorch/pull/151506#issuecomment-2815999009))	2025-04-18 18:40:17 +00:00
henrylhtsang	6246c7d62c	[inductor][test] Skip triton tests for MPS as well, also change reason for skipping SM89 to not IS_BIG_GPU (#151506 ) Differential Revision: [D73162091](https://our.internmc.facebook.com/intern/diff/D73162091/) Combining / improving https://github.com/pytorch/pytorch/pull/150485 and https://github.com/pytorch/pytorch/pull/150343 Pull Request resolved: https://github.com/pytorch/pytorch/pull/151506 Approved by: https://github.com/ColinPeppler	2025-04-18 17:26:16 +00:00
Jason Ansel	b040dc3a53	Reland: [inductor] Simplify grid handling (#148305 ) Summary: Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583 Before this PR, calling a triton kernel would look like: ```py kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0) ``` where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg: ```py kernel.run(a, b, xnumel, stream=stream0) ``` instead now the grid computation is included in the kernel launcher, with something like: ```py def launcher(in_ptr0, out_ptr0, xnumel, stream): grid_0 = ((xnumel + 1023) >> 10) grid_1 = 1 grid_2 = 1 runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel) ``` This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`. It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid. This unification allows this PR to be a net deletion of code. Differential [disconnected] Revision: D70471332 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305 Approved by: https://github.com/shunting314, https://github.com/eellison	2025-03-12 15:52:16 +00:00
PyTorch MergeBot	5ada4e6a53	Revert "Reland: [inductor] Simplify grid handling (#148305 )" This reverts commit 8d08b4901586f230353a558ee00c16ad57f95178. Reverted https://github.com/pytorch/pytorch/pull/148305 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/148305#issuecomment-2718177044))	2025-03-12 14:58:43 +00:00
Jason Ansel	8d08b49015	Reland: [inductor] Simplify grid handling (#148305 ) Summary: Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583 Before this PR, calling a triton kernel would look like: ```py kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0) ``` where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg: ```py kernel.run(a, b, xnumel, stream=stream0) ``` instead now the grid computation is included in the kernel launcher, with something like: ```py def launcher(in_ptr0, out_ptr0, xnumel, stream): grid_0 = ((xnumel + 1023) >> 10) grid_1 = 1 grid_2 = 1 runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel) ``` This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`. It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid. This unification allows this PR to be a net deletion of code. Differential Revision: D70471332 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305 Approved by: https://github.com/shunting314, https://github.com/eellison	2025-03-11 18:51:06 +00:00
Wang, Eikan	6a3a1f96ce	Enable XPU for Inductor MM Triton Kernel Benchmark (#148237 ) #147620 enabled `force_shape_pad` for triton kernel benchmark. Intel GPU supports this scenario. Hence, we need to enable the case in this PR. Otherwise, there would be a test case regression for Intel GPU as #147620 has been landed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148237 Approved by: https://github.com/jansel	2025-03-03 10:09:06 +00:00
PyTorch MergeBot	608377d341	Revert "[import][inductor] Simplify grid handling (#147583 )" This reverts commit b59776d8572a56e2d2366174eac11015b1776f1e. Reverted https://github.com/pytorch/pytorch/pull/147583 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147583#issuecomment-2693016036))	2025-03-03 00:49:32 +00:00
Jason Ansel	b59776d857	[import][inductor] Simplify grid handling (#147583 ) Before this PR, calling a triton kernel would look like: ```py kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0) ``` where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg: ```py kernel.run(a, b, xnumel, stream=stream0) ``` instead now the grid computation is included in the kernel launcher, with something like: ```py def launcher(in_ptr0, out_ptr0, xnumel, stream): grid_0 = ((xnumel + 1023) >> 10) grid_1 = 1 grid_2 = 1 runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel) ``` This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`. It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid. This unification allows this PR to be a net deletion of code. Note the attached diff contains some minor fbcode-only changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147583 Approved by: https://github.com/eellison, https://github.com/shunting314	2025-03-02 07:31:07 +00:00
xinan.lin	762724f3d0	[Break XPU][Inductor] Generalize device-bias code and fix test_graph_partition for XPU (#148178 ) This PR generalizes the device-bias code introduced by #147038 and aligns the behavior between XPU and CUDA on the add + mm + pointwise pattern (for XPU, from addmm + pointwise to mm + fused_add_pointwise), which fixes the failing test case `test_graph_partition` on XPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148178 Approved by: https://github.com/benjaminglass1, https://github.com/jansel, https://github.com/EikanWang ghstack dependencies: #148155	2025-03-01 10:59:55 +00:00
iupaikov-amd	6061664266	Enabled force_shape_pad for triton tests in test_kernel_benchmark (#147620 ) During ROCm runs we naturally have those tests show that padding path will be slower for our archs and the pad_mm chooses to opt out of padding thus failing those tests. Reasoning for this is per my understanding those tests don't check IF the operation should be padded in the first place, but HOW is it padded and if it's done in a correct way. More than that the tests shouldn't really be hardware dependent or have some condition for them. Similar PR for reference: https://github.com/pytorch/pytorch/pull/141768 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147620 Approved by: https://github.com/jeffdaily, https://github.com/chenyang78, https://github.com/shunting314	2025-02-25 18:06:48 +00:00
xinan.lin	934eaa503f	[Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266 ) This PR aims to add the functionality support of max-autotune for XPU. The current triton templates and configurations are not well optimized for XPU, so the performance is not ready yet. Also, the `mm_plus_mm` template has accuracy issues in some cases. We will address these issues in the next PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143266 Approved by: https://github.com/EikanWang, https://github.com/jansel	2024-12-30 23:51:17 +00:00
PyTorch MergeBot	844e6108f6	Revert "[Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266 )" This reverts commit ad750ae32079020f51f9b7d01237f3ecfa83b6ff. Reverted https://github.com/pytorch/pytorch/pull/143266 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing some tests in trunk ([comment](https://github.com/pytorch/pytorch/pull/143266#issuecomment-2561303786))	2024-12-24 17:22:57 +00:00
Iurii Paikov	dbbc81cb34	Enabled force_shape_pad for test_pad_mm and test_slice_mm_bandwidth_computation (#141768 ) Some tests fail for ROCm build on navi arch because of this check: `f83361b274/torch/_inductor/fx_passes/pad_mm.py (L211)` There is no need to determine if mm is compute bound for most of the padding tests since they don't specifically test compute bound behavior. We don't have enough empirical data to fine tune this check for AMD gpus yet. I propose to force the shape padding for the tests that we had trouble with to avoid this unnecessary logic path. Please correct me if I didn't add other tests that can potentially fail with this issue or if I added a test that is dependent on logic below the `force_shape_pad` check here: `f83361b274/torch/_inductor/fx_passes/pad_mm.py (L444)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141768 Approved by: https://github.com/jeffdaily	2024-12-24 11:03:39 +00:00
xinan.lin	ad750ae320	[Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266 ) This PR aims to add the functionality support of max-autotune for XPU. The current triton templates and configurations are not well optimized for XPU, so the performance is not ready yet. Also, the `mm_plus_mm` template has accuracy issues in some cases. We will address these issues in the next PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143266 Approved by: https://github.com/EikanWang, https://github.com/jansel	2024-12-24 05:42:36 +00:00
Tom Ritchford	d8c8ba2440	Fix unused Python variables in test/[e-z]* (#136964 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964 Approved by: https://github.com/justinchuby, https://github.com/albanD	2024-12-18 23:02:30 +00:00
Nikita Shulga	38bbe37187	Enable CI on SM89 (#140305 ) Using EC2 G6 instance, based on NVIDIA L4, added to scale config in https://github.com/pytorch/test-infra/pull/5376 To enable more balanced sharding, had to push `148ae19935` Added `@xfailIfSM89` to the following tests: - test_fp8_pattern_2 - test_original_aten_preserved_split_addmm - test_sparse_semi_structured_scaled_mm - test_sparse_semi_structured_scaled_mm_fp8 - test_sparse_fp8fp8_mm Increased tolerance to 2e-4 for `RNNTest.BidirectionalMultilayerGRU_CPU_vs_CUDA` Skipped the following inductor tests (which either OOM flakily or time out): - test_reduction_fn_std_float64 - test_reduction_fn_var_mean_float64 - test_multi_output_unbacked_custom_op Pull Request resolved: https://github.com/pytorch/pytorch/pull/140305 Approved by: https://github.com/wdvr, https://github.com/ZainRizvi	2024-12-03 04:49:46 +00:00
Sam Larsen	d8b606ecb5	[fx graph cache] Support freezing with FX graph caching (#136505 ) Summary: The main changes to support freezing are: 1) When pickling constant tensors as part of the cache key calculation: If freezing has not been applied, then keep the existing behavior (pickle the metadata and values). If freezing has been applied, then pickle the values if the constant will be inlined; otherwise, consider only the metadata. 2) If freezing has been applied, modify what we store in the cache: Instead of storing the constant attributes in the cache entry, store the _names_ of the constants, and then grab those constants from the GraphModule when we need to attach the attributes to a newly-loaded Python module. Since the cache lookup path loads the Python module, this bullet means we need to thread through a GraphModule argument in several places. 3) Since this feature means that we may need to reload the same Python module path more than once (but attach different constant attributes), I changed PyCodeCache.load_by_key_path to not store an in-memory map of path to module (since there may be more than one). I don't _think_ this will have any effect on performance, however. It's unclear why we were using an in-memory cache here anyway, since this function should only be called once for each module needed to be loaded. 4) Several tests were removing on-disk PyCodeCache artifacts by iterating over the modules. I made this more straightforward by implementing a cache_clear method that removes the on-disk artifacts. Arguably, this should have been the implementation all along. Differential Revision: [D63542170](https://our.internmc.facebook.com/intern/diff/D63542170) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136505 Approved by: https://github.com/eellison	2024-11-01 18:29:29 +00:00
Ruben Rodriguez Buchillon	b1b6816e05	[testing] reenable kernel_benchmark.py tests (#136876 ) Summary: # Why We want this to run internally # What - fix python path issue on the test - reenable the test # Background (copied from similar issue resolved earlier) It appears that the parent process does not pass the entire path down to the child process. Namely, if there is some setup that makes the sys.path effectively look different than, say, PYTHONPATH or something like this, the child will not inherit this setup. To avoid needing to keep track of specific setups, we pass the effective `sys.path` from the parent to the child through the PYTHONPATH env variable Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:kernel_benchmark Differential Revision: D63498897 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136876 Approved by: https://github.com/henrylhtsang	2024-10-01 17:16:21 +00:00
Ahmad Sarvmeily	9a998d98f1	Fix edge case in inductor triton clean script (#130837 ) The regex in the script is too restrictive, as it excludes examples with parentheses in args, like the following: ``` triton_poi_fused_add_0.run(arg0_1.item(), arg1_1.item(), buf0, 1, grid=grid(1), stream=streamNone) ^ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130837 Approved by: https://github.com/Chillee	2024-08-19 23:46:11 +00:00
Peter Bell	2784b3f1b7	[inductor] Fix split-scan interaction with multi-kernel (#131044 ) This fixes a couple errors that come up when multi-kernel is used with split-scan. 1. The split-scan was being marked as a persistent kernel, which allowed a multi-kernel to be created but this isn't supported. Fix is to never mark split-scan as persistent. 2. Benchmark codegen was not handling WorkspaceArg, and would raise a KeyError during codegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131044 Approved by: https://github.com/shunting314	2024-07-25 11:36:36 +00:00
James Wu	63d7ffe121	Retry of D58015187 Move AsyncCompile to a different file (#127691 ) Summary: This is a retry of https://github.com/pytorch/pytorch/pull/127545/files and D58015187, fixing the internal test that also imported codecache Test Plan: Same tests as CI in github, plus sandcastle for internal unit tests should pass now Differential Revision: D58054611 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127691 Approved by: https://github.com/oulgen	2024-06-03 15:29:41 +00:00
PyTorch MergeBot	22f392ba40	Revert "[easy?] Move AsyncCompile to a different file (#127235 )" This reverts commit f58fc16e8f059232f452a333f32e14ff681e12af. Reverted https://github.com/pytorch/pytorch/pull/127235 on behalf of https://github.com/izaitsevfb due to breaking internal tests, see [D58015187](https://www.internalfb.com/diff/D58015187) ([comment](https://github.com/pytorch/pytorch/pull/127235#issuecomment-2143518610))	2024-06-01 17:16:16 +00:00
PyTorch MergeBot	d49dc8f4b8	Revert "Add noqa to prevent lint warnings (#127545 )" This reverts commit f9937afd4f87fbb4844642ae2f587b13b5caa08c. Reverted https://github.com/pytorch/pytorch/pull/127545 on behalf of https://github.com/izaitsevfb due to reverting to unblock the revert of #127545 ([comment](https://github.com/pytorch/pytorch/pull/127545#issuecomment-2143517711))	2024-06-01 17:12:46 +00:00
James Wu	f9937afd4f	Add noqa to prevent lint warnings (#127545 ) This is to prevent the import from being removed due to unused import. What's annoying about this is that it's not consistently running: lintrunner doesn't warn me on this PR even without the comment, but it does on other PRs Pull Request resolved: https://github.com/pytorch/pytorch/pull/127545 Approved by: https://github.com/masnesral	2024-05-30 17:56:49 +00:00
James Wu	f58fc16e8f	[easy?] Move AsyncCompile to a different file (#127235 ) By moving AsyncCompile to its own file, we can import codecache without running the side effects of AsyncCompile. This will be important for AOTAutogradCaching, where we want to share some implementation details with codecache.py without spawning new processes. To conservatively maintain the same behavior elsewhere, every time we import codecache, I've added an import to torch._inductor.async_compile (except in autograd_cache.py, where the explicit goal is to not do this) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127235 Approved by: https://github.com/aorenste, https://github.com/oulgen, https://github.com/masnesral	2024-05-30 02:43:02 +00:00
PyTorch MergeBot	d49abf039a	Revert "update pointwise cat heuristics (#125772 )" This reverts commit d19d932183f265f5108e6cc30f514d88060a67d7. Reverted https://github.com/pytorch/pytorch/pull/125772 on behalf of https://github.com/izaitsevfb due to Fails numerical stability test for aps model, see D57215900 ([comment](https://github.com/pytorch/pytorch/pull/125772#issuecomment-2105932504))	2024-05-11 15:27:44 +00:00
eellison	d19d932183	update pointwise cat heuristics (#125772 ) Fix for https://github.com/pytorch/pytorch/issues/122871. There are two cases where we emit pointwise cat: - fusing into a pointwise use - horizontally fusing copy_ kernels The regression I looked into previously was due to being overly aggressive in the latter case. I've updated the logic there so that we only emit the horizontal fusion in the case that we would have to emit separate copy_ kernels anyway. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125772 Approved by: https://github.com/Chillee	2024-05-10 01:07:39 +00:00
chilli	fd816bf630	Add script for removing Inductor dependencies from Inductor generated code (#125811 ) Usage: ```python TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 python foo.py TORCHINDUCTOR_DUMP_LAUNCH_PARAMS=1 python /tmp/torchinductor_chilli/js/cjsbczkf6fj36nhaxxypll6cy4fmwmkoauklrgrvuody2mn7oeef.py python remove_inductor_deps.py /tmp/torchinductor_chilli/js/cjsbczkf6fj36nhaxxypll6cy4fmwmkoauklrgrvuody2mn7oeef.py ``` Example generated code: https://pastebin.com/m6Ae8heB Pull Request resolved: https://github.com/pytorch/pytorch/pull/125811 Approved by: https://github.com/chenyang78	2024-05-10 00:00:25 +00:00
xinan.lin	78a1693266	[Inductor Intel GPU backend Upstream] Reuse inductor test for Intel GPU (PART 1) (#122866 ) Reuse Inductor test suite for Intel GPU including: test_torchinductor.py test_triton_wrapper.py test_metrics.py test_codecache.py test_codegen_triton.py test_kernel_benchmark.py test_triton_heuristics.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/122866 Approved by: https://github.com/EikanWang, https://github.com/jansel	2024-05-09 00:51:35 +00:00
Sam Larsen	4cd503c1f3	Enable FX graph cache for a batch of inductor tests (#121696 ) Summary: Get more FX graph cache coverage by enabling it for these unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/121696 Approved by: https://github.com/eellison	2024-03-14 03:39:59 +00:00
Shunting Zhang	800e9acd43	[inductor] fix bandwidth extimation for StarDep (#120266 ) A lot of HF models fail when inductor_config.benchmark_kernel is enabled. The reason is that the bandwidth estimation code assumes every dependency has an index, but StarDep does not. An exception is raised when StarDep.index is accessed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120266 Approved by: https://github.com/eellison, https://github.com/jansel	2024-02-21 03:33:45 +00:00
Yang Chen	61b572ed56	[inductor] more accurate throughput calculations for kernel benchmarks (#118858 ) Our current throughput calculations for kernel benchmarks have some issues, particularly when we slice inputs in the kernel. In such cases, we count the original inputs as part of the memory traffic passed across the kernel. This is incorrect because it may result in a much larger throughput calculation, which can even exceed the theoretical bandwidth. Instead, we should only count the size of the "slices" that contribute to the actual memory traffic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118858 Approved by: https://github.com/jansel	2024-02-01 21:42:14 +00:00
Yang Chen	1565d58ad9	[inductor] correctly generate grid info for benchmark_kernel (#118202 ) Previously, we generated the grid argument with tree.numel for a benchmark TritonKernel. This was not correct, because it didn't match the launch config used for profiling and running. This PR fixed the issue by emitting the grid value computed by the kernel's grid_fn, which is used by the profiler and the kernel's runner. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118202 Approved by: https://github.com/shunting314, https://github.com/jansel	2024-01-25 20:37:44 +00:00
xinan.lin	e60bc502b4	[Inductor Intel GPU backend Upstream] Generalize part of Inductor test case (#117513 ) Following the RFC https://github.com/pytorch/pytorch/issues/114856, before upstreaming the Intel XPU Inductor backend, we need to prepare the corresponding Inductor test cases. This PR aims to generalize part of the Inductor test cases so that a new GPU backend can reuse the existing test cases with minimal code change. This Pull Request preferentially generalizes the test cases that cover Inductor's base functionality as follows: - test/inductor/test_codecache.py - test/inductor/test_codegen_triton.py - test/inductor/test_kernel_benchmark.py - test/inductor/test_torchinductor.py - test/inductor/test_torchinductor_codegen_dynamic_shapes.py - test/inductor/test_torchinductor_dynamic_shapes.py - test/inductor/test_torchinductor_opinfo.py - test/inductor/test_triton_heuristics.py - test/inductor/test_triton_wrapper.py Feature request: https://github.com/pytorch/pytorch/issues/114856 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117513 Approved by: https://github.com/EikanWang, https://github.com/jansel	2024-01-18 08:26:21 +00:00
Shunting Zhang	68a8d74f3f	[inductur] benchmark epilogue fused matmul template (#114809 ) We want to be able to benchmark epilogue-fused triton matmul kernels for a couple of reasons: 1. @eellison found that certain TB models (resnet50, resnet152, moco) sometimes fail in maxautotune mode on the dashboard. The issue is quite hard to repro due to flakiness. It is only triggered when a certain triton config for a certain epilogue-fused kernel gets picked (disabling epilogue fusion bypasses the issue). It would be nice to have a runnable script that directly runs that kernel to ease further debugging. 2. This is a necessary piece for doing benchmark fusion for triton matmul kernels. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler for this Example runnable kernel script: https://gist.github.com/shunting314/00bdbc1b6b46bfa73d1389d8f40cd669 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114809 Approved by: https://github.com/eellison	2023-12-01 21:05:01 +00:00
PyTorch MergeBot	1e60174891	Revert "[dynamo] Add run_inductor_tests entrypoint (#113278 )" This reverts commit b00311ce9e430cf1b98d2103e21ed2179450a424. Reverted https://github.com/pytorch/pytorch/pull/113278 on behalf of https://github.com/huydhn due to Sorry for reverting your stack, but it is failing to list test internally with buck2 ([comment](https://github.com/pytorch/pytorch/pull/113278#issuecomment-1811646325))	2023-11-15 01:19:48 +00:00
Jason Ansel	b00311ce9e	[dynamo] Add run_inductor_tests entrypoint (#113278 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113278 Approved by: https://github.com/yanboliang	2023-11-11 08:54:43 +00:00
Jason Ansel	64f326097b	[dynamo] Refactor handling of state in context managers (#112939 ) The prior handling was rather buggy... Pull Request resolved: https://github.com/pytorch/pytorch/pull/112939 Approved by: https://github.com/voznesenskym, https://github.com/yanboliang ghstack dependencies: #112897, #112898, #112920, #112899	2023-11-05 03:10:30 +00:00
Aaron Gokaslan	6d43c89f37	[BE]: Update Ruff to 0.0.280 (#105724 ) Removes unused loop values in Python dictionary iteration. Automated fix from Ruff master Pull Request resolved: https://github.com/pytorch/pytorch/pull/105724 Approved by: https://github.com/ezyang, https://github.com/janeyx99	2023-07-22 23:03:34 +00:00
Jack Taylor	ede1965f5d	Enable additional inductor test suites on ROCm (#102270 ) Enables additional inductor UTs on ROCm, following from https://github.com/pytorch/pytorch/pull/100981 Pull Request resolved: https://github.com/pytorch/pytorch/pull/102270 Approved by: https://github.com/malfet	2023-06-22 00:36:35 +00:00
Jack Taylor	187eb7ca88	Enable default workflow PyT 2.0 UTs on ROCm stack (#100981 ) PR to enable default workflow PyTorch 2.0 unit tests for the ROCm stack. - Enables all the dynamo unit test suites - Enables some of the inductor unit test suites - `test_config` - `test_cpp_wrapper` (cpu only) - `test_minifier` - `test_standalone_compile` - `test_torchinductor_dynamic_shapes` - `test_torchinductor_opinfo` - `test_torchinductor` - `test_triton_wrapper` - Introduces TEST_WITH_ROCM conditions for unit test skip/fail dictionaries in test_torchinductor_dynamic_shapes.py and test_torchinductor_opinfo.py Note this PR follows on from the discussions for the previous UT enablement PR https://github.com/pytorch/pytorch/pull/97988, we have opted to only enable a few inductor suites at the moment to ease the upstreaming effort as these files are changing very quickly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/100981 Approved by: https://github.com/jithunnair-amd, https://github.com/malfet	2023-05-15 23:45:04 +00:00
Shunting Zhang	e1f44ee3b3	[inductor] correctly setup constant in the wrapper (#97571 ) V.graph.constants such as seed_cuda_0 are not handled properly in the wrapper. We recently moved the code that initializes constants from global scope into a function. As a result, assigning to seed_cuda_0 creates a new local variable rather than setting the global variable. Add 'global var_name' lines to maintain the same behavior as before. Test: Run the forward graph for nvidia_deeprecommender's training run. It previously failed and now passes with the fix. Thanks @ngimel for reporting the issue with a repro and @Chillee for pointing out the root cause. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97571 Approved by: https://github.com/ngimel	2023-03-28 03:10:53 +00:00
Shunting Zhang	13398d8b95	[inductor] improve bandwidth computation (#97057 ) When we compute bandwidth for a kernel, we should double the memory usage for inplace arguments, since we need to read them once and write them once. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97057 Approved by: https://github.com/Chillee	2023-03-20 20:30:46 +00:00
Shunting Zhang	9aa216cb46	reland #96249 : [inductor] show more kernel specific metrics in the benchmark result (#96461 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/96461 Approved by: https://github.com/ngimel	2023-03-10 06:18:21 +00:00
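
Note on the bandwidth commits (13398d8b95, 61b572ed56): the two rules described in those messages, counting an inplace argument twice and counting only the bytes of the slice a kernel actually touches, can be illustrated with a minimal sketch. This is not Inductor's implementation; `KernelArg`, `estimate_gb_moved`, and `bandwidth_gbps` are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class KernelArg:
    """Hypothetical description of one kernel argument (not an Inductor class)."""
    name: str
    bytes_accessed: int    # bytes the kernel actually touches (the slice, not the full buffer)
    inplace: bool = False  # True if the argument is both read and written


def estimate_gb_moved(args: Iterable[KernelArg]) -> float:
    """Estimate memory traffic in GB, counting inplace arguments twice."""
    total_bytes = 0
    for arg in args:
        factor = 2 if arg.inplace else 1  # one read plus one write for inplace args
        total_bytes += factor * arg.bytes_accessed
    return total_bytes / 1e9


def bandwidth_gbps(args: Iterable[KernelArg], elapsed_ms: float) -> float:
    """Achieved bandwidth in GB/s for a kernel that ran for elapsed_ms milliseconds."""
    return estimate_gb_moved(args) / (elapsed_ms / 1e3)


example_args = [
    KernelArg("in_ptr0", 4_000_000),
    KernelArg("in_out_ptr0", 4_000_000, inplace=True),
]
print(f"{bandwidth_gbps(example_args, elapsed_ms=1.0):.2f} GB/s")
```

With these made-up numbers the estimate is 12 MB of traffic in 1 ms, or roughly 12 GB/s. Counting the full underlying buffer instead of `bytes_accessed` would inflate the figure, which is the overestimate that 61b572ed56 describes.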

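Note on commit b1b6816e05: the message describes passing the parent's effective `sys.path` to a child process through the `PYTHONPATH` environment variable. A minimal sketch of that idea follows, using only the standard library. The helper name `run_child_with_parent_path` is hypothetical and the actual test code may differ.

```python
import os
import subprocess
import sys


def run_child_with_parent_path(code: str) -> str:
    """Run `code` in a child interpreter that inherits the parent's sys.path.

    The parent's effective sys.path is joined with the platform path
    separator and handed to the child via PYTHONPATH.
    """
    env = os.environ.copy()
    # Skip empty entries (an empty string in sys.path means the current directory).
    env["PYTHONPATH"] = os.pathsep.join(p for p in sys.path if p)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # The child reports how many sys.path entries it sees.
    print(run_child_with_parent_path("import sys; print(len(sys.path))"))
```

The point of the design is that the child does not need to know which setup step modified the parent's path. Whatever the parent can import from, the child can too.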
48 Commits
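
Note on the "Simplify grid handling" commits (b59776d857, 8d08b49015, b040dc3a53): the launcher shown in those messages computes `grid_0 = ((xnumel + 1023) >> 10)`. That expression is a ceiling division of `xnumel` by 1024 written with a shift. The sketch below checks the equivalence in plain Python. The function names and the default block size of 1024 are assumptions for illustration, not Inductor code.

```python
def grid_ceildiv(xnumel: int, xblock: int = 1024) -> tuple[int, int, int]:
    """Launch grid as a ceiling division of xnumel by xblock."""
    grid_0 = (xnumel + xblock - 1) // xblock
    return (grid_0, 1, 1)


def grid_shift(xnumel: int) -> tuple[int, int, int]:
    """Same grid for a block size of 1024, using the shift form from the commit message."""
    grid_0 = (xnumel + 1023) >> 10  # 1024 == 2**10, so >> 10 divides by 1024
    return (grid_0, 1, 1)


# The two forms agree for every non-negative element count checked here.
for n in (0, 1, 1023, 1024, 1025, 4096, 1_000_000):
    assert grid_ceildiv(n) == grid_shift(n)
```

Because the block size is fixed per `triton.Config`, a launcher specialized for that config can inline the constant and the shift instead of calling a generic grid callable on every launch.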