pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Blaine Burton Rister	520ba556cd	[Inductor] Refactor "r" reduction prefix to {"r0_", "r1_"}. (#142020 ) Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. # Feature This PR changes the `RINDEX` / `"r"` symbol type to `(R0_INDEX, R1_INDEX)` and `("r0_", "r1_")`, respectively. This allows the relevant code to support 2D (often ND) reductions. Unlike the parent PR, this one does not change the tiling algorithm, so `"r1_"` is never used. However, it prepares other parts of the system to handle `"r1_"` once we start using it. This should significantly reduce the chances of hitting merge conflicts, making the parent PR much easier to land. The only change to the generated triton code is to rename `"rindex"` -> `"r0_index"`, `"RBLOCK"` -> `"R0_BLOCK"`, etc. To maintain compatibility with existing codegen, this also generates aliases to the old reduction variables like `rindex = r0_index`. If we generated 2D reductions (which this PR will not do), the aliases would be more complicated and would collapse 2D multi-indices to linear indices. See some example kernels in the parent PR. These aliases can be eliminated by the Triton compiler, and should not impact the final machine code running on the GPU. See the perf testing in the parent PR which confirms the aliases do not impact perf. # Test plan The existing CI provides good coverage. This PR modifies the expected code in a few places, renaming reduction variables from `r.` to `r0_.`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142020 Approved by: https://github.com/jansel Co-authored-by: Jason Ansel <jansel@meta.com>	2024-12-12 17:22:20 +00:00
Blaine Burton Rister	8d24eb0c94	[Inductor] Represent size_hints as a dict (#142249 ) Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. # Feature Follow up to https://github.com/pytorch/pytorch/pull/141751. Since we now represent `numels` as a dict, it's natural to extend this to `size_hints`. The latter are basically just the former rounded up to the nearest power of 2. This simplifies various heuristics such as the coordinate descent tuner. Where we previously needed to determine which index in `size_hints` corresponds to each dimension, now we can just query by prefix. This will be especially important when we enable 2D reductions, as it becomes harder to keep track of these things when we have multiple reduction dimensions. (See the previous PR for some examples.) # Test plan The existing CI provides good coverage. This PR modifies a few tests which explicitly constructed size hints. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142249 Approved by: https://github.com/jansel	2024-12-09 22:31:53 +00:00
PyTorch MergeBot	fc831f76f8	Revert "[Inductor] Represent size_hints as a dict (#142249 )" This reverts commit f870ee2cc4f3dd1babd3043b5291d54f487a2999. Reverted https://github.com/pytorch/pytorch/pull/142249 on behalf of https://github.com/blaine-rister due to would break internal tests ([comment](https://github.com/pytorch/pytorch/pull/142249#issuecomment-2524991008))	2024-12-07 07:43:51 +00:00
Blaine Burton Rister	f870ee2cc4	[Inductor] Represent size_hints as a dict (#142249 ) Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. # Feature Follow up to https://github.com/pytorch/pytorch/pull/141751. Since we now represent `numels` as a dict, it's natural to extend this to `size_hints`. The latter are basically just the former rounded up to the nearest power of 2. This simplifies various heuristics such as the coordinate descent tuner. Where we previously needed to determine which index in `size_hints` corresponds to each dimension, now we can just query by prefix. This will be especially important when we enable 2D reductions, as it becomes harder to keep track of these things when we have multiple reduction dimensions. (See the previous PR for some examples.) # Test plan The existing CI provides good coverage. This PR modifies a few tests which explicitly constructed size hints. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142249 Approved by: https://github.com/jansel	2024-12-07 06:43:05 +00:00
Aaron Orenstein	8c356ce3da	Fix lint errors in fbcode (#135614 ) Summary: Fixed a bunch of fbcode imports that happened to work but confused autodeps. After this autodeps still suggests "improvements" to TARGETS (which breaks our builds) but at least it can find all the imports. Test Plan: ``` fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/TARGETS fbcode/caffe2/test/TARGETS ``` Before: ``` ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/testing.py:229) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fbur$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export.py:87) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_serdes.py:9) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fb$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_serdes.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_retraceability.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https:$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_retraceability.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. 
See ht$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_nonstrict.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See http$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_nonstrict.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See $ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:8) when processing rule "test_export". Please make sure it's listed in the srcs parameter of an$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of anoth$ ERROR while processing caffe2/test/TARGETS: Found "//python/typeshed_internal:typeshed_internal_library" owner for "cv2" but it is protected by visibility rules: [] (from caffe2/test/test_bundled_images.py:7) when processing rule "test_bundled_$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "caffe2.test.profiler_test_cpp_thread_lib" (from caffe2/test/profiler/test_cpp_thread.py:29) when processing rule "profiler_test_cpp_thread". Please make sure it's listed in t$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_custom_ops.py:23) when processing rule "custom_ops". Please make sure it's listed in the srcs parameter of anoth$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_public_bindings.py:13) when processing rule "public_bindings". 
Please make sure it's listed in the srcs paramete$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.symbolize_tracebacks" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another $ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.gather_traceback" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another rule$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for include <torch/csrc/autograd/profiler_kineto.h> (from caffe2/test/profiler/test_cpp_thread.cpp:2) when processing profiler_test_cpp_thread_lib. Some things to try: ``` Differential Revision: D62049222 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135614 Approved by: https://github.com/oulgen, https://github.com/laithsakka	2024-09-13 02:04:34 +00:00
Xuehai Pan	134bc4fc34	[BE][Easy][12/19] enforce style for empty lines in import segments in `test/i*/` (#129763 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763 Approved by: https://github.com/jansel	2024-07-18 07:49:19 +00:00
PyTorch MergeBot	b732b52f1e	Revert "[BE][Easy][12/19] enforce style for empty lines in import segments in `test/i*/` (#129763 )" This reverts commit aecc746fccc4495313167e3a7f94210daf457e1d. Reverted https://github.com/pytorch/pytorch/pull/129763 on behalf of https://github.com/XuehaiPan due to need reland after rerunning lintrunner on main ([comment](https://github.com/pytorch/pytorch/pull/129763#issuecomment-2235736732))	2024-07-18 06:39:58 +00:00
Xuehai Pan	aecc746fcc	[BE][Easy][12/19] enforce style for empty lines in import segments in `test/i*/` (#129763 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763 Approved by: https://github.com/jansel	2024-07-18 05:13:41 +00:00
xinan.lin	cc518ebd38	[Inductor Intel GPU backend Upstream] Reuse inductor test for Intel GPU (PART 2) (#124147 ) Reuse Inductor test case for Intel GPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124147 Approved by: https://github.com/EikanWang, https://github.com/jansel	2024-06-16 08:07:05 +00:00
Matthew Hoffman	81277baa0c	Remove removed ruff rule TRY200 (#126256 ) My TOML linter is complaining that "TRY200" is not acceptable for the `tool.ruff.lint` schema. From the ruff docs: https://docs.astral.sh/ruff/rules/reraise-no-cause/ > This rule has been removed and its documentation is only available for historical reasons. > > This rule is identical to [B904](https://docs.astral.sh/ruff/rules/raise-without-from-inside-except/) which should be used instead. and we are currently explicitly ignoring B904. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126256 Approved by: https://github.com/Skylion007	2024-05-17 16:31:05 +00:00
Jason Ansel	0093735ccd	[inductor] Use compile time config values in runtime (#124561 ) This removes usage of torch._inductor.config from `torch._inductor.runtime`. Fixing two issues: 1) If configs change we should really use the compile time ones 2) In compile workers, we want to use the parent process config Pull Request resolved: https://github.com/pytorch/pytorch/pull/124561 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553, #124557, #124559, #124560, #124569	2024-04-22 18:46:40 +00:00
Jason Ansel	0cc0e60e30	[inductor] Refactor runtime files into torch._inductor.runtime (part 4) (#124559 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124559 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553, #124557	2024-04-22 18:46:29 +00:00
PyTorch MergeBot	b3d6c2fe9b	Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 4) (#124559 )" This reverts commit 9ea2a0951005c4bcb2491556a8548319c6cccfdb. Reverted https://github.com/pytorch/pytorch/pull/124559 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))	2024-04-22 18:28:05 +00:00
PyTorch MergeBot	30dec1da84	Revert "[inductor] Use compile time config values in runtime (#124561 )" This reverts commit 3af12447f85dfede191a113c052e58fa7b21a8b3. Reverted https://github.com/pytorch/pytorch/pull/124561 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124561#issuecomment-2070537634))	2024-04-22 18:24:38 +00:00
Jason Ansel	3af12447f8	[inductor] Use compile time config values in runtime (#124561 ) This removes usage of torch._inductor.config from `torch._inductor.runtime`. Fixing two issues: 1) If configs change we should really use the compile time ones 2) In compile workers, we want to use the parent process config Pull Request resolved: https://github.com/pytorch/pytorch/pull/124561 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553, #124557, #124559, #124560, #124569	2024-04-22 04:51:30 +00:00
Jason Ansel	9ea2a09510	[inductor] Refactor runtime files into torch._inductor.runtime (part 4) (#124559 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124559 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553, #124557	2024-04-22 04:51:20 +00:00
Shunting Zhang	b71423c2e4	[inductor] let coordesc tuner respect max RBLOCK (#124325 ) Fix https://github.com/pytorch/pytorch/issues/124251 . Coordesc tuner need respect max RBLOCK. When rnumel is a multiple of max-RBLOCK, inductor codegen will skip rmask. If coordesc tuner does not consider max-RBLOCK and pick a RBLOCK larger than that, we would get CUDA IMA (illegal memory access) error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124325 Approved by: https://github.com/Chillee, https://github.com/jansel	2024-04-18 02:12:35 +00:00
Sam Larsen	4cd503c1f3	Enable FX graph cache for a batch of inductor tests (#121696 ) Summary: Get more FX graph cache coverage by enabling it for these unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/121696 Approved by: https://github.com/eellison	2024-03-14 03:39:59 +00:00
PyTorch MergeBot	1e60174891	Revert "[dynamo] Add run_inductor_tests entrypoint (#113278 )" This reverts commit b00311ce9e430cf1b98d2103e21ed2179450a424. Reverted https://github.com/pytorch/pytorch/pull/113278 on behalf of https://github.com/huydhn due to Sorry for reverting your stack, but it is failing to list test internally with buck2 ([comment](https://github.com/pytorch/pytorch/pull/113278#issuecomment-1811646325))	2023-11-15 01:19:48 +00:00
Jason Ansel	b00311ce9e	[dynamo] Add run_inductor_tests entrypoint (#113278 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113278 Approved by: https://github.com/yanboliang	2023-11-11 08:54:43 +00:00
Aaron Gokaslan	cb856b08b2	[BE]: Attach cause to some exceptions and enable RUFF TRY200 (#111496 ) Did some easy fixes from enabling TRY200. Most of these seem like oversights instead of intentional. The proper way to silence intentional errors is with `from None` to note that you thought about whether it should contain the cause and decided against it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111496 Approved by: https://github.com/malfet	2023-10-19 21:56:36 +00:00
Shunting Zhang	a358a9262e	[inductur] coordesc tuner bug fix with no_x_dim kernel (#104692 ) We recently have an optimization to squash x dimension for persistent reduction kernel when we are confident that XBLOCK will always be 1. We need update the code so that coordinate descent tuner does not tune XBLOCK in this case. Test command. Fail before the fix and pass after. ``` TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --accuracy --only BertForMaskedLM --inference ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/104692 Approved by: https://github.com/jansel	2023-07-06 17:47:02 +00:00
Shunting Zhang	298ff41a38	[inductor] fix a bug in coordinate descent tuner (#104293 ) The neighbor values we try for a field can be empty in some corner cases. ``` # E.g., if XBLOCK is 1 initially and size_hint for x is also 1. # We would not try either larger or smaller XBLOCK in this case. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/104293 Approved by: https://github.com/jansel	2023-06-28 00:05:13 +00:00
Jack Taylor	ede1965f5d	Enable additional inductor test suites on ROCm (#102270 ) Enables additional inductor UTs on ROCm, following from https://github.com/pytorch/pytorch/pull/100981 Pull Request resolved: https://github.com/pytorch/pytorch/pull/102270 Approved by: https://github.com/malfet	2023-06-22 00:36:35 +00:00
Shunting Zhang	6095a22cff	[inductor] add the ability to do heavier search for coordinate descent tuning (#99403 ) When checking Meta's internal cmf10x model, I found this interesting kernel https://gist.github.com/shunting314/d4b1fc7352c840ef185c607392e21f31 . Doing coordinate descent tuning starting from the out of box tuner find sub-optimal config: a config worse than the best one max-autotuner can find. This indicates that the coordinate descent tuner does not necessarily find the optimal config. Starting point matters. I want to make the coordinate descent tuning less depend on the starting point. Also I think by improving that, the coordinate descent tuner may be more likely to find even better configs when starting from max-autotune result. There are 2 ideas. 1. currently coordinate descent tuning only considers changing one field/coordinate at a time. I add the ability to check all directions (i.e. tuning all tunable fields at the same time) after the normal coordinate descent searching does not find better choices. I'll check how that works in cmf10x 2. currently when we change a field, we only change 1 step (i.e. radius is 1). I add the ability to use a larger radius. This only affect the search in all directions and does not affect the normal coordinate descent searching workflow. Both are disabled by default. 
Here are the tests I've done: - OOB (out of the box): 0.083ms 0.003GB 38.13GB/s - MA (max autotune): 0.016ms 0.003GB 195.60GB/s - best config: XBLOCK: 4, RBLOCK: 128, num_warps: 4, num_stages: 1 Default coordinate descent: - Coordesc (coordinate descent tuner) upon OOB: 0.024ms 0.003GB 131.52GB/s ( WORSE than Max Autotune ) - best config: XBLOCK: 64, RBLOCK: 4, num_warps: 16, num_stages: 1 - Coordesc upon MA: 0.016ms 0.003GB 194.31GB/s (no further improvement upon MA) Search in all directions: (radius = 1) - Coordesc upon OOB: 0.017ms 0.003GB 184.55GB/s - best config: XBLOCK: 32, RBLOCK: 16, num_warps: 32, num_stages: 1 - IMPROVE FROM 0.024ms to 0.017ms. QUITE CLOSE TO THE ONE FIND BY MAX-AUTOTUNE - Coordesc upon MA: no further improvements upon MA Search in all directions: (radius = 2) - Coordesc upon OOB: 0.016ms 0.003GB 192.60GB/s - best config: XBLOCK: 8, RBLOCK: 16, num_warps: 8, num_stages: 1 - SLIGHTLY BETTER THAN RADIUS=1 for this kernel and on par with max-autotune - Coordesc upon MA: no further improvements upon MA Overall max-autotuner does a really good job for this kernel Pull Request resolved: https://github.com/pytorch/pytorch/pull/99403 Approved by: https://github.com/jansel	2023-06-09 09:04:55 +00:00
Jack Taylor	187eb7ca88	Enable default workflow PyT 2.0 UTs on ROCm stack (#100981 ) PR to enable default workflow PyTorch 2.0 unit tests for the ROCm stack. - Enables all the dynamo unit test suites - Enables some of the inductor unit test suites - `test_config` - `test_cpp_wrapper` (cpu only) - `test_minifier` - `test_standalone_compile` - `test_torchinductor_dynamic_shapes` - `test_torchinductor_opinfo` - `test_torchinductor` - `test_triton_wrapper` - Introduces TEST_WITH_ROCM conditions for unit test skip/fail dictionaries in test_torchinductor_dynamic_shapes.py and test_torchinductor_opinfo.py Note this PR follows on from the discussions for the previous UT enablement PR https://github.com/pytorch/pytorch/pull/97988, we have opted to only enable a few inductor suites at the moment to ease the upstreaming effort as these files are changing very quickly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/100981 Approved by: https://github.com/jithunnair-amd, https://github.com/malfet	2023-05-15 23:45:04 +00:00
Shunting Zhang	418a9fb9d8	[reland][inductor] coordinate descent tuning upon max-autotune (#99594 ) Reland https://github.com/pytorch/pytorch/pull/97203 . Pull Request resolved: https://github.com/pytorch/pytorch/pull/99594 Approved by: https://github.com/jansel	2023-04-20 19:55:52 +00:00
PyTorch MergeBot	4aedb8e116	Revert "[inductor] coordinate descent tuning upon max-autotune (#97203 )" This reverts commit 52ecc3274b1c16fcca3a3d89bd261dbc6513d6ed. Reverted https://github.com/pytorch/pytorch/pull/97203 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it breaks MacOS test in trunk	2023-04-19 02:33:02 +00:00
Shunting Zhang	52ecc3274b	[inductor] coordinate descent tuning upon max-autotune (#97203 ) Command to run max autotune baseline: ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only ${MODEL_NAME} --training --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt) ``` Command to do coordinate descent autotuning: ``` TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting_coordesc TORCHINDUCTOR_PERSISTENT_REDUCTIONS=0 TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only ${MODEL_NAME} --training --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt) ``` Explanation of the envvars show up on the command: ``` - TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 : enable coordinate descent tuning - TORCHINDUCTOR_PERSISTENT_REDUCTIONS=0 : disable persistent reduction. Need do this so we can tune RBLOCK for reductions - TORCHINDUCTOR_MAX_AUTOTUNE=1: enable max autotune - TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting_coordesc : use a separate cache dir for coordinate descent tuning. Optional. ``` Here are my experiments results for around 40 torchbench models: https://docs.google.com/spreadsheets/d/1G7i2whIf8Yu-HhN_WovNxwcE-iFDSAw4x3NK4uL4XhI/edit#gid=0 Some highlights - We improve 2.2% further upon max-autotune on average (geomean) - timm_resnest benefits most from coordinate descent tuning. There is 1.07x speedup - We have descent speedup on transformer models - BERT_pytorch: 1.056x - timm_vision_transformer: 1.04x - hf_Bert: 1.030x - For resnet models, it looks like we have less gain as model get larger. 
My guess is larger model spend more time on mm/conv, so our tuning for pointwise/reduction helps less - resnet18: 1.021x - resnet50: 1.014x - resnet152: 1.005x This kind of coordinate descent autotuning can give us 'upper bound' of the gain we can get for tuning configs for pointwise/reduction. On the other hand, by spot checking, we roughly double the compilation time compared to max-autotune. Next steps can be - we disable persistent reduction in coordinate descent autotune (it's still enabled in baseline) so we can tune RBLOCK for reduction. We can also try to use autotune to pick persistent reduction or not. - pick good config without benchmarking (e.g. Natalia mentioned checking register spill) - try the idea on matmul so we know what's the potential there. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97203 Approved by: https://github.com/ngimel	2023-04-19 00:17:10 +00:00

29 Commits
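
Several of the commits above describe coordinate descent tuning: change one config field at a time by a power-of-2 step, keep the change if it benchmarks faster, and skip a field when it has no valid neighbor values (the corner case from #104293). The following is a minimal standalone sketch of that idea, not Inductor's actual implementation. The helper names (`next_power_of_2`, `neighbor_values`, `coordinate_descent`), the `limits` dict, and the toy cost function are all hypothetical; a real tuner would benchmark a compiled kernel instead.

```python
from typing import Callable, Dict, List

Config = Dict[str, int]


def next_power_of_2(n: int) -> int:
    """Round n (>= 1) up to the nearest power of 2, as size hints are described above."""
    return 1 << (n - 1).bit_length()


def neighbor_values(value: int, limit: int, radius: int = 1) -> List[int]:
    """Power-of-2 neighbors of `value` within `radius` steps, kept inside [1, limit].

    The result can be empty, e.g. when value == 1 and limit == 1.
    """
    out: List[int] = []
    for step in range(1, radius + 1):
        bigger, smaller = value << step, value >> step
        if bigger <= limit:
            out.append(bigger)
        if smaller >= 1:
            out.append(smaller)
    return out


def coordinate_descent(
    start: Config, limits: Config, cost: Callable[[Config], float]
) -> Config:
    """Greedy search: move one field at a time while the cost keeps improving."""
    best = dict(start)
    best_cost = cost(best)
    improved = True
    while improved:
        improved = False
        for field in list(start):
            for candidate in neighbor_values(best[field], limits[field]):
                trial = {**best, field: candidate}
                trial_cost = cost(trial)
                if trial_cost < best_cost:
                    best, best_cost = trial, trial_cost
                    improved = True
    return best


def toy_cost(cfg: Config) -> float:
    """Hypothetical stand-in for a benchmark; cheapest at XBLOCK=8, RBLOCK=16."""
    log2 = lambda v: v.bit_length() - 1  # exact for powers of 2
    return abs(log2(cfg["XBLOCK"]) - 3) + abs(log2(cfg["RBLOCK"]) - 4)


# Hypothetical upper bounds, derived the way size hints are: numel rounded up.
limits = {"XBLOCK": next_power_of_2(1000), "RBLOCK": next_power_of_2(96)}
result = coordinate_descent({"XBLOCK": 64, "RBLOCK": 4}, limits, toy_cost)
```

Because the toy cost is separable across fields, the greedy one-field-at-a-time search reaches its minimum here. As #99403 notes, a real kernel's timing surface is not guaranteed to behave this way, which is why the starting point matters and why a wider search (all fields at once, larger radius) was added as an option.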