[Triton] [Inductor] Restrict subprocess autotuning to just Triton (#162688)

Summary: Restricts subprocess benchmarking to only `TritonTemplateCaller` choices, which is what the underlying `target` method expects. The previous behavior triggered a bug with large K shapes because the decompose-K choice is a `SubgraphChoiceCaller`.

Test Plan:
mm autotuning with a large k and `TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1`

Rollback Plan:

Differential Revision: D82181924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162688
Approved by: https://github.com/PaulZhang12, https://github.com/eellison, https://github.com/mlazos
Nick Riasanovsky
2025-09-11 22:17:57 +00:00
committed by PyTorch MergeBot
parent 468c1f9e9d
commit 082d3dd9d5

@@ -3032,8 +3032,13 @@ class AlgorithmSelectorCache(PersistentCache):
 # only benchmark triton kernel in sub process for now.
 # ATen/Extern kernel are still benchmarked in the current process.
-extern = [c for c in choices if isinstance(c, ExternKernelCaller)]
-triton = [c for c in choices if not isinstance(c, ExternKernelCaller)]
+extern = []
+triton = []
+for c in choices:
+    if isinstance(c, TritonTemplateCaller):
+        triton.append(c)
+    else:
+        extern.append(c)
 timings = cls.benchmark_in_current_process(
     extern, input_nodes, layout, input_gen_fns, hint_override=hint_override
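
The effect of the change in the hunk above can be illustrated with a small standalone sketch. The classes below are hypothetical stubs that only share the names of the real Inductor classes; they are modeled as unrelated siblings of a common base, which is an assumption made for illustration and not a statement about the actual class hierarchy. The helper names `partition_old` and `partition_new` are likewise invented here.

```python
# Hypothetical stand-ins for the Inductor choice classes (assumption:
# the three kinds are unrelated siblings; the real hierarchy may differ).
class ChoiceCaller:
    def __init__(self, name):
        self.name = name


class TritonTemplateCaller(ChoiceCaller):
    pass


class ExternKernelCaller(ChoiceCaller):
    pass


class SubgraphChoiceCaller(ChoiceCaller):
    pass


def partition_old(choices):
    # Old filter: anything that is not an extern kernel is treated as Triton.
    extern = [c for c in choices if isinstance(c, ExternKernelCaller)]
    triton = [c for c in choices if not isinstance(c, ExternKernelCaller)]
    return extern, triton


def partition_new(choices):
    # New filter: only TritonTemplateCaller goes to the subprocess bucket;
    # everything else is benchmarked in the current process.
    extern, triton = [], []
    for c in choices:
        if isinstance(c, TritonTemplateCaller):
            triton.append(c)
        else:
            extern.append(c)
    return extern, triton


choices = [
    TritonTemplateCaller("triton_mm"),
    ExternKernelCaller("aten_mm"),
    SubgraphChoiceCaller("decompose_k"),
]

old_extern, old_triton = partition_old(choices)
new_extern, new_triton = partition_new(choices)

print("old subprocess bucket:", [c.name for c in old_triton])
print("new subprocess bucket:", [c.name for c in new_triton])
```

Under these assumptions, the old filter places the subgraph choice in the subprocess bucket alongside the Triton template, while the new filter keeps it with the choices benchmarked in the current process.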