[inductor] Expand use of generic benchmark function (#164938)

Use the more generic `Benchmarker.benchmark` function so that other devices supporting the required functionality can be benchmarked. For example, prologue and epilogue fusion can be benchmarked for Triton CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164938
Approved by: https://github.com/nmacchioni, https://github.com/eellison
2025-11-04 16:04:58 +08:00 · 2025-10-15 09:18:24 +00:00
parent 0c14f55de6
commit 5c583e2573
10 changed files with 103 additions and 45 deletions
--- a/torch/_inductor/select_algorithm.py
+++ b/torch/_inductor/select_algorithm.py
@@ -2671,8 +2671,10 @@ class AlgorithmSelectorCache(PersistentCache):

        # Templates selected with input_gen_fns require specific input data to avoid IMA
        # Passing custom input gen fns to benchmark_fusion NYI, so skip deferred template selection
-        # TODO(jgong5): support multi-template on CPU
-        if input_gen_fns is not None or layout.device.type == "cpu":
+        # TODO(jgong5): support multi-template on CPU C++ backend
+        if input_gen_fns is not None or (
+            layout.device.type == "cpu" and config.cpu_backend != "triton"
+        ):
            return_multi_template = False

        # TODO - assert that we have not mutating kernels here
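
The updated condition in the hunk above can be read as a standalone predicate. The following is a minimal sketch, not code from the PR: `allows_multi_template` is a hypothetical helper, and it takes `device_type` and `cpu_backend` as plain strings rather than reading `layout.device.type` and `config.cpu_backend`. The value `"cpp"` used in the usage lines is assumed to name the C++ CPU backend.

```python
from typing import Optional


def allows_multi_template(
    input_gen_fns: Optional[dict],
    device_type: str,
    cpu_backend: str,
) -> bool:
    """Hypothetical helper: True when the hunk would NOT force
    return_multi_template to False."""
    # Custom input generators cannot yet be passed to benchmark_fusion,
    # so deferred template selection is skipped whenever they are present.
    if input_gen_fns is not None:
        return False
    # After this change, CPU is excluded only when it is not using Triton.
    if device_type == "cpu" and cpu_backend != "triton":
        return False
    return True


# Usage: Triton CPU stays eligible, the assumed "cpp" backend does not.
triton_cpu_ok = allows_multi_template(None, "cpu", "triton")
cpp_cpu_ok = allows_multi_template(None, "cpu", "cpp")
```

Before this change the second check was simply `device_type == "cpu"`, so every CPU layout disabled multi-template selection regardless of backend.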