Fixes #154111
Resolves an issue during compilation with dynamic shapes where `torch._inductor.decomposition.mm` evaluates the input tensor's SymInt expression inside a for loop, so the output tensor loses its dynamic shape. The issue is limited to small (Mx1)x(1xN) matrix multiplications, and it surfaces as an explicit error with tensor subclasses such as DTensor.
The proposed fix replaces the loop with a simple product. Benchmark currently running at https://hud.pytorch.org/benchmark/compilers
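The idea behind the fix can be sketched as follows (a minimal illustration in NumPy, not the actual inductor decomposition code): for an (Mx1)x(1xN) matmul the reduction dimension has size 1, so the result is simply the broadcast (outer) product of the two tensors. A Python loop over the inner dimension forces evaluation of that symbolic size, while the broadcast form never inspects it.

```python
import numpy as np

def mm_via_loop(a, b):
    # Loop over the inner dimension -- under dynamic shapes this evaluates
    # k (a SymInt), which is what caused the reported issue.
    k = a.shape[1]
    return sum(a[:, i:i + 1] * b[i:i + 1, :] for i in range(k))

def mm_via_product(a, b):
    # Broadcast product: (M, 1) * (1, N) -> (M, N); no iteration over k,
    # so a symbolic size is never concretized.
    return a * b

a = np.arange(3.0).reshape(3, 1)
b = np.arange(4.0).reshape(1, 4)
assert np.allclose(mm_via_loop(a, b), a @ b)
assert np.allclose(mm_via_product(a, b), a @ b)
```

Both variants compute the same values; only the product form keeps the shape symbolic end to end.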
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158998
Approved by: https://github.com/jansel, https://github.com/BoyuanFeng
Summary: Run mm decomposition tests for CPU and GPU
One nit - this will suppress CPU tests on hosts that have CUDA (i.e., TEST_CUDA is True) but don't have Triton, because we don't have access to whether the test is actually for CPU or CUDA (which would require reading the device argument).
(This is a general limitation on torch.compile tests, because on CUDA they require Triton in the standard config.)
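The coarse skip condition described above can be sketched like this (names such as `TEST_CUDA` and `HAS_TRITON` are assumptions for illustration, not the exact flags used in the test suite): because the decorator is applied before the `device` argument is known, a single guard ends up skipping CPU runs on CUDA hosts that lack Triton as well.

```python
import unittest

# Assumed environment flags (hypothetical stand-ins for the real ones):
TEST_CUDA = True    # host reports a CUDA device
HAS_TRITON = False  # Triton is not installed

# The guard cannot inspect the eventual `device` parameter, so it is coarse:
# on a CUDA host without Triton it also skips the CPU variant of the test.
skip_if_no_triton = unittest.skipIf(
    TEST_CUDA and not HAS_TRITON,
    "torch.compile tests on CUDA require Triton",
)

@skip_if_no_triton
def test_mm_decomposition():
    pass

# unittest marks skipped items via the __unittest_skip__ attribute
assert getattr(test_mm_decomposition, "__unittest_skip__", False) is True
```

A finer-grained guard would need to read the test's device argument at runtime, which is exactly the limitation the nit points out.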
Test Plan: sandcastle, github
Differential Revision: D48998215
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108620
Approved by: https://github.com/bertmaher