This adds logging that will mark any invocation of a matmul for a particular input shapes, and record every template configs performance on it. Then, we can parse that into a script which will minimize the total mm execution time given N allowed templates. And in future, other experiments..
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126560
Approved by: https://github.com/nmacchioni, https://github.com/jansel