Not yet ready to set HAS_GPU to true, but we can unskip tests that require GPU
(Noticed while running test_mps_basics.py that `test_scalar_cpu_tensor_arg` is getting skipped)
- Replace `GPU_TYPE` with `self.device` in `test_custom_op_fixed_layout_sequential`, `test_inductor_layout_optimization_input_mutations`, and `test_mutable_custom_op_fixed_layout2`; otherwise these are just GPU tests running under the `_cpu` suffixes.
- Tweak `test_tmp_not_defined_issue3` to work correctly on CPU, by defining `test_device` and `test_device_0`
- UnXFail `test_mutable_custom_op_fixed_layout2_dynamic_shapes` as it should just work on CPU
- Add a `skip_if_no_triton` decorator and decorate `test_reduction_config_limit` with it, as the test needs neither CPU nor GPU, just a Triton backend (a rough sketch follows below).
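For reference, a minimal sketch of what such a decorator could look like (hypothetical implementation, not necessarily the one added in this PR):
```
import functools
import unittest

# Hypothetical sketch of a skip_if_no_triton decorator: skip the test when no
# Triton backend is importable, independent of whether CPU or GPU is used.
def skip_if_no_triton(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            import triton  # noqa: F401
        except ImportError:
            raise unittest.SkipTest("test requires a Triton backend")
        return fn(*args, **kwargs)
    return wrapper
```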
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145156
Approved by: https://github.com/dcci, https://github.com/Skylion007, https://github.com/jansel
Summary: Instead of logging under the generic "batch_fusion" and "group_fusion" names, we log under the specific pass name, which better summarizes the hits of each pattern and makes debugging easier.
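As a rough illustration of the idea (illustrative names only, not the actual inductor logging code):
```
from collections import Counter

# Bump a counter keyed by the specific fusion pass name instead of a generic
# "batch_fusion"/"group_fusion" bucket, so per-pattern hits stay visible.
fusion_counters = Counter()

def record_fusion_hit(pass_name):
    fusion_counters[pass_name] += 1

record_fusion_hit("batch_linear")
record_fusion_hit("batch_layernorm")
record_fusion_hit("batch_linear")
print(fusion_counters.most_common())  # [('batch_linear', 2), ('batch_layernorm', 1)]
```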
Test Plan:
```
buck2 test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
```
Differential Revision: D55103303
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122245
Approved by: https://github.com/jackiexu1992
Improve the performance of inductor when searching large graphs for potential fusions.
Also adds some direct unit tests of find_independent_subset_greedy() to ensure that the rewrite didn't break behavior.
Fixes #98467
Previously find_independent_subset_greedy() was recursive and the example from the issue would cause it to blow out the stack. This changes it to be iterative and also caches some of the computed dependencies (it can't cache all of them because the caller is allowed to change the graph during the iteration).
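A rough sketch of the iterative shape of the search (generic code under stated assumptions, not the actual implementation in torch/_inductor):
```
# Greedily split `nodes` into independent subsets without recursion: a node
# joins the current subset only if it does not depend on any node already in
# it. Per-node dependencies are cached via `depends_on`, a callable assumed to
# return the full set of ancestors of a node.
def find_independent_subsets(nodes, depends_on):
    dep_cache = {}

    def deps(node):
        if node not in dep_cache:
            dep_cache[node] = depends_on(node)
        return dep_cache[node]

    remaining = list(nodes)
    while remaining:
        subset, rest, members = [], [], set()
        for node in remaining:
            if deps(node).isdisjoint(members):
                subset.append(node)
                members.add(node)
            else:
                rest.append(node)
        yield subset
        remaining = rest
```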
Fusion is still slow - but at least finishes.
After this change the example given in #98467 has the following backend timings (on one particular CPU):
eager timing: 3m:23s
aot_eager timing: 4m:12s
inductor timing: 22m:24s
Possible future work to improve this further:
1. In dynamo, limit the amount of inlining allowed before falling back to a graph break. This test ends up tracing through 483k bytecodes to generate the graph.
2. In inductor have a limit so we don't exhaustively search the graph for fusion possibilities.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118324
Approved by: https://github.com/oulgen
Don't require using it as `@requires_cuda()`; use `@requires_cuda` instead. There is no need for the partial function to be invoked many times.
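Roughly, the difference looks like this (a sketch only, not the exact code from the PR):
```
import functools
import unittest
import torch

# Before: a partial that still had to be called, so tests wrote @requires_cuda().
requires_cuda_old = functools.partial(
    unittest.skipUnless, torch.cuda.is_available(), "requires CUDA"
)

# After: bind the condition once so the name itself is a decorator and tests
# can simply write @requires_cuda.
requires_cuda = unittest.skipUnless(torch.cuda.is_available(), "requires CUDA")
```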
Split out this change from the initial large refactoring in #117741 to hopefully get merged before conflicts arise
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118281
Approved by: https://github.com/ezyang
Summary:
As discussed in D51695982, fusion may not always be beneficial. We want to let users customize the fx passes.
Some examples of the new configs:
* Use the batch_fusion config: this automatically enables the following batch fusions: batch linear, layernorm, relu, tanh, and sigmoid, plus the post-grad batch linear fusion
* Use the fine-grained `pre_grad_fusion_options` config:
```
"pre_grad_fusion_options": {
"batch_linear": {"min_fuse_set_size": 10},
"batch_linear_lhs": {},
"batch_layernorm": {"max_fuse_search_depth": 100},
"batch_tanh": {},
"batch_relu": {},
"batch_sigmoid": {}
},
```
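For example, a hedged sketch of setting this from Python before compiling, assuming the `pre_grad_fusion_options` inductor config field shown above:
```
import torch
import torch._inductor.config as inductor_config

# Enable only the listed pre-grad batch fusions, with per-pass parameters.
inductor_config.pre_grad_fusion_options = {
    "batch_linear": {"min_fuse_set_size": 10},
    "batch_linear_lhs": {},
    "batch_layernorm": {"max_fuse_search_depth": 100},
    "batch_tanh": {},
    "batch_relu": {},
    "batch_sigmoid": {},
}

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Tanh())
compiled = torch.compile(model)
out = compiled(torch.randn(4, 16))
```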
Test Plan:
with flag: f509168388
with config: f509168595
Reviewed By: frank-wei, mengluy0125
Differential Revision: D51817314
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115128
Approved by: https://github.com/mengluy0125
Summary:
We did two things:
1. We add back the batch_fusion and group_fusion flags to keep the current production model implementation.
2. We distinguish batch fusion from group fusion in the post-grad pass, since group fusion needs fbgemm.
Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
Buck UI: https://www.internalfb.com/buck2/13d152d2-5d4d-4c7a-ab88-51f8e8218942
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1125900253044737
Network: Up: 376KiB Down: 44KiB (reSessionID-c508aedc-8cc2-434a-8c17-bbe075a05562)
Jobs completed: 17. Time elapsed: 1:23.1s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D51695982
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114841
Approved by: https://github.com/jackiexu1992
Meta-internal customers need more flexible control over the execution order and parameters of these group/batch fusions. This PR provides a new inductor config so users can fine-tune and auto-tune these group/batch fusions for different models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113738
Approved by: https://github.com/xuzhao9
Summary:
Daohang reported this pattern in f469463749.
Hence, we can fuse the tanh ops that follow the same split.
Typically the pattern looks like split -> getitem 0,...,n -> tanh(getitem 0,...,n). Hence, we search for the parent node of each tanh node, which should be getitem(parent, index). If the tanh ops follow the same split node, the parent nodes of those getitem nodes should all be the same.
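A minimal, hypothetical example of the pattern being targeted (module and shapes are illustrative):
```
import torch

# Each tanh consumes a getitem of the same split, so the batch fusion can
# instead apply tanh once to the un-split tensor.
class SplitTanh(torch.nn.Module):
    def forward(self, x):
        chunks = torch.split(x, 4, dim=1)                      # split -> getitem 0..n
        return torch.cat([c.tanh() for c in chunks], dim=1)    # tanh on every getitem

compiled = torch.compile(SplitTanh())
out = compiled(torch.randn(8, 16))
```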
Test Plan:
```
[jackiexu0313@devgpu005.cln5 ~/fbsource/fbcode (c78736187)]$ buck test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
File changed: fbcode//caffe2/test/inductor/test_group_batch_fusion.py
Buck UI: https://www.internalfb.com/buck2/df87affc-d294-4663-a50d-ebb71b98070d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9570149208311124
Network: Up: 0B Down: 0B
Jobs completed: 16. Time elapsed: 1:19.9s.
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D48581140
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107881
Approved by: https://github.com/yanboliang
Summary:
After we compile the dense arch, we observe a split-linear-cat pattern. Hence, we want to use bmm fusion plus the split-cat pass to fuse the pattern into torch.baddbmm (a small illustrative sketch follows the list below).
Some explanation of why we prefer pre-grad:
1) We need to add bmm fusion before the split-cat pass (which runs in pre-grad) so that the newly added stack and unbind nodes can be removed along with the original cat/split nodes.
2) Post-grad does not support torch.stack/unbind. There is a hacky workaround, but it may not land soon.
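For illustration, a hypothetical module showing the split-linear-cat shape of the pattern (names and sizes are made up):
```
import torch

# Several small linears applied to chunks of one tensor, with their outputs
# concatenated; batching them lets the compiler express the whole thing as a
# single batched matmul (baddbmm) plus a reshape.
class SplitLinearCat(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linears = torch.nn.ModuleList(torch.nn.Linear(16, 8) for _ in range(4))

    def forward(self, x):
        chunks = torch.split(x, 16, dim=1)                       # split
        outs = [lin(c) for lin, c in zip(self.linears, chunks)]  # per-chunk linear
        return torch.cat(outs, dim=1)                            # cat

compiled = torch.compile(SplitLinearCat())
y = compiled(torch.randn(2, 64))
```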
Test Plan:
# unit test
```
[jackiexu0313@devgpu005.cln5 ~/fbsource/fbcode (f0ff3e3fc)]$ buck test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
File changed: fbcode//caffe2/test/inductor/test_group_batch_fusion.py
Buck UI: https://www.internalfb.com/buck2/189dd467-d04d-43e5-b52d-d3b8691289de
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5910974704097734
Network: Up: 0B Down: 0B
Jobs completed: 14. Time elapsed: 1:05.4s.
Tests finished: Pass 5. Fail 0. Fatal 0. Skip 0. Build failure 0
```
# local test
```
=================Single run start========================
enable split_cat_pass for control group
================latency analysis============================
latency is : 73.79508209228516 ms
=================Single run start========================
enable batch fusion for control group
enable split_cat_pass for control group
================latency analysis============================
latency is : 67.94447326660156 ms
```
# e2e test
TODO: add e2e test
Differential Revision: D48539721
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107759
Approved by: https://github.com/yanboliang
Summary:
Major changes:
* Implement a new group/batch fusion pattern searching algorithm: only fuse patterns whose nodes are within a certain depth difference of each other (i.e., locally).
* Search the FX graph in reverse order, since most ops have more inputs than outputs.
* Support fusing mm (linear backward).
* Preserve memory layout for fbgemm.gmm.
We tested on Ads models and saw consistent gains.
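A rough sketch of the depth-difference idea from the list above (a hypothetical helper, not the PR's actual code):
```
import collections
import torch

MAX_DEPTH_DIFF = 5  # illustrative constant

# Compute a depth for every FX node, then bucket fusion candidates so that
# nodes grouped together differ in depth by less than MAX_DEPTH_DIFF.
def group_candidates_by_depth(graph: torch.fx.Graph, is_candidate):
    depth = {}
    for node in graph.nodes:  # graph.nodes is topologically ordered
        depth[node] = 1 + max((depth[i] for i in node.all_input_nodes), default=0)

    buckets = collections.defaultdict(list)
    # Walk in reverse, mirroring the reverse-order search described above.
    for node in reversed(list(graph.nodes)):
        if is_candidate(node):
            buckets[depth[node] // MAX_DEPTH_DIFF].append(node)
    return list(buckets.values())
```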
Test Plan: Unit tests and integration test.
Differential Revision: D47581710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106279
Approved by: https://github.com/jansel, https://github.com/Skylion007
Summary:
This is a draft version of a group + batch fusion framework, together with the group linear fusion implementation.
In the future, it should be straightforward to add a new group/batch fusion policy by defining a class with match + fuse functions.
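For example, a hedged sketch of what such a policy class could look like (the interface and names are illustrative, not the framework's actual API):
```
import operator
import torch

# A batch-tanh policy under a match + fuse interface: `match` picks the nodes
# that belong to the group, `fuse` rewrites the matched subset of the graph.
# The sketch assumes all inputs of the matched nodes are defined before the
# first node in `subset`.
class BatchTanhFusion:
    def match(self, node: torch.fx.Node) -> bool:
        return node.op == "call_function" and node.target is torch.tanh

    def fuse(self, graph: torch.fx.Graph, subset):
        with graph.inserting_before(subset[0]):
            stacked = graph.call_function(torch.stack, ([n.args[0] for n in subset],))
            fused = graph.call_function(torch.tanh, (stacked,))
            unbound = graph.call_function(torch.unbind, (fused,))
            items = [
                graph.call_function(operator.getitem, (unbound, i))
                for i in range(len(subset))
            ]
        for node, item in zip(subset, items):
            node.replace_all_uses_with(item)
            graph.erase_node(node)
```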
Test Plan: buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
Differential Revision: D46956695
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105116
Approved by: https://github.com/jansel