49 Commits

a029675f6f More ruff SIM fixes (#164695)
This PR applies ruff `SIM` rules to more files. Most changes simplify `dict.get` calls, since `None` is already the default value.
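For example (ruff SIM910):

```python
config = {"timeout": 30}

# before: ruff SIM910 flags the redundant explicit default
value = config.get("retries", None)

# after: None is already dict.get's default
value = config.get("retries")
```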

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164695
Approved by: https://github.com/ezyang
2025-10-09 03:24:50 +00:00
f0ae3a57f6 [Optimus] Add batch dropout pattern (#162443)
Summary: We observed a dropout pattern in AFOC, so we add a new batch dropout pattern to Optimus.
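A hedged sketch of what batching independent dropouts means (illustrative names; assumes same-shaped inputs, not the pass's actual implementation):

```python
import torch
import torch.nn.functional as F

def unfused(xs):
    # n independent dropout kernels
    return [F.dropout(x, p=0.5) for x in xs]

def batched(xs):
    # one dropout kernel over the stacked inputs, then split back
    out = F.dropout(torch.stack(xs), p=0.5)
    return list(out.unbind(0))
```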

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion -- test_batch_dropout_pre_grad_fusion
```

Buck UI: https://www.internalfb.com/buck2/2c899fb5-6e8b-43eb-8fb3-b53abfbfa6d9
Test UI: https://www.internalfb.com/intern/testinfra/testrun/15762598805248688
Network: Up: 0B  Down: 0B  (reSessionID-bfbb9e6a-7e2a-425a-a027-b44282cef419)
Executing actions. Remaining     0/3                                                                                                     1.3s exec time total
Command: test.     Finished 2 local
Time elapsed: 1:22.3s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

### E2E

baseline
f791163796

proposal
f793225207

Rollback Plan:

Differential Revision: D81981264

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162443
Approved by: https://github.com/Yuzhen11, https://github.com/mlazos
2025-09-10 09:49:01 +00:00
393fecb2cc [Optimus][Unit test] clean up the unit test (#158696)
Summary: We should only patch the specific pattern(s) for each unit test.
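For example, a test can scope itself to a single pattern via `torch._inductor.config.patch` (the pattern and test names here are illustrative):

```python
import torch._inductor.config as inductor_config

# Patch only the pattern under test; all other fusions keep their defaults.
@inductor_config.patch(pre_grad_fusion_options={"batch_relu": {}})
def test_batch_relu_only(self):
    ...
```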

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
```

Buck UI: https://www.internalfb.com/buck2/f8d37674-91c4-4244-90fa-f24fc3f91e4b
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2533275088644915
Network: Up: 100KiB  Down: 233KiB  (reSessionID-92039f44-bc6f-4e78-87b1-93bca1bd1c66)
Analyzing targets. Remaining     0/296
Executing actions. Remaining     0/20196                                                                    5.8s exec time total
Command: test.     Finished 2 local, 2 cache (50% hit)                                                      4.6s exec time cached (79%)
Time elapsed: 3:55.1s
Tests finished: Pass 13. Fail 0. Fatal 0. Skip 0. Build failure 0

Rollback Plan:

Differential Revision: D78598127

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158696
Approved by: https://github.com/Skylion007, https://github.com/masnesral
2025-07-21 18:05:09 +00:00
17687eb792 [BE][4/6] fix typos in test/ (test/inductor/) (#157638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157638
Approved by: https://github.com/yewentao256, https://github.com/jansel
2025-07-06 06:34:25 +00:00
eb2af14f8e [PT2][partitioners] Add aten.split to view_ops list [relanding #155424] (#155943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155943
Approved by: https://github.com/ShatianWang
2025-06-16 20:42:54 +00:00
8372d0986a Revert "[PT2][partitioners] Add aten.split to view_ops list (#155424)"
This reverts commit e1db10e05aa720aef1989773adcf48f311bcf920.

Reverted https://github.com/pytorch/pytorch/pull/155424 on behalf of https://github.com/clee2000 due to I think this broke inductor/test_cpu_repro.py::CPUReproTests::test_transpose_with_norm [GH job link](https://github.com/pytorch/pytorch/actions/runs/15596830833/job/43931044625) [HUD commit link](e1db10e05a) but idk how, reverting to see if it fixes the problem ([comment](https://github.com/pytorch/pytorch/pull/155424#issuecomment-2964717706))
2025-06-12 01:38:34 +00:00
e1db10e05a [PT2][partitioners] Add aten.split to view_ops list (#155424)
Summary: Add `aten.split` to view_ops list in partitioners.py

Test Plan:
na

Rollback Plan:

Differential Revision: D76011951

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155424
Approved by: https://github.com/xuanzhang816
2025-06-11 22:12:13 +00:00
492f3fd5cf replace usages of upload_graph in inductor with tlparse (v2) (#148720)
Reland of https://github.com/pytorch/pytorch/pull/148703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148720
Approved by: https://github.com/mengluy0125
2025-03-10 22:47:58 +00:00
6a985d8b2e Make inductor_utils.requires_gpu accept MPS (#145156)
Not yet ready to set HAS_GPU to true, but we can unskip tests that require a GPU
(Noticed while running test_mps_basics.py that `test_scalar_cpu_tensor_arg` is getting skipped)

- Replace `GPU_TYPE` with `self.device` in `test_custom_op_fixed_layout_sequential`, `test_inductor_layout_optimization_input_mutations`, and `test_mutable_custom_op_fixed_layout2`; otherwise the GPU tests just run with `_cpu` suffixes.
- Tweak `test_tmp_not_defined_issue3` to work correctly on CPU, by defining `test_device` and `test_device_0`
- UnXFail `test_mutable_custom_op_fixed_layout2_dynamic_shapes` as it should just work on CPU
- Add `skip_if_no_triton` decorator and decorate `test_reduction_config_limit` with it, as it does not need CPU nor GPU, but rather a triton backend.
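A minimal sketch of what such a decorator could look like (the actual helper lives in the shared test utilities; this definition is an assumption):

```python
import unittest
from torch.utils._triton import has_triton

# Skip when no Triton backend is available, independent of CPU/GPU.
skip_if_no_triton = unittest.skipUnless(has_triton(), "requires a Triton backend")
```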

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145156
Approved by: https://github.com/dcci, https://github.com/Skylion007, https://github.com/jansel
2025-02-06 01:14:36 +00:00
99dbc5b0e2 PEP585 update - test (#145176)
See #145101 for details.
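For reference, PEP 585 makes the builtin collection types generic, so the `typing` aliases can be dropped:

```python
from typing import Dict, List  # before: typing aliases

def f_old(xs: List[int]) -> Dict[str, int]:
    return {"n": len(xs)}

def f_new(xs: list[int]) -> dict[str, int]:  # after: PEP 585 builtins
    return {"n": len(xs)}
```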

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145176
Approved by: https://github.com/bobrenjc93
2025-01-22 04:48:28 +00:00
d8c8ba2440 Fix unused Python variables in test/[e-z]* (#136964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964
Approved by: https://github.com/justinchuby, https://github.com/albanD
2024-12-18 23:02:30 +00:00
358ff3b731 [Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 1) (#136069)
[Inductor UT] Generalize newly introduced inductor UTs for Intel GPU
reuse `test/inductor/test_autoheuristic.py`
reuse `test/inductor/test_b2b_gemm.py`
reuse `test/inductor/test_custom_lowering.py`
reuse `test/inductor/test_efficient_conv_bn_eval.py`
reuse `test/inductor/test_group_batch_fusion.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136069
Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/jansel
2024-10-18 16:58:09 +00:00
dd7c2899bd [dynamo] Properly prune dead cell local variables (#136891)
This patch updates the `prune_dead_locals` logic to do slightly more aggressive pruning for cell local variables in the absence of side effects: e.g., a cell variable can be pruned when the function(s) using it will never be used again.
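A hedged illustration of the kind of code this targets (hypothetical function):

```python
def fn(t):
    x = t.sin()

    def inner():
        # `x` becomes a cell variable because `inner` closes over it.
        return x + 1

    y = inner()
    # Past this point neither `inner` nor the cell holding `x` can ever be
    # used again, so the pruning can drop them from the traced frame state.
    return y * 2
```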

See added tests for examples; note that a few tests in `test/dynamo/test_higher_order_ops.py` also got updated because we are no longer returning the unnecessary graph output.

Fixes #127350, #124653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136891
Approved by: https://github.com/jansel, https://github.com/anijain2305, https://github.com/williamwen42, https://github.com/zou3519
2024-10-10 18:21:24 +00:00
5e73f2d7c0 [PT2][Dynamo][Optimus] Add batch detach, clamp and nan_to_num in pre grad (#137415)
Test Plan:
# unit test
```
CUDA_VISIBLE_DEVICES=4 OC_CAUSE=1 buck2 test '@fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion -- test_math_op_fusion
```

Buck UI: https://www.internalfb.com/buck2/185799e1-6ea8-4bd1-b2e1-0c1a8dd92f89
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2533275044114335
Network: Up: 14KiB  Down: 287B  (reSessionID-d24cee56-2a22-4a90-b4c6-1d0c3ab256c1)
Jobs completed: 8. Time elapsed: 48.8s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# local reproduce

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run @mode/opt scripts/shuaiyang:test -- --optimus --flow_id 648108097 2>&1 | tee ~/local_run_shuai_interformer_cmf.txt
```

Counter({'pattern_matcher_nodes': 6626, 'pattern_matcher_count': 6396, 'extern_calls': 5340, 'benchmarking.TritonBenchmarker.benchmark_gpu': 2710, 'normalization_pass': 44, 'fxgraph_cache_miss': 37, 'scmerge_split_removed': 16, 'scmerge_cat_removed': 16, 'unbind_stack_pass': 16, 'batch_aten_mul': 15, 'batch_linear_post_grad': 12, 'batch_linear': 5, 'batch_detach': 4, 'batch_nan_to_num': 4, 'batch_clamp': 4, 'batch_aten_add': 4, 'batch_layernorm': 2, 'scmerge_cat_added': 2, 'batch_sigmoid': 1, 'scmerge_split_sections_removed': 1, 'unbind_stack_to_slices_pass': 1, 'benchmarking.TritonBenchmarker.triton_do_bench': 1, 'scmerge_split_added': 1, 'fxgraph_cache_hit': 1, 'batch_aten_sub': 1})
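The counters above show the new `batch_detach`, `batch_clamp`, and `batch_nan_to_num` patterns firing. A hedged sketch of enabling them explicitly (names taken from the counters; empty dicts assume default options):

```python
import torch._inductor.config as inductor_config

inductor_config.pre_grad_fusion_options = {
    "batch_detach": {},
    "batch_clamp": {},
    "batch_nan_to_num": {},
}
```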

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/mengluy/2024-10-06-20-53-01/trace.json.gz&bucket=gpu_traces

# e2e

baseline:
f650336422

proposal:

f650336607

### QPS and NE results

(QPS/NE result screenshots attached to the internal diff.)

> 0.7% QPS gain with NE neutral

### trace analysis

Before/after trace screenshots are attached to the internal diff.

We reduced the green portion of the trace, which was introduced by many small nan_to_num kernels.

Differential Revision: D63962711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137415
Approved by: https://github.com/Yuzhen11
2024-10-10 18:11:08 +00:00
758a0a88a2 [BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200)
This PR removes unnecessary `pass` statements. This is semantically safe because the bytecode for the Python code does not change.

Note that if a function has a docstring, an otherwise empty body does not need a `pass` statement as a placeholder.
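For example:

```python
# With a docstring the body is already non-empty, so `pass` adds nothing
# and removing it leaves the compiled bytecode unchanged.
def stub():
    """Placeholder kept for API compatibility."""
    pass  # <- flagged by PIE790 and removed
```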

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133200
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/kit1980
2024-08-15 15:50:19 +00:00
920f0426ae Add None return type to init -- tests rest (#132376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132376
Approved by: https://github.com/jamesjwu
ghstack dependencies: #132335, #132351, #132352
2024-08-01 15:44:51 +00:00
134bc4fc34 [BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by the linter.
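The enforced style presumably looks like this (one blank line between import segments; illustrative imports):

```python
import os
import sys

import torch

from torch.testing._internal.common_utils import run_tests
```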

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763
Approved by: https://github.com/jansel
2024-07-18 07:49:19 +00:00
b732b52f1e Revert "[BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)"
This reverts commit aecc746fccc4495313167e3a7f94210daf457e1d.

Reverted https://github.com/pytorch/pytorch/pull/129763 on behalf of https://github.com/XuehaiPan due to need reland after rerunning lintrunner on main ([comment](https://github.com/pytorch/pytorch/pull/129763#issuecomment-2235736732))
2024-07-18 06:39:58 +00:00
aecc746fcc [BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by the linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763
Approved by: https://github.com/jansel
2024-07-18 05:13:41 +00:00
f264745ff1 [interformer] batch pointwise op + unbind stack pass in post grad (#126959)
Summary: Tested on H100 with a single GPU and batch size 64.

Test Plan:
# local script

```
buck2 run mode/opt scripts/jackiexu0313/pt2:uniarch_perf_benchmark -- single-module-benchmark --provider interformer --enable_pt2 True --batch_size 64
```

baseline: P1370993922

| Metric             | Value        |
|:-------------------|:-------------|
| Latency            | 120.84 ms    |
| Model size         | 5.93 G bytes |
| Flops/example      | 62.22 GB     |
| TFLOPS             | 32.95        |
| MFU                | 4.12%        |
| Activation/example | 128.17 MB    |

proposal: P1371676068

config
```
torch._inductor.config.pre_grad_fusion_options = {}
torch._inductor.config.post_grad_fusion_options = {
    "batch_aten_mul": {"min_fuse_set_size": 50},
    "batch_aten_sigmoid": {"min_fuse_set_size": 50},
    "batch_aten_relu": {"min_fuse_set_size": 50},
    "batch_linear_post_grad": {"min_fuse_set_size": 50},
    "unbind_stack_aten_pass": {},
}
```

| Metric             | Value        |
|:-------------------|:-------------|
| Latency            | 117.30 ms    |
| Model size         | 5.93 G bytes |
| Flops/example      | 62.65 GB     |
| TFLOPS             | 34.18        |
| MFU                | 4.27%        |
| Activation/example | 163.12 MB    |

Differential Revision: D57595173

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126959
Approved by: https://github.com/jackiexu1992
2024-05-31 03:54:43 +00:00
51e707650f Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126615
Approved by: https://github.com/yanboliang, https://github.com/drisspg, https://github.com/xmfan
2024-05-22 17:28:46 +00:00
8a4597980c Revert "Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615)"
This reverts commit 831efeeadf5fa8d9e7f973057e634a57e3bcf04b.

Reverted https://github.com/pytorch/pytorch/pull/126615 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126615#issuecomment-2124169157))
2024-05-22 08:23:40 +00:00
831efeeadf Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126615
Approved by: https://github.com/yanboliang, https://github.com/drisspg, https://github.com/xmfan
2024-05-20 23:40:56 +00:00
4670dcc94c [Inductor]Fix a couple of broken unit tests (#122714)
Summary: As titled.

Test Plan:
```
buck2 test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
```
Buck UI: https://www.internalfb.com/buck2/ad05a43c-cb4a-443e-8904-b4d53e4f4b1e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798909218388
Network: Up: 107KiB  Down: 28KiB  (reSessionID-d7146e4f-773a-46ea-9852-f10f59302479)
Jobs completed: 24. Time elapsed: 1:49.3s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 0. Build failure 0

```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor/fb:split_cat_fx_passes_fb
```

Buck UI: https://www.internalfb.com/buck2/82dbf3b0-c747-4c07-98b8-53b69afa3157
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1125900267699118
Network: Up: 1.4GiB  Down: 2.3GiB  (reSessionID-0bd22c6d-5dfe-4b4a-bc24-705eadac884b)
Jobs completed: 252570. Time elapsed: 7:25.2s.
Cache hits: 95%. Commands: 123778 (cached: 117999, remote: 2779, local: 3000)
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D55378009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122714
Approved by: https://github.com/SherlockNoMad
2024-03-28 17:44:30 +00:00
d2a8d3864c [PT2][Inductor] Change the log for the group batch fusion (#122245)
Summary: Instead of logging under the generic "batch_fusion" and "group_fusion" names, we log the specific pass name, which better summarizes how often each pattern hits and makes debugging easier.
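A hedged sketch of the change (pass names borrowed from other commits in this log; the call sites are illustrative):

```python
from torch._dynamo.utils import counters

# after: bump a counter under the specific pass name
counters["inductor"]["batch_layernorm"] += 1
# before: every hit landed in one generic bucket
# counters["inductor"]["batch_fusion"] += 1
```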

Test Plan:
```
buck2 test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
```

Differential Revision: D55103303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122245
Approved by: https://github.com/jackiexu1992
2024-03-20 20:45:37 +00:00
a17cd226d6 [inductor] Enable FX graph caching on another round of inductor tests (#121994)
Summary: Enabling caching for these tests was blocked by https://github.com/pytorch/pytorch/pull/121686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121994
Approved by: https://github.com/eellison
2024-03-18 20:55:18 +00:00
7b1f5c874f [PT2][Optimus][Observability] Log the optimus graph transformation to the scuba (#119745)
Summary: The current everstore upload logging may cause excessive compilation time when the model has lots of graph breaks (post: https://fb.workplace.com/groups/257735836456307/permalink/633533465543207/); we now log the transformation only when the graph actually changed.
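A minimal sketch of the gating, assuming hypothetical helper names (`pass_fn`, `upload_fn`); only the changed-graph check reflects the summary above:

```python
def log_if_changed(gm, pass_fn, upload_fn):
    # Cheap structural snapshot of the FX graph before and after the pass.
    before = gm.graph.python_code("self").src
    pass_fn(gm)
    after = gm.graph.python_code("self").src
    if before != after:
        upload_fn(gm)  # skip the upload entirely for no-op transformations
```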

Test Plan:
timeout flows:
f528209775
f530084719

Differential Revision: D53692344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119745
Approved by: https://github.com/jackiexu1992
2024-02-16 21:32:04 +00:00
e9b78f2db0 Rewrite group_batch_fusion.find_independent_subset_greedy() to be iterative. (#118324)
Improve performance of inductor searching large graphs for potential fusions.
Also adds some direct unit tests of find_independent_subset_greedy() to ensure that the rewrite didn't break behavior.

Fixes #98467

Previously find_independent_subset_greedy() was recursive and the example from the issue would cause it to blow out the stack. This changes it to be iterative and also caches some of the computed dependencies (it can't cache all of them because the caller is allowed to change the graph during the iteration).
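A minimal, self-contained sketch of the iterative greedy idea (assuming nodes arrive in topological order; the real implementation also tolerates the caller mutating the graph between yields):

```python
def find_independent_subsets(nodes, depends_on):
    """Each pass collects nodes that don't depend on anything already
    chosen for the current subset; leftovers seed the next subset."""
    remaining = list(nodes)
    while remaining:
        subset, rest = [], []
        for n in remaining:
            # Topological order means only "later depends on earlier" matters.
            if any(depends_on(n, m) for m in subset):
                rest.append(n)
            else:
                subset.append(n)
        yield subset
        remaining = rest
```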

Fusion is still slow - but at least finishes.

After this change the example given in #98467 has the following backend timings (on one particular CPU):
eager timing: 3m:23s
aot_eager timing: 4m:12s
inductor timing: 22m:24s

Possible future work to improve this further:
1. In dynamo limit the amount of inlining allowed before falling back to a graph break. This test ends up tracing through 483k bytecodes generating the graph.
2. In inductor have a limit so we don't exhaustively search the graph for fusion possibilities.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118324
Approved by: https://github.com/oulgen
2024-02-13 22:54:53 +00:00
865945cc1f Convert requires_cuda to full decorator (#118281)
Don't require using it as `@requires_cuda()`; use `@requires_cuda` instead. There is no need for the partial function to be invoked many times.
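A sketch of the change: `requires_cuda` becomes the decorator itself rather than a zero-arg factory (this definition is an assumption; `unittest.skipUnless` already returns a ready-made decorator):

```python
import unittest
import torch

requires_cuda = unittest.skipUnless(torch.cuda.is_available(), "requires CUDA")

@requires_cuda        # previously: @requires_cuda()
def test_something(self):
    ...
```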

Split out this change from the initial large refactoring in #117741 to hopefully get merged before conflicts arise

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118281
Approved by: https://github.com/ezyang
2024-01-25 15:50:21 +00:00
12d7ea19af [Inductor][fx pass] Add sub and div pointwise ops to the post grad fusion (#115389)
Summary: As titled.

Test Plan:
# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
```
Buck UI: https://www.internalfb.com/buck2/792c58db-c369-487d-9a42-b5da471657c0
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2814749981661407
Network: Up: 74KiB  Down: 29KiB  (reSessionID-b47c266b-12d6-4e88-8dc3-4af1dd7ecbb4)
Jobs completed: 20. Time elapsed: 2:09.6s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0

# local reproduce
OC: P899142918
MAI: P899175452
# e2e (oc)

Differential Revision: D51957242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115389
Approved by: https://github.com/dshi7, https://github.com/jackiexu1992, https://github.com/xuzhao9
2023-12-08 21:07:03 +00:00
58809e8914 [Inductor][Optimus]Move group/batch fusion logic out of inductor (#115128)
Summary:
As discussed in D51695982, fusion may not always be beneficial. We want to let users customize the fx passes.

Some examples of the new configs:
* Use the batch_fusion config: this automatically enables the following batch fusions: batch linear, layernorm, relu, tanh, and sigmoid, plus post grad batch linear fusion.
* Or use an explicit config:
```
"pre_grad_fusion_options": {
            "batch_linear": {"min_fuse_set_size": 10},
            "batch_linear_lhs": {},
            "batch_layernorm": {"max_fuse_search_depth": 100},
            "batch_tanh": {},
            "batch_relu": {},
            "batch_sigmoid": {}
          },
```

Test Plan:
with flag: f509168388

with config: f509168595

Reviewed By: frank-wei, mengluy0125

Differential Revision: D51817314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115128
Approved by: https://github.com/mengluy0125
2023-12-05 08:19:17 +00:00
50833021dd [Inductor] We re-enable the batch_fusion and group_fusion flags in order not to disturb the current production model implementation (#114841)
Summary:
We did two things:
1. We add back the batch_fusion and group_fusion flags to keep the current production model implementation

2. We distinguish batch and group fusion in the post grad pass, since group fusion requires fbgemm.

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
```
Buck UI: https://www.internalfb.com/buck2/13d152d2-5d4d-4c7a-ab88-51f8e8218942
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1125900253044737
Network: Up: 376KiB  Down: 44KiB  (reSessionID-c508aedc-8cc2-434a-8c17-bbe075a05562)
Jobs completed: 17. Time elapsed: 1:23.1s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D51695982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114841
Approved by: https://github.com/jackiexu1992
2023-12-03 23:59:10 +00:00
1bcefaf575 [inductor] post_grad batched linear fusion (#112504)
Summary: Fusing independent nn.Linear() functions with aten.bmm and aten.cat.
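A hedged sketch of the equivalence being exploited (illustrative shapes; not the pass's literal output):

```python
import torch

def unfused(xs, ws, bs):
    # N independent linears: y_i = x_i @ w_i.T + b_i
    return [torch.nn.functional.linear(x, w, b) for x, w, b in zip(xs, ws, bs)]

def fused(xs, ws, bs):
    X = torch.stack(xs)                  # (N, M, K)
    W = torch.stack(ws).transpose(1, 2)  # weights (N, O, K) -> (N, K, O)
    B = torch.stack(bs).unsqueeze(1)     # (N, 1, O), broadcasts over M
    return list(torch.baddbmm(B, X, W).unbind(0))  # one batched matmul
```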

Test Plan:
Without the BMM fusion:
```
buck2 run @mode/opt //pytorch/benchmark:run -- test_module -d cuda --module test_linear_module --torchdynamo inductor --torchinductor_cudagraph 0 --torchinductor_batch_fusion 0
```
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/test/torchbench_test_module_20231030_072536_6535183793.json.gz&bucket=pyper_traces

100 aten::mm operators

With the BMM fusion:
```
buck2 run @mode/opt //pytorch/benchmark:run -- test_module -d cuda --module test_linear_module --torchdynamo inductor --torchinductor_cudagraph 0 --torchinductor_batch_fusion 1
```

20 aten::bmm operators

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/test/torchbench_test_module_20231030_072157_6535183793.json.gz&bucket=pyper_traces

Passes accuracy test:
```
$ buck2 run @mode/opt //pytorch/benchmark:run -- test_module -d cuda --module test_linear_module --torchdynamo inductor --torchinductor_cudagraph 0 --torchinductor_batch_fusion 1 --accuracy
Running eval method from test_module on cuda in dynamo inductor mode with input batch size 4 and precision tf32.
Accuracy:                            pass
```
Looks like the bmm and the input cat have been fused successfully.

Checking the triton codegen:

```
TORCH_LOGS=+dynamo,+aot,+inductor buck2 run @mode/opt //pytorch/benchmark:run -- test_module -d cuda --module test_linear_module --torchdynamo inductor --torchinductor_cudagraph 0 --torchinductor_batch_fusion 1 --dump_triton 1
```

Triton code dump: https://www.internalfb.com/intern/everpaste/?handle=GHp1ABaqYuTjYCUBALiTWmteaI1PbsIXAAAB

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112504
Approved by: https://github.com/yanboliang
2023-12-01 19:26:29 +00:00
c1f7d4ad6a [Inductor][fx pass] Refactor code to easily add pointwise op to do the batch fusion (#113381)
Summary:
1. We refactor the code to provide a unified API for adding pointwise ops.

2. Add one more op, sigmoid, since we observed it in MC models.

Test Plan:
# local reproduce for CMF

```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch -c
```
P876977403
P876996776

diffing: https://www.internalfb.com/intern/diffing/?paste_number=876999623

Differential Revision: D51142990

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113381
Approved by: https://github.com/xuzhao9
2023-11-29 18:29:57 +00:00
9f0deb132b [Inductor] Refactor group/batch fusion to support user defined execution order and configs (#113738)
Meta-internal customers need more flexibility in the execution order and parameters of these group/batch fusions. This PR provides a new inductor config that lets users fine-tune and auto-tune these group/batch fusions for different models.
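A hedged sketch of what the new config could look like (option names borrowed from other commits in this log; treating dict order as execution order is an assumption):

```python
import torch

torch._inductor.config.post_grad_fusion_options = {
    "group_linear": {"require_fbgemm": True},       # runs first
    "batch_aten_mul": {"min_fuse_set_size": 50},    # runs second
}
```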

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113738
Approved by: https://github.com/xuzhao9
2023-11-22 05:46:23 +00:00
6bffde99b0 Revert "[inductor] Move things into torch/testing/_internal/inductor_utils.py (#113275)"
This reverts commit 66d09f82170c528698b5ec606ba7838268ae1f8a.

Reverted https://github.com/pytorch/pytorch/pull/113275 on behalf of https://github.com/huydhn due to Sorry for reverting your stack, but it is failing to list test internally with buck2 ([comment](https://github.com/pytorch/pytorch/pull/113275#issuecomment-1811666004))
2023-11-15 01:44:26 +00:00
1e60174891 Revert "[dynamo] Add run_inductor_tests entrypoint (#113278)"
This reverts commit b00311ce9e430cf1b98d2103e21ed2179450a424.

Reverted https://github.com/pytorch/pytorch/pull/113278 on behalf of https://github.com/huydhn due to Sorry for reverting your stack, but it is failing to list test internally with buck2 ([comment](https://github.com/pytorch/pytorch/pull/113278#issuecomment-1811646325))
2023-11-15 01:19:48 +00:00
b00311ce9e [dynamo] Add run_inductor_tests entrypoint (#113278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113278
Approved by: https://github.com/yanboliang
2023-11-11 08:54:43 +00:00
66d09f8217 [inductor] Move things into torch/testing/_internal/inductor_utils.py (#113275)
This PR is just moving things around, so code shared by multiple test files lives in torch/testing/_internal/inductor_utils.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113275
Approved by: https://github.com/yanboliang
ghstack dependencies: #113242
2023-11-11 03:17:35 +00:00
68bf0f1e7d Revert "[inductor] Move things into torch/testing/_internal/inductor_utils.py (#113275)"
This reverts commit c967dc526a40f4b15003f9c99383acabe66367a6.

Reverted https://github.com/pytorch/pytorch/pull/113275 on behalf of https://github.com/PaliC due to the diff this is stacked on top of appears to be causing inductor failures internally ([comment](https://github.com/pytorch/pytorch/pull/113275#issuecomment-1805131017))
2023-11-10 05:40:55 +00:00
c967dc526a [inductor] Move things into torch/testing/_internal/inductor_utils.py (#113275)
This PR is just moving things around, so code shared by multiple test files lives in torch/testing/_internal/inductor_utils.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113275
Approved by: https://github.com/yanboliang
2023-11-10 00:11:09 +00:00
2b952834c7 [pytorch][PR] [Inductor][FX passes] Pre grad batch relu fusion (#111146)
Summary: We detect independent relu operators and fuse them in pre grad.

Test Plan:
### unit test
```
buck2 test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/16888498608558485

### Inline cvr
f479655232
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch_group
```
before vs after transformation
https://www.internalfb.com/intern/diffing/?paste_number=851907099

```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch_group -c
```

P852036786

Differential Revision: D50207610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111146
Approved by: https://github.com/yanboliang
2023-10-25 17:37:39 +00:00
fc33dc014a [inductor][fx passes] batch tanh in pre grad (#107881)
Summary:
Daohang reported this pattern in f469463749 (trace screenshots attached to the internal diff).
Hence, we can fuse the tanh ops that follow the same split.
Typically the pattern looks like split -> getitem_0, ..., getitem_n -> tanh(getitem_0), ..., tanh(getitem_n). We therefore search for the parent node of each tanh node, which should be getitem(parent, index); if the tanh ops follow the same split node, the parent nodes of those getitem nodes are all the same.
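An illustrative equivalence (a sketch, not necessarily the exact rewrite the pass emits):

```python
import torch

def unfused(x):
    parts = torch.split(x, 4)          # split -> getitem_0, ..., getitem_n
    return [t.tanh() for t in parts]   # tanh(getitem_i) on each output

def fused(x):
    # tanh is elementwise, so applying it once before the split is
    # numerically identical to applying it to every split output.
    return list(torch.split(x.tanh(), 4))
```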

Test Plan:
```
[jackiexu0313@devgpu005.cln5 ~/fbsource/fbcode (c78736187)]$ buck test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
File changed: fbcode//caffe2/test/inductor/test_group_batch_fusion.py
Buck UI: https://www.internalfb.com/buck2/df87affc-d294-4663-a50d-ebb71b98070d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9570149208311124
Network: Up: 0B  Down: 0B
Jobs completed: 16. Time elapsed: 1:19.9s.
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D48581140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107881
Approved by: https://github.com/yanboliang
2023-08-25 03:02:30 +00:00
9bda8f1e16 [inductor][fx passes]batch linear in pre grad (#107759)
Summary:
After we compile dense arch, we observe split-linear-cat pattern. Hence, we want to use bmm fusion + split cat pass to fuse the pattern as torch.baddmm.
Some explanation why we prefer pre grad:
1) We need to add bmm fusion before split cat pass which is in pre grad pass to remove the new added stack and unbind node with the original cat/split node
2) Post grad does not support torch.stack/unbind. There is a hacky workaround but may not be landed in short time.

Test Plan:
# unit test
```
buck test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
File changed: fbcode//caffe2/test/inductor/test_group_batch_fusion.py
Buck UI: https://www.internalfb.com/buck2/189dd467-d04d-43e5-b52d-d3b8691289de
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5910974704097734
Network: Up: 0B  Down: 0B
Jobs completed: 14. Time elapsed: 1:05.4s.
Tests finished: Pass 5. Fail 0. Fatal 0. Skip 0. Build failure 0
```
# local test
```
=================Single run start========================
enable split_cat_pass for control group
================latency analysis============================
latency is : 73.79508209228516 ms

=================Single run start========================
enable batch fusion for control group
enable split_cat_pass for control group
================latency analysis============================
latency is : 67.94447326660156 ms
```
# e2e test
TODO: add an e2e test.

Differential Revision: D48539721

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107759
Approved by: https://github.com/yanboliang
2023-08-24 03:42:09 +00:00
e34a05b960 [ez][inductor][fx pass] strengthen numerical check for batch fusion (#106744)
Summary:
As titled.
For batch fusion, we fuse using torch ops, so the results should be exactly the same as the originals.
Pull request: https://github.com/pytorch/pytorch/pull/106731#issuecomment-1668662078
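A hedged sketch of the strengthened check (illustrative; the test's actual helper may differ):

```python
import torch

def check(fused_out: torch.Tensor, ref_out: torch.Tensor) -> None:
    # Exact bitwise comparison instead of a loose tolerance, e.g.
    # torch.allclose(fused_out, ref_out, atol=1e-3).
    assert torch.equal(fused_out, ref_out)
```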

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
File changed: fbcode//caffe2/test/inductor/test_group_batch_fusion.py
File changed: fbsource//xplat/caffe2/test/inductor/test_group_batch_fusion.py
Buck UI: https://www.internalfb.com/buck2/cf14a2dd-faee-417a-8d26-0b9326c944e4
Test UI: https://www.internalfb.com/intern/testinfra/testrun/6755399617159540
Network: Up: 0B  Down: 0B
Jobs completed: 12. Time elapsed: 2:55.5s.
Tests finished: Pass 4. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Reviewed By: dshi7

Differential Revision: D48132255

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106744
Approved by: https://github.com/kit1980
2023-08-10 03:49:23 +00:00
416bf4e3e7 [Inductor][FX passes] Pre grad batch linear LHS fusion (#106497)
This is a popular pattern in many internal use cases. We have two versions (pre grad and post grad) and found the pre grad version gives more perf gain, which makes sense in theory, as the corresponding backward graph doesn't contain this pattern.
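Roughly, the pattern fuses several linears that share the same left-hand-side input; a hedged sketch of the equivalence (illustrative, assuming 2-D inputs):

```python
import torch

def unfused(x, ws, bs):
    return [torch.nn.functional.linear(x, w, b) for w, b in zip(ws, bs)]

def fused(x, ws, bs):
    W = torch.cat(ws)   # concat weights along the output dimension
    B = torch.cat(bs)
    out = torch.nn.functional.linear(x, W, B)  # one matmul
    return list(out.split([w.shape[0] for w in ws], dim=-1))
```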

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106497
Approved by: https://github.com/jackiexu1992, https://github.com/jansel
2023-08-07 05:52:27 +00:00
03e85be9b0 [Inductor][FX passes] New group/batch fusion pattern searching algorithm + group mm fusion + preserve memory layout (#106279)
Summary:

Major changes:
* Implement a new group/batch fusion pattern-searching algorithm: only fuse candidates that are within a certain depth difference of each other (i.e., locally).
* Search the FX graph in reverse order, since most ops have more inputs than outputs.
* Support fusing mm (linear backward).
* Preserve the memory layout for fbgemm.gmm.

We tested on Ads models and saw consistent gains.

Test Plan: Unit tests and integration test.

Differential Revision: D47581710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106279
Approved by: https://github.com/jansel, https://github.com/Skylion007
2023-08-01 01:10:44 +00:00
e40f8acef2 [inductor][fx passes] batch layernorm (#105492)
Summary: Batch layernorm: fuse independent horizontal layernorms of the same size into one.

Test Plan:
# unit test
```
buck test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
File changed: fbcode//caffe2/test/inductor/test_group_batch_fusion.py
Buck UI: https://www.internalfb.com/buck2/68eb51e1-bdbc-4847-aabf-e50737d8485b
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5066549764442206
Network: Up: 0 B  Down: 0 B
Jobs completed: 10. Time elapsed: 1:07.2s.
Tests finished: Pass 3. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D47447542

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105492
Approved by: https://github.com/jansel, https://github.com/xuzhao9
2023-07-21 05:03:04 +00:00
dc58259746 [Inductor] [FX passes] Group linear fusion (#105116)
Summary:
This is a draft version of a group + batch fusion framework, together with the group linear fusion implementation.
In the future, it's straightforward to add a new group/batch fusion policy by defining a class with match + fuse functions.
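A hedged sketch of that extension point (method names from the summary above; the concrete base-class details are assumptions):

```python
class MyGroupFusion:
    def match(self, node) -> bool:
        """Decide whether `node` can participate in this fusion."""
        raise NotImplementedError

    def fuse(self, graph, subset) -> None:
        """Rewrite the matched `subset` of nodes in `graph` into one op."""
        raise NotImplementedError
```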

Test Plan: buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion

Differential Revision: D46956695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105116
Approved by: https://github.com/jansel
2023-07-18 03:56:42 +00:00