Not yet ready to set HAS_GPU to true, but we can unskip tests that require GPU
(Noticed while running test_mps_basics.py that `test_scalar_cpu_tensor_arg` is getting skipped)
- Replace `GPU_TYPE` with `self.device` in `test_custom_op_fixed_layout_sequential`, `test_inductor_layout_optimization_input_mutations`, and `test_mutable_custom_op_fixed_layout2`; otherwise these are just GPU tests running under the `_cpu` suffixes.
- Tweak `test_tmp_not_defined_issue3` to work correctly on CPU, by defining `test_device` and `test_device_0`
- UnXFail `test_mutable_custom_op_fixed_layout2_dynamic_shapes` as it should just work on CPU
- Add a `skip_if_no_triton` decorator and decorate `test_reduction_config_limit` with it, as the test needs neither CPU nor GPU, just a Triton backend (a rough sketch follows below).
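For reference, a minimal sketch of what such a decorator could look like (hypothetical implementation, not necessarily the one added in this PR):
```
import functools
import unittest

# Hypothetical sketch of a skip_if_no_triton decorator: skip the test when no
# Triton backend is importable, independent of whether CPU or GPU is used.
def skip_if_no_triton(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            import triton  # noqa: F401
        except ImportError:
            raise unittest.SkipTest("test requires a Triton backend")
        return fn(*args, **kwargs)
    return wrapper
```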
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145156
Approved by: https://github.com/dcci, https://github.com/Skylion007, https://github.com/jansel
Summary: Instead of logging under the generic "batch_fusion" and "group_fusion" names, we log under the specific pass name, which better summarizes the hits of each pattern and makes debugging easier.
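As a rough illustration of the idea (illustrative names only, not the actual inductor logging code):
```
from collections import Counter

# Bump a counter keyed by the specific fusion pass name instead of a generic
# "batch_fusion"/"group_fusion" bucket, so per-pattern hits stay visible.
fusion_counters = Counter()

def record_fusion_hit(pass_name):
    fusion_counters[pass_name] += 1

record_fusion_hit("batch_linear")
record_fusion_hit("batch_layernorm")
record_fusion_hit("batch_linear")
print(fusion_counters.most_common())  # [('batch_linear', 2), ('batch_layernorm', 1)]
```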
Test Plan:
```
buck2 test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
```
Differential Revision: D55103303
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122245
Approved by: https://github.com/jackiexu1992
Improve the performance of inductor when searching large graphs for potential fusions.
Also adds some direct unit tests of find_independent_subset_greedy() to ensure that the rewrite didn't break behavior.
Fixes #98467
Previously find_independent_subset_greedy() was recursive and the example from the issue would cause it to blow out the stack. This changes it to be iterative and also caches some of the computed dependencies (it can't cache all of them because the caller is allowed to change the graph during the iteration).
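A rough sketch of the iterative shape of the search (generic code under stated assumptions, not the actual implementation in torch/_inductor):
```
# Greedily split `nodes` into independent subsets without recursion: a node
# joins the current subset only if it does not depend on any node already in
# it. Per-node dependencies are cached via `depends_on`, a callable assumed to
# return the full set of ancestors of a node.
def find_independent_subsets(nodes, depends_on):
    dep_cache = {}

    def deps(node):
        if node not in dep_cache:
            dep_cache[node] = depends_on(node)
        return dep_cache[node]

    remaining = list(nodes)
    while remaining:
        subset, rest, members = [], [], set()
        for node in remaining:
            if deps(node).isdisjoint(members):
                subset.append(node)
                members.add(node)
            else:
                rest.append(node)
        yield subset
        remaining = rest
```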
Fusion is still slow - but at least finishes.
After this change the example given in #98467 has the following backend timings (on one particular CPU):
eager timing: 3m:23s
aot_eager timing: 4m:12s
inductor timing: 22m:24s
Possible future work to improve this further:
1. In dynamo, limit the amount of inlining allowed before falling back to a graph break. This test ends up tracing through 483k bytecodes to generate the graph.
2. In inductor have a limit so we don't exhaustively search the graph for fusion possibilities.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118324
Approved by: https://github.com/oulgen
Don't require using it as `@requires_cuda()`; use `@requires_cuda` instead. There is no need for the partial function to be invoked many times.
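Roughly, the difference looks like this (a sketch only, not the exact code from the PR):
```
import functools
import unittest
import torch

# Before: a partial that still had to be called, so tests wrote @requires_cuda().
requires_cuda_old = functools.partial(
    unittest.skipUnless, torch.cuda.is_available(), "requires CUDA"
)

# After: bind the condition once so the name itself is a decorator and tests
# can simply write @requires_cuda.
requires_cuda = unittest.skipUnless(torch.cuda.is_available(), "requires CUDA")
```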
Split out this change from the initial large refactoring in #117741 to hopefully get merged before conflicts arise
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118281
Approved by: https://github.com/ezyang
Summary:
As discussed in D51695982, fusion may not always be beneficial. We want to let users customize the fx passes.
Some examples of the new configs:
* Use the batch_fusion config: this automatically enables the following batch fusions: batch linear, layernorm, relu, tanh, and sigmoid, plus the post-grad batch linear fusion
* Use the fine-grained `pre_grad_fusion_options` config:
```
"pre_grad_fusion_options": {
"batch_linear": {"min_fuse_set_size": 10},
"batch_linear_lhs": {},
"batch_layernorm": {"max_fuse_search_depth": 100},
"batch_tanh": {},
"batch_relu": {},
"batch_sigmoid": {}
},
```
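For example, a hedged sketch of setting this from Python before compiling, assuming the `pre_grad_fusion_options` inductor config field shown above:
```
import torch
import torch._inductor.config as inductor_config

# Enable only the listed pre-grad batch fusions, with per-pass parameters.
inductor_config.pre_grad_fusion_options = {
    "batch_linear": {"min_fuse_set_size": 10},
    "batch_linear_lhs": {},
    "batch_layernorm": {"max_fuse_search_depth": 100},
    "batch_tanh": {},
    "batch_relu": {},
    "batch_sigmoid": {},
}

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Tanh())
compiled = torch.compile(model)
out = compiled(torch.randn(4, 16))
```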
Test Plan:
with flag: f509168388
with config: f509168595
Reviewed By: frank-wei, mengluy0125
Differential Revision: D51817314
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115128
Approved by: https://github.com/mengluy0125
Summary:
We did two things:
1. We add back the batch_fusion and group_fusion flags to keep the current production model implementation.
2. We distinguish batch fusion from group fusion in the post-grad pass, since group fusion needs fbgemm.
Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
Buck UI: https://www.internalfb.com/buck2/13d152d2-5d4d-4c7a-ab88-51f8e8218942
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1125900253044737
Network: Up: 376KiB Down: 44KiB (reSessionID-c508aedc-8cc2-434a-8c17-bbe075a05562)
Jobs completed: 17. Time elapsed: 1:23.1s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D51695982
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114841
Approved by: https://github.com/jackiexu1992
Meta-internal customers need more flexible control over the execution order and parameters of these group/batch fusions. This PR provides a new inductor config so users can fine-tune and auto-tune these group/batch fusions for different models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113738
Approved by: https://github.com/xuzhao9
Summary:
Daohang reported this pattern in f469463749.
Hence, we can fuse the tanh ops that follow the same split.
Typically the pattern looks like split -> getitem 0,...,n -> tanh(getitem 0,...,n). Hence, we search for the parent node of each tanh node, which should be getitem(parent, index). If the tanh ops follow the same split node, the parent nodes of those getitem nodes should all be the same.
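A minimal, hypothetical example of the pattern being targeted (module and shapes are illustrative):
```
import torch

# Each tanh consumes a getitem of the same split, so the batch fusion can
# instead apply tanh once to the un-split tensor.
class SplitTanh(torch.nn.Module):
    def forward(self, x):
        chunks = torch.split(x, 4, dim=1)                      # split -> getitem 0..n
        return torch.cat([c.tanh() for c in chunks], dim=1)    # tanh on every getitem

compiled = torch.compile(SplitTanh())
out = compiled(torch.randn(8, 16))
```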
Test Plan:
```
[jackiexu0313@devgpu005.cln5 ~/fbsource/fbcode (c78736187)]$ buck test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
File changed: fbcode//caffe2/test/inductor/test_group_batch_fusion.py
Buck UI: https://www.internalfb.com/buck2/df87affc-d294-4663-a50d-ebb71b98070d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9570149208311124
Network: Up: 0B Down: 0B
Jobs completed: 16. Time elapsed: 1:19.9s.
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D48581140
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107881
Approved by: https://github.com/yanboliang
Summary:
After we compile the dense arch, we observe a split-linear-cat pattern. Hence, we want to use bmm fusion plus the split-cat pass to fuse the pattern into torch.baddbmm (a small illustrative sketch follows the list below).
Some explanation of why we prefer pre-grad:
1) We need to add bmm fusion before the split-cat pass (which runs in pre-grad) so that the newly added stack and unbind nodes can be removed along with the original cat/split nodes.
2) Post-grad does not support torch.stack/unbind. There is a hacky workaround, but it may not land soon.
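For illustration, a hypothetical module showing the split-linear-cat shape of the pattern (names and sizes are made up):
```
import torch

# Several small linears applied to chunks of one tensor, with their outputs
# concatenated; batching them lets the compiler express the whole thing as a
# single batched matmul (baddbmm) plus a reshape.
class SplitLinearCat(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linears = torch.nn.ModuleList(torch.nn.Linear(16, 8) for _ in range(4))

    def forward(self, x):
        chunks = torch.split(x, 16, dim=1)                       # split
        outs = [lin(c) for lin, c in zip(self.linears, chunks)]  # per-chunk linear
        return torch.cat(outs, dim=1)                            # cat

compiled = torch.compile(SplitLinearCat())
y = compiled(torch.randn(2, 64))
```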
Test Plan:
# unit test
```
[jackiexu0313@devgpu005.cln5 ~/fbsource/fbcode (f0ff3e3fc)]$ buck test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
File changed: fbcode//caffe2/test/inductor/test_group_batch_fusion.py
Buck UI: https://www.internalfb.com/buck2/189dd467-d04d-43e5-b52d-d3b8691289de
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5910974704097734
Network: Up: 0B Down: 0B
Jobs completed: 14. Time elapsed: 1:05.4s.
Tests finished: Pass 5. Fail 0. Fatal 0. Skip 0. Build failure 0
```
# local test
```
=================Single run start========================
enable split_cat_pass for control group
================latency analysis============================
latency is : 73.79508209228516 ms
=================Single run start========================
enable batch fusion for control group
enable split_cat_pass for control group
================latency analysis============================
latency is : 67.94447326660156 ms
```
# e2e test
TODO: add e2e test
Differential Revision: D48539721
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107759
Approved by: https://github.com/yanboliang
Summary:
Major changes:
* Implement a new group/batch fusion pattern searching algorithm: only fuse patterns whose nodes are within a certain depth difference of each other (i.e., locally).
* Search the FX graph in reverse order, since most ops have more inputs than outputs.
* Support fusing mm (linear backward).
* Preserve memory layout for fbgemm.gmm.
We tested on Ads models and saw consistent gains.
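A rough sketch of the depth-difference idea from the list above (a hypothetical helper, not the PR's actual code):
```
import collections
import torch

MAX_DEPTH_DIFF = 5  # illustrative constant

# Compute a depth for every FX node, then bucket fusion candidates so that
# nodes grouped together differ in depth by less than MAX_DEPTH_DIFF.
def group_candidates_by_depth(graph: torch.fx.Graph, is_candidate):
    depth = {}
    for node in graph.nodes:  # graph.nodes is topologically ordered
        depth[node] = 1 + max((depth[i] for i in node.all_input_nodes), default=0)

    buckets = collections.defaultdict(list)
    # Walk in reverse, mirroring the reverse-order search described above.
    for node in reversed(list(graph.nodes)):
        if is_candidate(node):
            buckets[depth[node] // MAX_DEPTH_DIFF].append(node)
    return list(buckets.values())
```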
Test Plan: Unit tests and integration test.
Differential Revision: D47581710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106279
Approved by: https://github.com/jansel, https://github.com/Skylion007
Summary:
This is a draft version of a group + batch fusion framework, together with the group linear fusion implementation.
In the future, it should be straightforward to add a new group/batch fusion policy by defining a class with match + fuse functions.
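For example, a hedged sketch of what such a policy class could look like (the interface and names are illustrative, not the framework's actual API):
```
import operator
import torch

# A batch-tanh policy under a match + fuse interface: `match` picks the nodes
# that belong to the group, `fuse` rewrites the matched subset of the graph.
# The sketch assumes all inputs of the matched nodes are defined before the
# first node in `subset`.
class BatchTanhFusion:
    def match(self, node: torch.fx.Node) -> bool:
        return node.op == "call_function" and node.target is torch.tanh

    def fuse(self, graph: torch.fx.Graph, subset):
        with graph.inserting_before(subset[0]):
            stacked = graph.call_function(torch.stack, ([n.args[0] for n in subset],))
            fused = graph.call_function(torch.tanh, (stacked,))
            unbound = graph.call_function(torch.unbind, (fused,))
            items = [
                graph.call_function(operator.getitem, (unbound, i))
                for i in range(len(subset))
            ]
        for node, item in zip(subset, items):
            node.replace_all_uses_with(item)
            graph.erase_node(node)
```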
Test Plan: buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
Differential Revision: D46956695
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105116
Approved by: https://github.com/jansel