pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Xuehai Pan	b77406a9ec	[BE][CI] bump `ruff` to 0.8.4 (#143753 ) Changes: 1. Bump `ruff` from 0.7.4 to 0.8.4 2. Change `%`-formatted strings to f-string 3. Change arguments with the `__`-prefix to positional-only arguments with the `/` separator in function signature. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143753 Approved by: https://github.com/Skylion007	2024-12-24 12:24:10 +00:00
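Changes 2 and 3 in the `ruff` bump above can be illustrated with a small before/after pair. This is only a sketch and is not taken from the PR: `describe_old` and `describe_new` are hypothetical functions invented for this example.

```python
# Hypothetical functions, for illustration only; not from the PyTorch codebase.


def describe_old(__value, width=8):
    # Before: the `__` prefix marks the parameter as positional-only by
    # convention (type checkers honor it), and the string uses `%` formatting.
    return "%s padded to %d" % (__value, width)


def describe_new(value, /, width=8):
    # After: `/` makes `value` positional-only at runtime, and the
    # message is built with an f-string.
    return f"{value} padded to {width}"


print(describe_old("x"))
print(describe_new("x"))

try:
    describe_new(value="x")  # keyword use of a positional-only parameter
except TypeError as exc:
    print(f"rejected keyword call: {exc}")
```

With the `/` separator the interpreter itself rejects the keyword call, whereas the `__` prefix relies on tooling to flag it.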
Xuehai Pan	c0ed38e644	[BE][Easy][3/19] enforce style for empty lines in import segments in `benchmarks/` (#129754 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129754 Approved by: https://github.com/ezyang	2024-07-17 14:34:42 +00:00
Edward Z. Yang	dd3a77bc96	Apply UFMT to all files in benchmarks/ (#105928 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/105928 Approved by: https://github.com/albanD	2023-07-26 01:18:48 +00:00
Justin Chu	5ef023b05a	[BE] Enable ruff's UP rules and autoformat benchmarks/ (#105429 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105429 Approved by: https://github.com/malfet	2023-07-19 04:46:37 +00:00
Xuehai Pan	a229b4526f	[BE] Prefer dash over underscore in command-line options (#94505 ) Preferring dash over underscore in command-line options. Add `--command-arg-name` to the argument parser. The old arguments with underscores `--command_arg_name` are kept for backward compatibility. Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). Dashes are more common in other command-line tools, and they appear to be the default choice in the Python standard library: `argparse.BooleanOptionalAction`: `4a9dff0e5a/Lib/argparse.py (L893-L895)` ```python class BooleanOptionalAction(Action): def __init__(...): if option_string.startswith('--'): option_string = '--no-' + option_string[2:] _option_strings.append(option_string) ``` It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires pressing the Shift or Caps Lock key, whereas `-` does not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505 Approved by: https://github.com/ezyang, https://github.com/seemethere	2023-02-09 20:16:49 +00:00
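One way to accept both spellings, as the commit above describes, is to register the dashed and underscored option strings on the same `argparse` argument. The sketch below is an assumption about how this could be done, not the PR's actual diff; `--command-arg-name` comes from the commit message, while `--some-flag` and the `prog` name are hypothetical. `argparse.BooleanOptionalAction` requires Python 3.9 or newer.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="example")
    # Dashed spelling first, underscored spelling kept as a
    # backward-compatible alias for the same destination.
    parser.add_argument(
        "--command-arg-name",
        "--command_arg_name",
        dest="command_arg_name",
        default=None,
    )
    # Hypothetical boolean flag: BooleanOptionalAction derives the
    # negative form with a dash, i.e. `--no-some-flag`.
    parser.add_argument(
        "--some-flag",
        action=argparse.BooleanOptionalAction,
        default=True,
    )
    return parser


new_style = build_parser().parse_args(["--command-arg-name", "x"])
old_style = build_parser().parse_args(["--command_arg_name", "x"])
negated = build_parser().parse_args(["--no-some-flag"])
print(new_style.command_arg_name, old_style.command_arg_name, negated.some_flag)
```

Both spellings populate the same attribute, so existing scripts that pass the underscored form keep working.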
Yulv-git	ac2d2e3a3d	Fix some typos. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/75561 Approved by: https://github.com/albanD	2022-04-11 21:55:59 +00:00
Elias Ellison	6694fdaccd	Clean up profiling mode and profiling executor strategy (#73875 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73875 Previously we had a few settings: - getExecutor - which toggled between Profiling Executor and Legacy - getGraphOptimize - if true, overrides PE/Legacy to run with simple executor (no optimizations) and then... - getProfilingMode - which would set PE to 0 specializations. The last mode is redundant with getGraphOptimize; we should just remove it and use getGraphOptimize in these cases. It would lead to potentially invalid combinations of logic - what does it mean if getProfilingMode is true but getExecutor is set to false? This would lead to a bug in specialize_autograd_zero in this case, see: https://github.com/pytorch/pytorch/blob/master/torch%2Fcsrc%2Fjit%2Fpasses%2Fspecialize_autogradzero.cpp#L93. The tests here are failing but get fixed with the PR above it, so I'll squash for landing. Test Plan: Imported from OSS Reviewed By: cpuhrsch Differential Revision: D34938130 Pulled By: eellison fbshipit-source-id: 1a9c0ae7f6d1cfddc2ed3499a5af611053ae5e1b (cherry picked from commit cf69ce3d155ba7d334022c42fb2cee54bb088c23)	2022-03-29 18:38:51 +00:00
Raghavan Raman	d3cde6c23c	[NNC] Implementation for aten::cat without conditionals. (#53128 ) Summary: This PR adds an implementation for `aten::cat` in NNC without any conditionals. This version is not enabled by default. Here is the performance of some micro benchmarks with and without conditionals. There is up to 50% improvement in performance without conditionals for some of the shapes. aten::cat implementation in NNC with conditionals ``` $ python -m benchmarks.tensorexpr --device cpu --mode fwd --jit_mode trace --cpu_fusion concat pt: concat2d2input_fwd_cpu_1_160_1_14_1: 5.44 us, SOL 0.26 GB/s, algorithmic 0.51 GB/s pt: concat2d2input_fwd_cpu_1_580_1_174_1: 5.75 us, SOL 1.05 GB/s, algorithmic 2.10 GB/s pt: concat2d2input_fwd_cpu_20_160_20_14_1: 6.87 us, SOL 4.05 GB/s, algorithmic 8.11 GB/s pt: concat2d2input_fwd_cpu_20_580_20_174_1: 14.52 us, SOL 8.31 GB/s, algorithmic 16.62 GB/s pt: concat2d2input_fwd_cpu_8_512_8_512_1: 9.58 us, SOL 6.84 GB/s, algorithmic 13.68 GB/s ``` aten::cat implementation in NNC without conditionals ``` $ python -m benchmarks.tensorexpr --device cpu --mode fwd --jit_mode trace --cpu_fusion --cat_wo_conditionals concat pt: concat2d2input_fwd_cpu_1_160_1_14_1: 4.67 us, SOL 0.30 GB/s, algorithmic 0.60 GB/s pt: concat2d2input_fwd_cpu_1_580_1_174_1: 5.65 us, SOL 1.07 GB/s, algorithmic 2.14 GB/s pt: concat2d2input_fwd_cpu_20_160_20_14_1: 6.10 us, SOL 4.56 GB/s, algorithmic 9.12 GB/s pt: concat2d2input_fwd_cpu_20_580_20_174_1: 7.44 us, SOL 16.22 GB/s, algorithmic 32.44 GB/s pt: concat2d2input_fwd_cpu_8_512_8_512_1: 6.46 us, SOL 10.14 GB/s, algorithmic 20.29 GB/s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/53128 Reviewed By: bertmaher Differential Revision: D26758613 Pulled By: navahgar fbshipit-source-id: 00f56b7da630b42bc6e7ddd4444bae0cf3a5780a	2021-03-07 22:57:02 -08:00
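The "up to 50% improvement" claim in the commit above can be checked against the timings it lists. The sketch below copies the per-shape times (in microseconds) from the two runs and computes the relative reduction; the dictionary and function names are hypothetical, and only the numbers come from the commit message.

```python
# Per-shape timings in microseconds, copied from the commit message.
# Keys are the shape suffixes of the `concat2d2input_fwd_cpu_*` benchmark names.
with_conditionals = {
    "1_160_1_14_1": 5.44,
    "1_580_1_174_1": 5.75,
    "20_160_20_14_1": 6.87,
    "20_580_20_174_1": 14.52,
    "8_512_8_512_1": 9.58,
}
without_conditionals = {
    "1_160_1_14_1": 4.67,
    "1_580_1_174_1": 5.65,
    "20_160_20_14_1": 6.10,
    "20_580_20_174_1": 7.44,
    "8_512_8_512_1": 6.46,
}


def reduction_pct(before_us: float, after_us: float) -> float:
    """Relative reduction in run time, as a percentage of the original time."""
    return 100.0 * (before_us - after_us) / before_us


reductions = {
    shape: reduction_pct(with_conditionals[shape], without_conditionals[shape])
    for shape in with_conditionals
}
best_shape = max(reductions, key=reductions.get)

for shape, pct in reductions.items():
    print(f"{shape}: {pct:.1f}% less time")
print(f"largest reduction: {best_shape}")
```

The largest reduction is for the `20_580_20_174_1` shape, at a little under 49%, which is consistent with the "up to 50%" figure.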
Raghavan Raman	8af648354f	[nnc] Benchmarks for concat (#52592 ) Summary: This PR adds a c++ benchmark for "concat" with 3 different versions - 1) aten::cat, 2) NNC implementation with if-then-else, 3) NNC implementation using multiple loops. It also adds a python benchmark for "concat" which can now be invoked with and without CPU fusion. Here are the results of these benchmarks on a `Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz` machine with `OMP_NUM_THREADS=1` ``` -------------------------------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... -------------------------------------------------------------------------------------------------------------------------- Concat2D2Input/ATen/1/160/1/14/1 1211 ns 1211 ns 567896 GB/s=1.14953G/s Concat2D2Input/ATen/1/580/1/174/1 1296 ns 1296 ns 537060 GB/s=4.65362G/s Concat2D2Input/ATen/20/160/20/14/1 1823 ns 1823 ns 382052 GB/s=15.2677G/s Concat2D2Input/ATen/20/580/20/174/1 3347 ns 3347 ns 210036 GB/s=36.0432G/s Concat2D2Input/ATen/8/512/8/512/1 2093 ns 2093 ns 324760 GB/s=31.3061G/s Concat2D2Input/NNC/1/160/1/14/1 694 ns 694 ns 1002902 GB/s=2.00692G/s Concat2D2Input/NNC/1/580/1/174/1 852 ns 852 ns 803002 GB/s=7.08127G/s Concat2D2Input/NNC/20/160/20/14/1 1639 ns 1639 ns 419683 GB/s=16.9828G/s Concat2D2Input/NNC/20/580/20/174/1 5956 ns 5956 ns 117833 GB/s=20.2548G/s Concat2D2Input/NNC/8/512/8/512/1 3136 ns 3136 ns 224122 GB/s=20.8958G/s Concat2D2Input/NNCLoop/1/160/1/14/1 581 ns 581 ns 1209873 GB/s=2.39737G/s Concat2D2Input/NNCLoop/1/580/1/174/1 614 ns 614 ns 1132332 GB/s=9.82955G/s Concat2D2Input/NNCLoop/20/160/20/14/1 1091 ns 1091 ns 622952 GB/s=25.5247G/s Concat2D2Input/NNCLoop/20/580/20/174/1 2399 ns 2399 ns 288376 GB/s=50.289G/s Concat2D2Input/NNCLoop/8/512/8/512/1 1500 ns 1500 ns 478360 GB/s=43.6968G/s Concat2D3Input/ATen/8/512/8/512/8/512/1 2584 ns 2584 ns 266394 GB/s=38.0397G/s Concat2D3Input/NNC/8/512/8/512/8/512/1 5056 ns 5056 ns 139768 GB/s=19.4416G/s Concat2D3Input/NNCLoop/8/512/8/512/8/512/1 1917 ns 1917 ns 369626 GB/s=51.2758G/s Concat2D7Input/ATen/8/128/8/256/8/384/8/512/8/512/8/512/8/512/1 3888 ns 3888 ns 178124 GB/s=46.3571G/s Concat2D7Input/NNC/8/128/8/256/8/384/8/512/8/512/8/512/8/512/1 24639 ns 24638 ns 28336 GB/s=7.31481G/s Concat2D7Input/NNCLoop/8/128/8/256/8/384/8/512/8/512/8/512/8/512/1 3093 ns 3093 ns 226326 GB/s=58.265G/s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/52592 Reviewed By: bertmaher Differential Revision: D26596701 Pulled By: navahgar fbshipit-source-id: 650fa88febf4423ea49f5a1d3d734edc2294d257	2021-02-24 06:09:32 -08:00
Raghavan Raman	b6ed05130e	Adding a flag to enable CPU fusion in benchmarks (#48612 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48612 Test Plan: python -m benchmarks.tensorexpr --device cpu --mode fwd --jit_mode trace --cpu_fusion element Reviewed By: heitorschueroff Differential Revision: D26548643 Pulled By: navahgar fbshipit-source-id: adb537818d77c9b6b0fe434ae6d963a5f348ad24	2021-02-19 12:11:06 -08:00
Raghavan Raman	12d85b536e	Fixing Softmax bench. (#51898 ) Summary: Fixes and enables the microbenchmark for Softmax. Pull Request resolved: https://github.com/pytorch/pytorch/pull/51898 Reviewed By: gmagogsfm Differential Revision: D26333189 Pulled By: navahgar fbshipit-source-id: be0934e413c4f6728593f896e53a0b31f1657e52	2021-02-09 15:03:49 -08:00
Xiaoqiang Zheng	88b36230f5	Add full reduction benchmark. (#50057 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50057 As part of the effort to calibrate TE reduction performance, adding a full reduction benchmark. Also add a "skip_input_transformation" option. Fixed other reduction benchmarks to accept specific benchmarks that were listed. Test plans: * python -m benchmarks.tensorexpr --device=cpu --mode=fwd reduce_full * python -m benchmarks.tensorexpr --device=cpu --mode=fwd reduce_full_fwd_cpu_16777216_s1 * python -m benchmarks.tensorexpr --device=cpu --mode=fwd reduce_full_fwd_cpu_16777216_s0 * python -m benchmarks.tensorexpr --device=cpu --mode=fwd reduce2d_inner * python -m benchmarks.tensorexpr --device=cpu --mode=fwd reduce2d_inner_fwd_cpu_640_524288 * python -m benchmarks.tensorexpr --device=cpu --mode=fwd reduce2d_outer * python -m benchmarks.tensorexpr --device=cpu --mode=fwd reduce2d_outer_fwd_cpu_640_524288 Test Plan: Imported from OSS Reviewed By: bertmaher Differential Revision: D25774138 Pulled By: zheng-xq fbshipit-source-id: fd4598e5c29991be476e42235a059e8021d4f083	2021-01-21 09:56:46 -08:00
shmsong	56a3831bc6	[NVFuser]Benchmark minor update (#46778 ) Summary: This is a tiny PR for two minor fixes: 1. Added `torch._C._jit_set_texpr_fuser_enabled(False)` to enable shape inference on nv fuser runs. 2. Renamed dynamic benchmark module names to avoid multiple matching. i.e. `simple_element` with `dynamic_simple_element`. I guess it'd be much easier if the pattern matching was based on `startswith`. Would be happy to update that if agreed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/46778 Reviewed By: zhangguanheng66 Differential Revision: D24516911 Pulled By: bertmaher fbshipit-source-id: 839f9a3e058f9d7aca17b2e6eb8b558e0e48e8f4	2020-10-26 12:22:36 -07:00
shmsong	43fe45ab0f	[JIT] Add dynamic shape benchmark for NV Fuser (#46107 ) Summary: This PR modifies `benchmarks/tensorexpr`. It follows up https://github.com/pytorch/pytorch/pull/44101 and further supports characterizing fusers with dynamic shape benchmarks. The dynamic shape condition models the use case where the input tensor shape changes in each call to the graph. Changes include: Added an auxiliary class `DynamicShape` that provides a simple API for enabling dynamic shapes in existing test cases; an example can be found in `DynamicSimpleElementBench`. Created new bench_cls: `DynamicSimpleElementBench`, `DynamicReduce2DInnerBench`, `DynamicReduce2DOuterBench`, and `DynamicLSTM`. They are all dynamic shaped versions of existing benchmarks and examples of enabling dynamic shape with `DynamicShape`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/46107 Reviewed By: glaringlee Differential Revision: D24229400 Pulled By: bertmaher fbshipit-source-id: 889fece5ea87d0f6f6374d31dbe11b1cd1380683	2020-10-09 22:09:21 -07:00
Kevin Stephano	26a91a9f04	[WIP][JIT] Add benchmarking support of NV Fuser with FP16 dtype support (#44101 ) Summary: Modified files in `benchmarks/tensorexpr` to add support for NVIDIA's Fuser for the jit compiler. This support has some modifications besides adding an option to support the NVIDIA fuser: * Adds FP16 Datatype support * Fixes SOL/Algo calculations to generally use the data type instead of being fixed to 4 bytes * Adds IR printing and kernel printing knobs * Adds a knob `input_iter` to create ranges of inputs currently only for reductions * Adds further reduction support for Inner and Outer dimension reductions that are compatible with the `input_iter` knob. * Added `simple_element`, `reduce2d_inner`, and `reduce2d_outer` to isolate performance on elementwise and reduction operations in the most minimal fashion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44101 Reviewed By: ngimel Differential Revision: D23713658 Pulled By: bertmaher fbshipit-source-id: d6b83cfab559aefe107c23b3c0f2df9923b3adc1	2020-09-15 15:10:49 -07:00
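The SOL/Algo fix in the commit above (using the data type's size instead of a fixed 4 bytes) can be sketched as follows. This is a hedged illustration rather than the harness's actual code: `DTYPE_BYTES` and `bandwidth_gb_per_s` are hypothetical names, 1 GB is assumed to mean 1e9 bytes, and the element count of 348 is an assumed total (174 elements read plus 174 written) for a 1x160 + 1x14 concat.

```python
# Bytes per element for a few dtypes. The names mirror torch dtype names,
# but this sketch deliberately does not import torch.
DTYPE_BYTES = {"float16": 2, "float32": 4, "float64": 8}


def bandwidth_gb_per_s(num_elements: int, dtype: str, time_us: float) -> float:
    """Bandwidth in GB/s (assuming 1 GB = 1e9 bytes) for moving
    `num_elements` elements of `dtype` in `time_us` microseconds."""
    bytes_moved = num_elements * DTYPE_BYTES[dtype]
    seconds = time_us * 1e-6
    return bytes_moved / seconds / 1e9


# Assumed workload: 174 elements read + 174 elements written, in 5.44 us.
fp32 = bandwidth_gb_per_s(348, "float32", 5.44)
fp16 = bandwidth_gb_per_s(348, "float16", 5.44)
print(f"float32: {fp32:.2f} GB/s, float16: {fp16:.2f} GB/s")
```

For the same element count and time, the FP16 figure is half the FP32 one, which is why a hard-coded 4 bytes would overstate bandwidth for half-precision runs.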
Bert Maher	33d51a9b32	Respect canFuseOn{CPU,GPU} in TE fuser (#43967 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43967 Test Plan: Imported from OSS Reviewed By: asuhan Differential Revision: D23469048 Pulled By: bertmaher fbshipit-source-id: 1005a7ae08974059ff9d467492caa3a388070eeb	2020-09-02 18:00:25 -07:00
Bert Maher	b8ae563ce6	Add a microbenchmark for LSTM elementwise portion (#42901 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42901 Test Plan: Imported from OSS Reviewed By: ZolotukhinM Differential Revision: D23079714 Pulled By: bertmaher fbshipit-source-id: 28f8c3b5019ee898e82e64a0a674da1b4736d252	2020-08-12 17:11:47 -07:00
Bert Maher	33d209b5f4	Fix TE microbenchmark harness to use appropriate fuser/executor (#42900 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42900 Test Plan: Imported from OSS Reviewed By: ZolotukhinM Differential Revision: D23079715 Pulled By: bertmaher fbshipit-source-id: 6aa2b08a550835b7737e355960a16a7ca83878ea	2020-08-12 17:11:44 -07:00
Mikhail Zolotukhin	9fe3b1857d	[TensorExpr] Fix imports in tensorexpr benchmarks. (#35830 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35830 Test Plan: Imported from OSS Differential Revision: D20799464 Pulled By: ZolotukhinM fbshipit-source-id: 1b5981ad15042f601a9b6eb01a799cdf71200666	2020-04-01 14:23:33 -07:00
Bram Wasti	a3e10d2a17	Expose enablement of TensorExpr fuser as env variable (#35341 ) Summary: This commit allows one to use an environment variable to enable the fuser in torch/csrc/jit/tensorexpr/ ``` PYTORCH_TENSOREXPR=1 python benchmark.py ``` This commit also changes the registration to happen by default, removing the requirement for the python exposed "_jit_register_tensorexpr_fuser" Pull Request resolved: https://github.com/pytorch/pytorch/pull/35341 Reviewed By: ZolotukhinM Differential Revision: D20676348 Pulled By: bwasti fbshipit-source-id: 4c997cdc310e7567c03905ebff72b3e8a4c2f464	2020-03-26 14:31:57 -07:00
Mikhail Zolotukhin	8998a1b3d3	Add tensorexpr benchmarks. (#35064 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35064 Test Plan: Imported from OSS Differential Revision: D20543695 Pulled By: ZolotukhinM fbshipit-source-id: 1cf294ab19465cb93557c2b195252c739b40a0f7	2020-03-20 12:01:31 -07:00

21 Commits