pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-11-04 08:00:58 +08:00

Author	SHA1	Message	Date
Xuehai Pan	ba3b05fdf3	[1/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort stdlib (#127122 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127122 Approved by: https://github.com/kit1980	2024-05-25 08:25:50 +00:00
Yu, Guangye	c09205a057	Deprecate device-specific GradScaler autocast API (#126527 ) # Motivation ## for `torch.amp.GradScaler`, - `torch.cpu.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cpu", args...)`. - `torch.cuda.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cuda", args...)`. So, we intend to deprecate them and strongly recommend that developers use `torch.amp.GradScaler`. ## for `custom_fwd` and `custom_bwd`, this is a good solution to make the custom function run with or without effect even in an autocast-enabled region and can be shared by other backends, like CPU and XPU. So we generalize it to be device-agnostic, put them in `torch/amp/autocast_mode.py`, and re-expose them as `torch.amp.custom_fwd` and `torch.amp.custom_bwd`. Meanwhile, we deprecate `torch.cuda.amp.custom_fwd` and `torch.cuda.amp.custom_bwd`. # Additional Context Add UT to cover the deprecation warning. No need for more UTs to cover the functionality of `torch.amp.custom_f/bwd`; the existing UTs that previously covered the functionality of `torch.cuda.amp.custom_f/bwd` can cover them. To facilitate the review, we separate these code changes into two PRs. The first PR covers `torch.amp.GradScaler`. The follow-up covers `custom_fwd` and `custom_bwd`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126527 Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/janeyx99, https://github.com/EikanWang	2024-05-25 06:41:34 +00:00
Xu Zhao	1e818db547	[torchbench] Fix torchao benchmarking script (#126736 ) As the title says. Test Plan: ``` python benchmarks/dynamo/torchbench.py --only BERT_pytorch --bfloat16 --quantization int8dynamic --performance --inference --print-memory cuda eval BERT_pytorch [XZ Debug] Torch grad status: False memory: eager: 0.82 GB, dynamo: 0.92 GB, ratio: 0.89 running benchmark: 100% 1.001x ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126736 Approved by: https://github.com/jerryzh168, https://github.com/huydhn	2024-05-21 23:15:12 +00:00
Xu Zhao	2068dadbe8	[torchbench] Add torchao to PT2 Benchmark Runner (#126469 ) Summary: X-link: https://github.com/pytorch/benchmark/pull/2268 Support torchao performance and accuracy tests in PT2 Benchmark Runner, using the inductor backend as the baseline. Test Plan: ``` $ buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench -- --only BERT_pytorch --bfloat16 --quantization int8dynamic --performance --inference --print-memory loading model: 0it [00:50, ?it/s] cuda eval BERT_pytorch memory: eager: 0.75 GB, dynamo: 0.75 GB, ratio: 1.00 running benchmark: 100% 1.003x ``` Reviewed By: jerryzh168 Differential Revision: D57463273 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126469 Approved by: https://github.com/huydhn	2024-05-20 17:53:44 +00:00
Matthew Hoffman	81277baa0c	Remove removed ruff rule TRY200 (#126256 ) My TOML linter is complaining that "TRY200" is not acceptable for the `tool.ruff.lint` schema. From the ruff docs: https://docs.astral.sh/ruff/rules/reraise-no-cause/ > This rule has been removed and its documentation is only available for historical reasons. > > This rule is identical to [B904](https://docs.astral.sh/ruff/rules/raise-without-from-inside-except/) which should be used instead. and we are currently explicitly ignoring B904. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126256 Approved by: https://github.com/Skylion007	2024-05-17 16:31:05 +00:00
Stonepia	5756b53dd8	[XPU] call empty_cache for dynamo tests (#126377 ) When running a batch of models, lacking `empty_cache()` would result in OOM for subsequent models. This PR unifies the `empty_cache` call for both CUDA and XPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126377 Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire	2024-05-17 06:05:51 +00:00
Sam Larsen	c87c39d935	[benchmarking] Suppress csv creation on cold-start phase of --warm-start-latency (#125953 ) Summary: It seems that most (all?) of our utilities for examining benchmark output expect single-line entries per benchmark. The way the --warm-start-latency flag is currently implemented, it means that we'll see two entries for every benchmark run (one for the warm-up run and one for the actual run). This PR adds a --disable-output flag that we can use for the first run to suppress populating the csv. This way, the existing utilities like `benchmarks/dynamo/check_accuracy.py` will function without any changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125953 Approved by: https://github.com/desertfire ghstack dependencies: #125917	2024-05-15 05:32:06 +00:00
Sam Larsen	9f0d3f71c9	Adjust number of repeats when using --warm-start-latency benchmark flag (#125917 ) Summary: In --warm-start-latency mode, we can just perform the cache-warmup run once instead of whatever was provided with --repeat Pull Request resolved: https://github.com/pytorch/pytorch/pull/125917 Approved by: https://github.com/desertfire	2024-05-15 05:32:06 +00:00
Sam Larsen	966ebd2e24	Add --warm-start-latency to benchmark harness (#125353 ) Summary: This change introduces a new flag to perform a "warm start" test from the benchmark harness. The idea is to test a model twice: first with a fresh inductor cache (i.e., a "cold start"), and then a second run in a fresh process with the cache available (i.e. a "warm start"). We can later add this mode to CI runs to collect compile times for warm start. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125353 Approved by: https://github.com/eellison, https://github.com/desertfire	2024-05-09 21:12:15 +00:00
Yu, Guangye	d17be10df1	make torch.amp.autocast more generic (#125103 ) # Motivation As discussed in [#124479](https://github.com/pytorch/pytorch/pull/124479), `torch.amp.autocast` can NOT be completely equivalent to `torch.cuda.amp.autocast` and `torch.cpu.amp.autocast` since `torch.amp.autocast` does NOT have the default `dtype` for CPU (`torch.bfloat16` by default) and CUDA (`torch.float16` by default) respectively. We would like `torch.amp.autocast` to be more generic to help the developer/customer write device-agnostic code, because there are not enough reasons to add a device-specific autocast `torch.xxx.amp.autocast` for each device backend. # Solution When `None` is passed to `dtype`, we should use `torch.get_autocast_dtype` to get the related dtype for each backend. Meanwhile, `torch.get_autocast_dtype` needs to be supported in the JIT path for BC. # Additional Context With this PR, `torch.amp.autocast(device_type='cuda')` is equivalent to `torch.cuda.amp.autocast`. Add two new UTs to cover this change in the eager and JIT paths respectively. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125103 Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/gujinghui	2024-05-08 12:13:26 +00:00
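A minimal sketch of the device-agnostic usage that #125103 aims for, assuming a CPU-only environment: no `dtype` is passed, so autocast is expected to fall back to the backend default (`torch.bfloat16` on CPU, as stated in the message). The tensor shapes and the choice of `torch.mm` are illustrative assumptions.

```python
import torch

a = torch.randn(4, 4)
b = torch.randn(4, 4)

# No dtype argument: autocast uses the default autocast dtype of the backend
# named by device_type.
with torch.amp.autocast(device_type="cpu"):
    out = torch.mm(a, b)  # matmul is an op that autocast runs in lower precision

print(out.dtype)
```

The same `with` line should work for another backend by changing only the `device_type` string.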
BowenBao	a3d97f6ce4	[ONNX] Benchmark onnx export w/ ort fusions (#125700 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125700 Approved by: https://github.com/thiagocrepaldi	2024-05-08 01:10:05 +00:00
Animesh Jain	f04c8471a4	[dynamo][prepare for nn module guards] Guard nn modules for a few benchmarks (#125324 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125324 Approved by: https://github.com/jansel ghstack dependencies: #125439, #125421, #124522	2024-05-04 22:08:56 +00:00
Edward Z. Yang	e93b57a570	Add propagate_real_tensors mode for unbacked (#125115 ) A common complaint when working with data-dependent code in PyTorch is that it's hard to tell how far you are from the finish line: every time a GuardOnDataDependentSymNode error is hit, you have to somehow fix or workaround it to see the next one. This PR adds a new mode `torch._functorch.config.fake_tensor_propagate_real_tensors` which modifies fake tensors to also propagate real tensors. This means that when we try to guard on a data-dependent SymNode, we can actually produce a real result. We also produce a warning which you should consult to figure out what the crux points are. I ran this on vision_maskrcnn. In the baseline (without this mode), the model has 27 graph breaks, resulting in 40 graphs. With this mode on, the model has only 11 graph breaks, resulting in 15 graphs (the remaining graph breaks are due to missing functionality for item() on float tensor and some other Dynamo missing features.) You get a list of things that would have errored like this: ``` WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False 
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> False WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> False WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> True WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> False WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> False ``` Potential later follow ups: * Improve the warning messages (in particular, should provide user frames) * GC real tensors when they are no longer needed by tracing. Right now, this will use A LOT of memory, equal to as if your GC was broken and every intermediate tensor was kept live Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125115 Approved by: https://github.com/IvanKobzarev	2024-05-02 15:28:26 +00:00
Aaron Gokaslan	e3b9b71684	[BE]: Ruff - TRY401 - Avoid verbose exception logging (#125126 ) Don't bother logging exception obj explicitly with logger, it's captured anyway and would generate verbose outputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125126 Approved by: https://github.com/ezyang	2024-04-28 21:44:33 +00:00
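To make the TRY401 change concrete, here is a small sketch of the pattern; the logger name and the `parse_port` helper are hypothetical and not taken from the PyTorch code base.

```python
import logging
from typing import Optional

logger = logging.getLogger("try401_example")  # hypothetical logger name


def parse_port(raw: str) -> Optional[int]:
    """Parse a port number, logging (not raising) on bad input."""
    try:
        return int(raw)
    except ValueError:
        # Flagged style: logger.exception("bad port %r: %s", raw, err)
        # logger.exception already attaches the active exception and its
        # traceback, so the message only carries context the exception lacks.
        logger.exception("bad port %r", raw)
        return None
```

The log record still carries the full exception through `exc_info`, so nothing is lost by dropping it from the message.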
Stonepia	3d8585e501	[XPU] Add manual_seed and synchronize method (#124709 ) This PR sets the following device-specific settings for XPU (Intel GPU): 1. Set the manual seed for xpu 2. Set the synchronization method for xpu Pull Request resolved: https://github.com/pytorch/pytorch/pull/124709 Approved by: https://github.com/EikanWang, https://github.com/desertfire	2024-04-26 12:32:12 +00:00
Simon Fan	14430564ce	[cudagraphs] add cudagraph_skips counter (#124804 ) used in tests and benchmark csv Pull Request resolved: https://github.com/pytorch/pytorch/pull/124804 Approved by: https://github.com/eellison ghstack dependencies: #119729, #124700	2024-04-26 03:22:29 +00:00
PyTorch MergeBot	154157416c	Revert "[cudagraphs] add cudagraph_skips counter (#124804 )" This reverts commit fdad16b85108209bc021107f312f4b221422a012. Reverted https://github.com/pytorch/pytorch/pull/124804 on behalf of https://github.com/jeanschmidt due to one PR in this stack seems to have broken linux pull cuda12 tests ([comment](https://github.com/pytorch/pytorch/pull/119729#issuecomment-2076750595))	2024-04-25 09:26:25 +00:00
Simon Fan	fdad16b851	[cudagraphs] add cudagraph_skips counter (#124804 ) used in tests and benchmark csv Pull Request resolved: https://github.com/pytorch/pytorch/pull/124804 Approved by: https://github.com/eellison ghstack dependencies: #119729, #124700	2024-04-25 03:38:09 +00:00
eellison	000d55870a	Enable in oss (#124031 ) Biggest movement is 4% HF inference, 9% TIMM inference. Note, this is max-autotune mode so we are more tolerant of compilation increases. We could improve compilation time by limiting: ``` # Take how many of the top triton kernels to benchmark epilogue max_epilogue_benchmarked_choices = 3 ``` There is a hf_Whisper failure which you can repro on main without this stack with `TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --accuracy --training --only hf_Whisper`. When you turn off epilogue fusion, it fixes the accuracy. I bisected the failure to an epilogue, however when you compare the results of that epilogue with the corresponding separate kernels the results of the output are equivalent. Inference: <img width="1686" alt="image" src="https://github.com/pytorch/pytorch/assets/11477974/0b240080-cd33-4c08-89d3-583103b1fb0c"> Training: <img width="1329" alt="Screenshot 2024-04-16 at 6 16 30 PM" src="https://github.com/pytorch/pytorch/assets/11477974/db0afcc9-7288-4c27-84ce-4fc1a5690788"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124031 Approved by: https://github.com/Chillee, https://github.com/shunting314 ghstack dependencies: #124030, #122642, #123229, #122825	2024-04-19 20:28:55 +00:00
Sam Larsen	290e3e7abb	Add ability to save TORCH_COMPILE_DEBUG logs for CI failures (#124408 ) Summary: The intent is that we can whitelist certain benchmarks to a) enable TORCH_COMPILE_DEBUG=1, and b) save the generated artifacts in test/debug in case of a failure. Via the rules in action.yml, we can then upload test/debug/ to S3 whenever it exists. I chose to introduce a new directory (test/debug/) rather than using an existing one (e.g., test/test-reports/), because these don't seem like test reports and we can later add other debug-related artifacts if we find it useful. For example, we might want to later explore including the inductor cache artifacts. Test Plan: See artifacts generated when I force a failure: https://hud.pytorch.org/pr/124234 Specifically: https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/8729891826/1/artifact/debug-test-inductor_torchbench-2-2-linux.g5.4xlarge.nvidia.gpu_23953679574.zip Pull Request resolved: https://github.com/pytorch/pytorch/pull/124408 Approved by: https://github.com/desertfire	2024-04-19 02:46:00 +00:00
Simon Fan	7c94652d7d	[benchmarks] Add --use-warm-peak-memory (#124326 ) Measuring peak memory on the first run can capture cases where compiled artifacts leak into runtime, but it also introduces a lot of noise from cudnn/triton autotuning which generally uses as much memory as it can. Setting this flag as a default will need some discussion, so I will only add it to unblock compiled backward benchmarking (where all autotuning memory use is exposed) ``` e.g. resnet50 # without --warm-peak-memory memory: eager: 1.95 GB, dynamo: 6.68 GB, ratio: 0.29 # with --warm-peak-memory memory: eager: 1.96 GB, dynamo: 2.06 GB, ratio: 0.95 ``` ![image](https://github.com/pytorch/pytorch/assets/9547562/36cd8687-a7f7-4ec6-b989-7e1263aa7d37) This issue may also affect large models. Here's an example case of cudnn_convolution_backward autotuning allocating 30GB to tune a model otherwise using 5GB memory: ![image](https://github.com/pytorch/pytorch/assets/9547562/4e544b11-3579-4c69-811a-91d896f1ba66) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124326 Approved by: https://github.com/jansel ghstack dependencies: #119411	2024-04-18 02:57:01 +00:00
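The `ratio` printed in the resnet50 output above appears to be eager peak memory divided by compiled peak memory. That reading is inferred from the quoted numbers, not from the benchmark source, so the helper below is a hypothetical check rather than the harness's own code.

```python
def memory_ratio(eager_gb: float, dynamo_gb: float) -> float:
    """Eager peak memory over compiled peak memory (assumed definition)."""
    return eager_gb / dynamo_gb


# Figures quoted in the commit message for resnet50.
without_flag = memory_ratio(1.95, 6.68)  # first-run peak, includes autotuning
with_flag = memory_ratio(1.96, 2.06)     # warm peak

print(f"{without_flag:.2f} {with_flag:.2f}")
```

Under this reading, a ratio well below 1 means the compiled run peaked far above eager, which is the autotuning noise the flag is meant to exclude.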
Simon Fan	0ddd17bdc6	[benchmarks] Add --snapshot-memory to get memory pickles for eager vs compiled (#119411 ) creates memory snapshot pickles e.g. ``` inductor_no_cudagraphs_torchbench_amp_training_cuda_performance_compiled_pytorch_stargan.pickle inductor_no_cudagraphs_torchbench_amp_training_cuda_performance_eager_pytorch_stargan.pickle ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/119411 Approved by: https://github.com/jansel	2024-04-18 02:57:01 +00:00
Xuehai Pan	93e249969b	[BE] enable `ruff` rule `RSE` and remove useless parentheses in `raise` statements (#124261 ) Remove useless parentheses in `raise` statements if the exception type is raised with no argument. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261 Approved by: https://github.com/albanD	2024-04-17 19:29:34 +00:00
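For readers unfamiliar with the `RSE` rule, the sketch below shows the kind of rewrite it asks for; `check_positive` is a hypothetical function used only for illustration.

```python
def check_positive(value: int) -> int:
    """Return value unchanged, or raise if it is not positive."""
    if value <= 0:
        # Before the lint fix this would read: raise ValueError()
        raise ValueError  # Python instantiates the class when it is raised
    return value
```

Both spellings raise an exception instance with empty `args`, so the change is purely stylistic.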
chunyuan	ec00daf4f1	[aotinductor] Fix benchmarks with self.autocast for run_performance_test (#123699 ) ## Pitch Similar to https://github.com/pytorch/pytorch/pull/110490 which fixes the `self.autocast` in the `check_accuracy` function, this PR fixes the `self.autocast` context in the `run_performance_test` function. ## Description The code inside `check_accuracy` after the fix on https://github.com/pytorch/pytorch/pull/110490: `a4a49f77b8/benchmarks/dynamo/common.py (L2490-L2500)` The current code on main branch before this PR in `run_performance_test` does not have the `self.autocast` context: `a4a49f77b8/benchmarks/dynamo/common.py (L2685-L2692)` For eager mode, the `model_iter_fn` (which is actually [forward_pass](`e8ad5460c0/benchmarks/dynamo/huggingface.py (L556-L558)`)) is used in [warmup](`e8ad5460c0/benchmarks/dynamo/common.py (L2690)`) and [speedup_experiment](`e8ad5460c0/benchmarks/dynamo/common.py (L648)`). The `forward_pass` has the `self.autocast` context thus it could run into BF16 when AMP is on. While for AOTInductor, we will call `export_aot_inductor` in both [warmup](`e8ad5460c0/benchmarks/dynamo/common.py (L2695)`) and [speedup_experiment](`e8ad5460c0/benchmarks/dynamo/common.py (L644-L646)`), which doesn't have the `autocast` context thus will always run into FP32. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123699 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-04-11 01:40:44 +00:00
angelayi	298171df5c	[benchmark] Add namedtuple pytree serialization (#123648 ) Fixes https://github.com/pytorch/pytorch/pull/123388#issuecomment-2045289729 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123648 Approved by: https://github.com/desertfire	2024-04-09 22:25:36 +00:00
Tugsbayasgalan Manlaibaatar	d78991a738	Make torch_geometric models compatible with export (#123403 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123403 Approved by: https://github.com/angelayi	2024-04-05 20:58:16 +00:00
PyTorch MergeBot	8c7d8f0ff2	Revert "Make torch_geometric models compatible with export (#123403 )" This reverts commit 2ffab6e663b9c6951048b8c8ba82d2cc5ca5c2fc. Reverted https://github.com/pytorch/pytorch/pull/123403 on behalf of https://github.com/atalman due to Related issue basic_gnn_gin ([comment](https://github.com/pytorch/pytorch/pull/123403#issuecomment-2039817292))	2024-04-05 13:34:41 +00:00
Tugsbayasgalan Manlaibaatar	2ffab6e663	Make torch_geometric models compatible with export (#123403 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123403 Approved by: https://github.com/angelayi	2024-04-05 05:26:01 +00:00
Angela Yi	482d8bf1ea	[aoti] Change aot_compile callsites (#122225 ) Summary: Replacing `torch._export.aot_compile` callsites with ``` ep = torch.export._trace._export(.., predispatch=True) # Traces the given program into predispatch IR so_path = torch._inductor.aot_compile_ep(ep, ...) # Takes an exported program and compiles it into a .so ``` This allows us to explicitly split up the export step from AOTInductor. We can later modify tests to do `export + serialize + deserialize + inductor` to mimic internal production use cases better. Test Plan: CI Differential Revision: D54808612 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122225 Approved by: https://github.com/SherlockNoMad, https://github.com/khabinov	2024-03-29 21:34:20 +00:00
eellison	ba69dc6675	[Easy] add option to print compilation time (#121996 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121996 Approved by: https://github.com/davidberard98	2024-03-18 22:42:41 +00:00
Animesh Jain	cd1751b14f	[dynamo] Measure Dynamo cache latency lookup (#121604 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121604 Approved by: https://github.com/jansel ghstack dependencies: #121614, #121622	2024-03-12 17:09:11 +00:00
James Wu	ae22bdaefe	Update torchbench commit pin, add sam_fast benchmark (#121420 ) After this, the sam_fast benchmark can now be run in the pytorch repo: ``` SEGMENT_ANYTHING_FAST_USE_FLASH_4=0 benchmarks/dynamo/torchbench.py --inference --amp --performance --backend=inductor --explain --only sam_fast ``` sam_fast is designed for inference only, with cuda and amp on. The code adds these restrictions to the benchmark. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121420 Approved by: https://github.com/oulgen, https://github.com/msaroufim	2024-03-11 19:48:53 +00:00
Sun, Jiayi	ee557d8f61	skip detectron2_fcos_r_50_fpn in dynamic shape test (#120697 ) As reported in https://github.com/pytorch/pytorch/issues/119434, `detectron2_fcos_r_50_fpn` failed with dynamic shape testing, so we propose to skip the dynamic batch size testing of this model in this PR. * Error msg is ``` File "/home/jiayisun/pytorch/benchmarks/dynamo/common.py", line 3877, in run assert marked, f"nothing in example_inputs had a dim with {batch_size}" AssertionError: nothing in example_inputs had a dim with 4 ``` * Root Cause is Benchmark code will only annotate an input's dim as dynamic when its size equals the batch size `c617e7b407/benchmarks/dynamo/common.py (L3867-L3871)`. If it fails to find any dim equal to the batch size, the above error is thrown. However, the inputs of `detectron2_fcos_r_50_fpn` are as follows: ``` ([{'file_name': '/home/jiayisun/benchmark/torchbenchmark/data/.data/coco2017-minimal/coco/val2017/000000001268.jpg', 'height': 427, 'width': 640, 'image_id': 1268, 'image': tensor([[[147., 124., 82., ..., 3., 4., 5.], [125., 104., 65., ..., 3., 3., 4.], [ 87., 68., 34., ..., 2., 2., 2.], ..., [ 47., 45., 41., ..., 45., 45., 45.], [ 46., 44., 40., ..., 44., 45., 46.], [ 46., 44., 40., ..., 43., 45., 46.]], [[154., 129., 84., ..., 3., 4., 5.], [133., 110., 69., ..., 3., 3., 4.], [ 95., 76., 43., ..., 2., 2., 2.], ..., [ 44., 42., 38., ..., 34., 37., 39.], [ 43., 41., 37., ..., 35., 39., 41.], [ 43., 41., 37., ..., 35., 40., 43.]], [[171., 140., 85., ..., 3., 4., 5.], [147., 120., 71., ..., 3., 3., 4.], [103., 83., 47., ..., 2., 2., 2.], ..., [ 46., 44., 40., ..., 16., 20., 22.], [ 45., 43., 39., ..., 17., 22., 26.], [ 45., 43., 39., ..., 18., 24., 28.]]])}, ... ],) ``` None of the inputs' dims will equal the input batch size, so I think we may need to skip the dynamic batch size testing for this model. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120697 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/desertfire	2024-03-05 12:12:18 +00:00
PyTorch MergeBot	368f242e37	Revert "[PT2D] Make the speedup benchmark works with DDP + CompiledAutograd (#120454 )" This reverts commit 8c2e569928a200893fe971e615b82a2f9ce32630. Reverted https://github.com/pytorch/pytorch/pull/120454 on behalf of https://github.com/desertfire due to breaks nightly dashboard cudagraphs run ([comment](https://github.com/pytorch/pytorch/pull/120454#issuecomment-1975001824))	2024-03-03 02:58:47 +00:00
Shunting Zhang	c4ed456fc3	[inductor] fix accuracy failure for a few models under freezing (#121054 ) Fix https://github.com/pytorch/pytorch/issues/120545 . The reason why these models fail the accuracy test with freezing is the conv-batchnorm fusion. Conv-batchnorm fusion causes relatively big numerical churn. For the failed TIMM models, raising the tolerance to `8 * 1e-2` can make the test pass. For the failed TB models, the numerical difference is too large. Having had a discussion with @eellison , we decided to skip them with freezing for now. On the other hand, we probably should dig more into why the conv-bn fusion causes such a large numerical difference. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121054 Approved by: https://github.com/eellison	2024-03-02 04:53:59 +00:00
Chien-Chin Huang	8c2e569928	[PT2D] Make the speedup benchmark works with DDP + CompiledAutograd (#120454 ) With DDP + CompiledAutograd, we could not use the same parallelized model to do the test. This PR copies the model. Differential Revision: [D54094257](https://our.internmc.facebook.com/intern/diff/D54094257/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120454 Approved by: https://github.com/yf225, https://github.com/xmfan	2024-03-01 08:35:22 +00:00
leslie-fang-intel	950b484356	skip three pyhpc models with dynamic shape test (#120599 ) As reported in https://github.com/pytorch/pytorch/issues/119434, `pyhpc_isoneutral_mixing`, `pyhpc_equation_of_state` and `pyhpc_turbulent_kinetic_energy` failed with dynamic shape testing, so we propose to skip the dynamic batch size testing of these 3 models in this PR. * Error msg is ``` File "/localdisk/leslie/torch_inductor_community/pytorch/benchmarks/dynamo/common.py", line 3879, in run assert marked, f"nothing in example_inputs had a dim with {batch_size}" AssertionError: nothing in example_inputs had a dim with 1048576 ``` * Root Cause is * Benchmark code will only annotate an input's dim as dynamic when its size equals the batch size `c617e7b407/benchmarks/dynamo/common.py (L3867-L3871)`. If it fails to find any dim equal to the batch size, the above error is thrown. * However, for these 3 models, none of the inputs' dims will equal the input batch size because of the [relationship of dim sizes](`26b85eadde/torchbenchmark/models/pyhpc_equation_of_state/__init__.py (L12-L16)`) ``` shape = ( math.ceil(2 * size ** (1/3)), math.ceil(2 * size ** (1/3)), math.ceil(0.25 * size ** (1/3)), ) ``` * Another thing is that `pyhpc_isoneutral_mixing` and `pyhpc_equation_of_state` can pass the dynamic batch size accuracy testing, because the batch size has been set to 4 in accuracy testing (`c617e7b407/benchmarks/dynamo/common.py (L3456)`) and `math.ceil(2 * size ** (1/3))` happens to equal 4. * Since the dim sizes of the input have the above relationship, to run these models with dynamic shapes we may need to annotate `dim[0](s0) = dim[2](s1) * 8`; per the discussion in https://github.com/pytorch/pytorch/issues/117477#issuecomment-1897108756 @avikchaudhuri, it looks like this case is not expressible. So, I think we may need to skip the dynamic batch size testing for these 3 models. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120599 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-02-29 00:38:06 +00:00
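The shape argument in #120599 can be checked with a few lines of arithmetic. In the sketch below, `pyhpc_shape` copies the formula quoted in the message, while `has_batch_dim` is a hypothetical stand-in for the benchmark's "mark dims equal to the batch size" check, not the actual code in `common.py`.

```python
import math


def pyhpc_shape(size: int) -> tuple:
    """Shape formula quoted in the commit message."""
    return (
        math.ceil(2 * size ** (1 / 3)),
        math.ceil(2 * size ** (1 / 3)),
        math.ceil(0.25 * size ** (1 / 3)),
    )


def has_batch_dim(shape: tuple, batch_size: int) -> bool:
    # Hypothetical stand-in: True if any dim could be marked dynamic.
    return any(dim == batch_size for dim in shape)


# 4 is the accuracy-test batch size; 1048576 is the size from the error message.
for size in (4, 1048576):
    shape = pyhpc_shape(size)
    print(size, shape, has_batch_dim(shape, size))
```

With `size = 4` the first two dims coincide with the batch size, which is why the accuracy run passes for two of the models, whereas with `size = 1048576` no dim matches and the assertion fires.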
leslie-fang-intel	c617e7b407	Add resnet50/mobilenet_v2_quantized_qat in into deterministic_algorithms exclusive list (#120384 ) After PR: https://github.com/pytorch/pytorch/pull/120026, 2 `Torchbench` testcases, `resnet50_quantized_qat` and `mobilenet_v2_quantized_qat`, can pass the performance test but fail the accuracy test. The failure msg is: `mobilenet_v2_quantized_qat, RuntimeError: quantized_resize_cpu_ does not have a deterministic implementation but you set 'torch.use_deterministic_algorithms(True)'. ` - `torch.use_deterministic_algorithms(True)` is only set for the accuracy test. `fff9d98e58/benchmarks/dynamo/common.py (L3480)` - However, `quantized_resize_cpu_` only supports `nondeterministic_algorithms` because the resized output memory may be uninitialized. `fff9d98e58/aten/src/ATen/native/quantized/cpu/TensorOperators.cpp (L85-L87)` Add these 2 models into the deterministic_algorithms exclusive model list in this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120384 Approved by: https://github.com/desertfire, https://github.com/jgong5	2024-02-26 05:05:43 +00:00
Chien-Chin Huang	c0e5cca4f8	[DDP] Change the --no-optimize-ddp flag to reflect the latest usage (#119437 ) Compiled DDP now has 4 different optimization modes. This PR changes the Dynamo benchmark flag to reflect that change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119437 Approved by: https://github.com/wconstab, https://github.com/xmfan	2024-02-13 16:53:56 +00:00
BowenBao	30f43e3d89	[ONNX][bench] Deepcopy model to another device before export to avoid OOM (#118710 ) Prior to onnx export, the model is deepcopied to avoid modifications that may affect later performance profiling. However this increases the memory requirement on the device. This PR modifies the script to deepcopy and export the model on another device when possible. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118710 Approved by: https://github.com/thiagocrepaldi	2024-01-31 23:03:39 +00:00
Simon Fan	ed0ec2e0be	Remove dynamo runner's dependency on distributed build (#117903 ) So that we can bisect faster without needing to rebuild distributed module. We remove the annotation to avoid flake8 undefined name lint Pull Request resolved: https://github.com/pytorch/pytorch/pull/117903 Approved by: https://github.com/xuzhao9	2024-01-24 06:51:14 +00:00
Bin Bao	4d625c1c92	[AOTI] Fix a bug in the torch._export.aot_load API (#118039 ) Summary: tree_flatten_spec should use args instead of *args clone of https://github.com/pytorch/pytorch/pull/117948 but with some fbcode specific changes Test Plan: CI Differential Revision: D52982401 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118039 Approved by: https://github.com/angelayi	2024-01-23 14:54:02 +00:00
Michael Lazos	f302a0d380	Re-enable SGD (#117434 ) Re-enables the SGD optimizer now that compile times are more reasonable. [Benchmark run](https://github.com/pytorch/pytorch/actions/runs/7511073761) Pull Request resolved: https://github.com/pytorch/pytorch/pull/117434 Approved by: https://github.com/anijain2305, https://github.com/janeyx99	2024-01-19 04:28:50 +00:00
Bin Bao	26956980c6	[AOTI] Add torch._export.aot_load (#117610 ) Summary: Add a torch._export.aot_load API that can load an AOTInductor-compiled model.so into a python executable. Test Plan: CI Differential Revision: D52825456 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117610 Approved by: https://github.com/angelayi, https://github.com/khabinov, https://github.com/chenyang78	2024-01-18 15:02:16 +00:00
PyTorch MergeBot	b0084be114	Revert "Re-enable SGD (#117434 )" This reverts commit e7fac72be75a9fa7a31c6fc8062364fdfc4aaa3a. Reverted https://github.com/pytorch/pytorch/pull/117434 on behalf of https://github.com/lezcano due to breaks test_profiler.py when run with dynamo ([comment](https://github.com/pytorch/pytorch/pull/117434#issuecomment-1898311961))	2024-01-18 11:37:36 +00:00
Michael Lazos	e7fac72be7	Re-enable SGD (#117434 ) Re-enables the SGD optimizer now that compile times are more reasonable. [Benchmark run](https://github.com/pytorch/pytorch/actions/runs/7511073761) Pull Request resolved: https://github.com/pytorch/pytorch/pull/117434 Approved by: https://github.com/anijain2305, https://github.com/janeyx99	2024-01-18 06:47:15 +00:00
Simon Fan	4b25948ee6	Torchbench Dynamo Runner: Enable DDP for perf test and traces (#113332 ) - Removes an outdated assert that prevents perf tests from running DDP, we now have single node --multiprocess and perf tests are already wrapping the model using `deepcopy_and_maybe_ddp` - Append rank name to traces to avoid all ranks trying to create the same file - Renames `deepcopy_and_maybe_ddp` to `deepcopy_and_maybe_parallelize` to include FSDP Pull Request resolved: https://github.com/pytorch/pytorch/pull/113332 Approved by: https://github.com/H-Huang, https://github.com/wconstab	2024-01-12 22:41:09 +00:00
Simon Fan	88bf84f106	[benchmark] add --compile-autograd to dynamo benchmarks (#117196 ) Adds `--compile-autograd` flag to benchmark suite to run accuracy and performance tests. Also adds autograd_captures and autograd_compiles to dynamo stats e.g. accuracy_inductor.csv ``` dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles cuda,BERT_pytorch,4,pass,2655,2,8,7,1,1 cuda,Background_Matting,4,pass_due_to_skip,0,0,0,0,0,0 cuda,DALLE2_pytorch,0,eager_fail_to_run,0,0,0,0,0,0 cuda,LearningToPaint,4,pass,639,2,8,7,1,1 ... ``` e.g. speedup_inductor.csv ``` dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles cuda,hf_T5,8,1.214311,136.236793,88.350570,0.751322,18.754706,24.962275,3298,2,8,8,1,1 cuda,hf_T5,8,1.226645,135.431856,52.461461,1.040973,18.754706,18.016508,795,1,7,7,0,0 ... ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117196 Approved by: https://github.com/jansel	2024-01-11 20:12:58 +00:00
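Since #117196 adds `autograd_captures` and `autograd_compiles` columns to the CSV output, a consumer could read them by header name. This is a minimal sketch using only the rows quoted in the message; the `summarize` helper is hypothetical and not part of the benchmark tooling.

```python
import csv
import io

# Rows copied from the accuracy_inductor.csv excerpt in the commit message.
ACCURACY_CSV = """\
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles
cuda,BERT_pytorch,4,pass,2655,2,8,7,1,1
cuda,Background_Matting,4,pass_due_to_skip,0,0,0,0,0,0
cuda,DALLE2_pytorch,0,eager_fail_to_run,0,0,0,0,0,0
cuda,LearningToPaint,4,pass,639,2,8,7,1,1
"""


def summarize(csv_text: str) -> dict:
    """Map model name -> (accuracy status, autograd_captures, autograd_compiles)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["name"]: (
            row["accuracy"],
            int(row["autograd_captures"]),
            int(row["autograd_compiles"]),
        )
        for row in reader
    }


summary = summarize(ACCURACY_CSV)
not_passing = [name for name, (status, _, _) in summary.items() if status != "pass"]
print(not_passing)
```

Reading by column name rather than position keeps such a script working if further stats columns are appended later.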
Bin Bao	7e9cbc6834	[CI] Catch more exception types when running eager in PT2 tests (#117120 ) Summary: https://github.com/pytorch/pytorch/actions/runs/7467073391/job/20320251143#step:16:1332 shows a case where model loading fails with KeyError but the error is not logged in the report csv file, which can cause an eager model failure silently ignored in the PT2 integration test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117120 Approved by: https://github.com/huydhn	2024-01-11 17:46:11 +00:00
Bin Bao	b8374314cc	[AOTI] Update AOTI runner util (#116971 ) Summary: Update the runner used in integration tests after https://github.com/pytorch/torchrec/pull/1604 Pull Request resolved: https://github.com/pytorch/pytorch/pull/116971 Approved by: https://github.com/chenyang78	2024-01-09 19:07:54 +00:00


409 Commits