This PR only adds running the benchmarks on the current PR and printing the results; follow-up diffs will add checking out head~1, running the benchmarks there, and comparing the two.
To access the results, go to the test pr_time_benchmarks and inspect the logs.
You should see:
```
+ echo 'benchmark results on current PR: '
benchmark results on current PR:
+ cat /var/lib/jenkins/workspace/test/test-reports/pr_time_benchmarks_before.txt
update_hint_regression,instruction_count,27971461254
```
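For illustration only, the follow-up comparison could parse two such result files and report the relative delta per benchmark. The helper below is a hypothetical sketch based on the `name,metric,value` format shown above; it is not code from this PR.
```python
def read_results(path: str) -> dict:
    # Parse lines of the form "benchmark_name,metric,value" into a dict.
    results = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            name, metric, value = line.split(",")
            results[(name, metric)] = float(value)
    return results

def compare(pr_path: str, base_path: str, threshold: float = 0.02) -> None:
    # Report the relative change of each benchmark shared by both files.
    pr, base = read_results(pr_path), read_results(base_path)
    for key in sorted(pr.keys() & base.keys()):
        delta = (pr[key] - base[key]) / base[key]
        flag = "REGRESSION" if delta > threshold else "ok"
        print(f"{key}: {delta:+.2%} ({flag})")
```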
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131475
Approved by: https://github.com/ezyang
## tl;dr
This PR adds GQA support to higher order op `flex_attention`.
## Details
When `enable_gqa` is set to True, the HOP `flex_attention(score_mod, query, key, value, block_mask, enable_gqa)` runs Grouped Query Attention (GQA), where the number of query heads (Hq) is a multiple of the number of key/value heads (Hkv). Each group of query heads (`Hq//Hkv` heads) attends to a shared kv head.
Otherwise, `flex_attention` assumes Multi Head Attention (MHA), where the number of query heads is equal to the number of kv heads.
The `score_mod` and `mask_mod` APIs are adapted accordingly to take `q_head` as the head index.
```
def score_mod(score: torch.Tensor, batch: torch.Tensor, q_head: torch.Tensor, token_q: torch.Tensor, token_kv: torch.Tensor) -> torch.Tensor
def mask_mod(batch: torch.Tensor, q_head: torch.Tensor, token_q: torch.Tensor, token_kv: torch.Tensor) -> torch.Tensor
```
## Example
```python
import torch
from torch.nn.attention.flex_attention import flex_attention
from torch.nn.attention.flex_attention import create_block_mask
torch.manual_seed(0)
def query_key_value_clones(
query: torch.Tensor,
key: torch.Tensor,
value: torch.Tensor,
dtype: torch.dtype = None,
):
"""Clones the query, key, and value tensors and moves them to the specified dtype."""
if dtype is None:
dtype = query.dtype
query_ref = query.clone().detach().to(dtype).requires_grad_(query.requires_grad)
key_ref = key.clone().detach().to(dtype).requires_grad_(key.requires_grad)
value_ref = value.clone().detach().to(dtype).requires_grad_(value.requires_grad)
return query_ref, key_ref, value_ref
# Let's create some input tensors.
# The input tensor has shape (batch_size, num_heads, seq_len, head_dim).
# query and key/value can have different num_heads and seq_len
# Here 8 query heads share one KV head.
query = torch.randn(2, 8, 2048, 64, device="cuda", dtype=torch.float32, requires_grad=True)
key = torch.randn(2, 2, 2048, 64, device="cuda", dtype=torch.float32, requires_grad=True)
value = torch.randn(2, 2, 2048, 64, device="cuda", dtype=torch.float32, requires_grad=True)
query1, key1, value1 = query_key_value_clones(query, key, value)
# Let's create a score modification. We take an alibi bias as an example.
# score_mod takes batch index, query head index, query index, and key/value index.
def _generate_alibi_bias(num_kv_heads: int, num_q_heads: int):
def _alibi_bias(
score: torch.Tensor,
b: torch.Tensor,
hq: torch.Tensor,
token_q: torch.Tensor,
token_kv: torch.Tensor,
) -> torch.Tensor:
# Let's calculate kv head from query head index
group = num_q_heads // num_kv_heads
hkv = hq // group
scale = torch.exp2(-((hkv + 1) * 8.0 / num_kv_heads))
return score + (token_kv - token_q) * scale
return _alibi_bias
# Let's apply a causal mask on top of it.
def causal_mask(b, h, q, kv):
return q >= kv
# Generate a block mask for our new mask_mod function.
# The mask is broadcast along the head & batch dimensions.
block_mask = create_block_mask(causal_mask, B=1, H=1, Q_LEN=2048, KV_LEN=2048)
# Let's call flex_attention with our new score modification and block mask in eager mode.
output = flex_attention(query, key, value, score_mod=_generate_alibi_bias(2, 8), block_mask=block_mask, enable_gqa=True)
# Now let's compile flex_attention and run the compiled kernel.
compiled_flex_attention = torch.compile(flex_attention)
out_compiled = compiled_flex_attention(query1, key1, value1, score_mod=_generate_alibi_bias(2, 8), block_mask=block_mask, enable_gqa=True)
torch.testing.assert_close(output, out_compiled, atol=5e-2, rtol=2e-2)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131559
Approved by: https://github.com/drisspg
Move benchmarking out of `torch._inductor.runtime.runtime_utils` and into `torch._inductor.runtime.benchmarking`, and prefer this path over directly accessing Triton's benchmarking utilities.
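As a rough sketch of the intended call pattern (the object and method names below are assumptions; check the module for the exact API), callers go through the shared benchmarking entry point rather than calling Triton's `do_bench` directly:
```python
# Old pattern: reach into Triton's benchmarking utilities directly.
# from triton.testing import do_bench
# ms = do_bench(lambda: kernel(*args))

# New pattern (sketch; exact names may differ): route through the shared
# benchmarking module so all of Inductor uses a single implementation.
from torch._inductor.runtime.benchmarking import benchmarker

def time_gpu_kernel(kernel, *args):
    # Assumed to return a latency in milliseconds, like Triton's do_bench.
    return benchmarker.benchmark_gpu(lambda: kernel(*args))
```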
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132827
Approved by: https://github.com/eellison
Summary:
By default, performance tests (speedup experiments) run the baseline and the test backend alternately.
However, this does not work for the torchao backend, which changes the model in place: the baseline run would then also use the torchao backend because the model has already been quantized.
Add a new experiment, "latency_experiment", that runs performance tests non-alternately (first run the baseline for a few iterations, then run the test backend), as sketched below.
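A minimal sketch of the difference (hypothetical helper names, not the actual benchmark runner code): the latency experiment times all baseline iterations before the in-place backend transformation is applied.
```python
import time

def median_latency(fn, iters=10):
    # Hypothetical timing helper: median wall-clock latency over `iters` runs.
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2]

def latency_experiment(model, example_inputs, apply_backend, iters=10):
    # Time the baseline first, while the model is still unmodified...
    baseline = median_latency(lambda: model(*example_inputs), iters)
    # ...then apply the in-place torchao transformation and time the test backend.
    apply_backend(model)
    test = median_latency(lambda: model(*example_inputs), iters)
    return baseline / test  # speedup of the test backend over the baseline
```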
Other changes:
* Add `torch.compiler.cudagraph_mark_step_begin()` to avoid the slowdown from "Unable to hit fast path of CUDAGraphs because of pending, uninvoked backwards" (see the usage sketch below).
* Update the torchao APIs to the current versions.
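For reference, the call goes at the top of each measured iteration; a minimal usage sketch (illustrative loop, not the runner's actual code):
```python
import torch

def run_measured_iterations(compiled_model, example_inputs, iters=10):
    for _ in range(iters):
        # Mark the beginning of a new iteration so CUDA graphs does not fall
        # off the fast path due to pending, uninvoked backwards.
        torch.compiler.cudagraph_mark_step_begin()
        out = compiled_model(*example_inputs)
    return out
```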
X-link: https://github.com/pytorch/benchmark/pull/2394
Test Plan:
python run_benchmark.py torchao --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
python run_benchmark.py torchao --only BartForCausalLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
python run_benchmark.py torchao --only timm_efficientnet --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
(should all be ~1.0)
0.997x
1.006x
0.994x
Reviewed By: xuzhao9
Differential Revision: D60252821
Pulled By: HDCharles
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131935
Approved by: https://github.com/xuzhao9
Summary: CPU CI nodes failed to find a valid VecISA because importing torch under the default pytorch directory fails with the following message, so we switch the cwd to a temporary directory.
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/var/lib/jenkins/workspace/torch/__init__.py", line 66, in <module>
from torch.torch_version import __version__ as __version__
File "/var/lib/jenkins/workspace/torch/torch_version.py", line 4, in <module>
from torch.version import __version__ as internal_version
ModuleNotFoundError: No module named 'torch.version'
```
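A simplified sketch of the workaround (names are illustrative, not the exact Inductor code): run the `import torch` probe from a temporary working directory so the source tree's `torch/` package does not shadow the installed one.
```python
import subprocess
import sys
import tempfile

def probe_isa(test_code: str) -> bool:
    # Run the probe from a temporary cwd so `import torch` does not pick up the
    # local ./torch source directory (which lacks the generated torch.version).
    with tempfile.TemporaryDirectory() as tmpdir:
        result = subprocess.run(
            [sys.executable, "-c", test_code],
            cwd=tmpdir,
            capture_output=True,
        )
    return result.returncode == 0
```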
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131812
Approved by: https://github.com/eellison, https://github.com/malfet
Summary:
Remove operator_benchmark caffe2 build due to the removal of caffe2: 2fd75667b4
Additionally, we delete the TARGETS files of broken benchmarks that we do not intend to maintain.
Test Plan: Sandcastle CI
Differential Revision: D60086216
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131460
Approved by: https://github.com/vmpuri
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2388
We can enable accuracy checks at Diff time since accuracy is not a performance metric.
* Refactor the existing diff time test to use the new PT2 Benchmark Runner.
* Deprecate the speedup tests and enable the accuracy tests only. We rely on ServiceLab to perform performance testing and regression detection.
Test Plan:
Sandcastle CI
Or buck test command:
```
buck2 test 'fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- test_training_resnet50_accuracy
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1688850102375429
Reviewed By: oulgen
Differential Revision: D59825601
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131266
Approved by: https://github.com/oulgen
Summary:
By default, performance tests (speedup experiments) run the baseline and the test backend alternately.
However, this does not work for the torchao backend, which changes the model in place: the baseline run would then also use the torchao backend because the model has already been quantized.
Add a new experiment, "latency_experiment", to run performance tests non-alternately (first run the baseline for a few iterations, then run the test backend).
Test Plan:
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16
```
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization autoquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
```
Differential Revision: D59332736
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130136
Approved by: https://github.com/jerryzh168
Extend constant folding to dynamic-shape nodes; only pointwise ops and some other restricted ops are supported.
We support dynamic shapes by limiting constant folding to ops that are guaranteed to produce uniform values (full, pointwise ops, and views) and running these operators with tensors of shape 1. This also eliminates the memory overhead of constant folding.
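To illustrate the idea (a conceptual sketch, not the Inductor pass itself): because every op in the folded chain produces a uniform value, it can be evaluated on a size-1 tensor and only the resulting scalar is kept, regardless of the dynamic shape.
```python
import torch

# A chain like full -> mul -> add yields a tensor whose elements are all equal,
# regardless of its (possibly dynamic) shape. It can therefore be folded by
# evaluating the same ops on a shape-[1] tensor and keeping only the scalar.
def fold_uniform_chain() -> float:
    x = torch.full((1,), 2.0)  # stand-in for full(dynamic_shape, 2.0)
    x = x * 3.0                # pointwise ops preserve uniformity
    x = x + 1.0
    return x.item()            # 7.0; no full-size constant is ever materialized

value = fold_uniform_chain()   # later re-expanded to the runtime shape if needed
```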
Taken over from https://github.com/pytorch/pytorch/pull/128937
joint work with @imzhuhl
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129686
Approved by: https://github.com/Chillee
ghstack dependencies: #130367
Compiling the `create_block_mask` function allows us to "materialize" extremely large masks. This would have been a 1 *trillion* element tensor if fully materialized.
```
print(do_bench(lambda: create_block_mask(causal_mask, 1, 1, 2**20, 2**20, _compiled=True)))
```
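For context, a more self-contained version of the snippet above might look like the following (assuming Triton's `do_bench` for timing and the causal mask from the earlier example):
```python
from triton.testing import do_bench
from torch.nn.attention.flex_attention import create_block_mask

def causal_mask(b, h, q, kv):
    return q >= kv

# With _compiled=True the mask construction itself goes through torch.compile,
# so a 2**20 x 2**20 logical mask can be built as a compact block mask.
ms = do_bench(lambda: create_block_mask(causal_mask, 1, 1, 2**20, 2**20, _compiled=True))
print(ms)
```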
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130106
Approved by: https://github.com/yanboliang
ghstack dependencies: #130160
This model's accuracy test recently regressed. I had a quite smooth debugging process figuring out the cause, so I'd like to write it down in case it is helpful.
Clicking the model name beit_base_patch16_224 on the dashboard, we can see the pass status of the model over, e.g., the past month. For this model, it started failing on June 08:
<img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e">
What's nice is the dashboard shows the nightly commits for each run.
Running
```
git log --oneline a448b3ae9537c0ae233fb9199a4a221fdffbb..0e6c204642a571d5a7cd60be0caeb9b50faca030 torch/_inductor/
```
Gives us the list of Inductor PRs between the good and bad commit: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df
Looking roughly through the PRs, I felt that
```
ffc202a1b91 Added remove_noop_ops to joint_graph_passes (#124451)
```
could change numerics, so I disabled it locally with this one-line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e . The accuracy test then passes. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224 )
Horace's PR (https://github.com/pytorch/pytorch/pull/124451) itself is valid: it removes no-op ops in the joint graph. I think the graph may get changed, causing the partitioner to make different recomputation decisions, which can change numerics slightly.
Since this is not a real issue, I'll raise the tolerance to make it pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130005
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #129996, #129941